September 02

Hi,

This is an InputRange combined with a RandomAccessRange; it also places a wchar in the otherwise unused upper 16 bits of each dchar.

Please criticize this code. Each element of a UTF string is stored as a dchar alongside its counterpart and read back as a wchar. The code is self-explanatory; do you think it's useful?

import std.stdio, std.algorithm;
import std.range, std.conv;

enum alphabets
{
  u = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ",
  l = "âabcçdefghıjklmnopqrstuvğiwxyzöşüçîû",

  ASCII65_95 = "AABCCDEFGHIJKLMNOPQRSTUVGIWXYZOSUCIU",
  ASCII96_127 = "aabccdefghijklmnopqrstuvgiwxyzosuciu",
  ASCII65_127 = ASCII65_95 ~ ASCII96_127
}

//enum dictU = alphabets.u.to!(wchar[]);
enum dictU = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ".to!(wchar[]);
enum dictL = alphabets.l.to!(wchar[]);

struct MixString(T, T[] leftLiterals)
{
  size_t index;
  dchar[] dict;

  this(string d)
  {
    // load dictionary
    foreach(dchar c; d) dict ~= c;

    // place counterparts
    foreach(i, wchar c; leftLiterals)
    {
      dict[i] |= c << 16;
    }
  }

  // input range functions
  bool empty() { return index == dict.length; }
  T front() { return dict[index] & 0x0000_FFFF; }
  void popFront() { ++index; }

  // search elements
  auto nextIndexOf(wchar key)
  {
    scope(exit) index = 0;

    size_t i = 1;
    while(!empty)
    {
      if(front == key)
      {
        return i;
      } else i++;
      popFront();
    }
    return 0;
  }
}
//
alias ConvUpper = MixString!(wchar, dictU);
alias ConvLower = MixString!(wchar, dictL);

void main()
{
  auto test = ConvUpper(alphabets.l);/*
  foreach(wchar c; test)
  {
    c.writefln!"%4X: %s"(c);
  }//*/

  string text = "fıstıkçı şâhap bir insandır!";
  foreach(wchar c; text)
  {
    if(auto result = test.nextIndexOf(c))
    {
      wchar lookup = test.dict[result - 1] >> 16;
      lookup.write;
    } else
      c.write;
  }
  writeln;
}
/*

FISTIKÇI ŞÂHAP BİR İNSANDIR!

*/

SDB@79

September 03
Let's see:

O(n) search for alphabet index

Limited tables that do not scale to other languages.

Tables limited to BMP.

Not particularly useful generally speaking, but with some improvements it may be useful in a limited capacity.



The search can be replaced with either a binary search (where the probability of a particular character is unknown) or a Fibonacci search (if the probabilities are known and favor the start of the range).
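
As a rough sketch of the binary search idea (not code from the thread): the pairs are sorted once by the lowercase code point, and each lookup then needs O(log n) comparisons instead of O(n). The names `Pair`, `buildTable` and `toUpperVia` are made up for this illustration, and the table holds only the 29 letters of the modern Turkish alphabet.

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.typecons : Tuple;

// Hypothetical pair type: a lowercase code point and its uppercase counterpart.
alias Pair = Tuple!(dchar, "lower", dchar, "upper");

/// Builds a table of (lower, upper) pairs sorted by the lowercase code point.
Pair[] buildTable(dstring lower, dstring upper)
{
    assert(lower.length == upper.length);
    Pair[] table;
    foreach (i, dchar c; lower)
        table ~= Pair(c, upper[i]);
    table.sort!((a, b) => a.lower < b.lower);
    return table;
}

/// Binary search for `c`; returns `c` unchanged when it is not in the table.
dchar toUpperVia(const Pair[] table, dchar c)
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (table[mid].lower < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < table.length && table[lo].lower == c) ? table[lo].upper : c;
}

void main()
{
    auto table = buildTable("abcçdefgğhıijklmnoöprsştuüvyz"d,
                            "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"d);
    dstring result;
    foreach (dchar c; "fıstıkçı"d)
        result ~= toUpperVia(table, c);
    writeln(result); // FISTIKÇI
}
```

With a few dozen entries the gain over a linear scan is small; the difference only starts to matter as the table grows.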

Such tables would typically be implemented using a multi-level trie, with the lookup being O(1). It costs more ROM, but is well worth it for the speed.
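
To make the trie idea concrete, here is a minimal two-stage table, offered as a sketch under assumptions rather than the standard implementation. The names `CaseTrie`, `lookup`, `blockBits`, `blockSize` and `blockCount` are hypothetical, the block size of 256 is an arbitrary choice, and the table is built at run time for brevity. Production tables are usually generated ahead of time and also share identical blocks, which this sketch skips.

```d
import std.stdio : writeln;

enum blockBits  = 8;                      // 256 code points per block (arbitrary choice)
enum blockSize  = 1 << blockBits;
enum blockCount = 0x110000 >> blockBits;  // enough blocks for U+0000 .. U+10FFFF

struct CaseTrie
{
    ushort[] stage1;  // block number -> index of a block inside stage2
    uint[]   stage2;  // concatenated blocks of code points; 0 means "no mapping"

    this(dstring source, dstring target)
    {
        assert(source.length == target.length);
        stage1 = new ushort[blockCount];  // zero-filled: every block starts out empty
        stage2 = new uint[blockSize];     // block 0 is the shared all-empty block
        foreach (i, dchar c; source)
        {
            immutable block = c >> blockBits;
            if (stage1[block] == 0)
            {
                // first mapping in this block: append a fresh zero-filled block
                stage1[block] = cast(ushort)(stage2.length / blockSize);
                stage2.length = stage2.length + blockSize;
            }
            stage2[stage1[block] * blockSize + (c & (blockSize - 1))] = target[i];
        }
    }

    /// Two array reads, independent of how many mappings are stored.
    dchar lookup(dchar c) const
    {
        if (c > 0x10FFFF)
            return c;  // outside the Unicode code space
        immutable mapped = stage2[stage1[c >> blockBits] * blockSize + (c & (blockSize - 1))];
        return mapped ? cast(dchar) mapped : c;
    }
}

void main()
{
    auto upper = CaseTrie("abcçdefgğhıijklmnoöprsştuüvyz"d,
                          "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"d);
    dstring result;
    foreach (dchar c; "fıstıkçı şahap"d)
        result ~= upper.lookup(c);
    writeln(result); // FISTIKÇI ŞAHAP
}
```

The first stage costs a few kilobytes even when only two blocks are populated, which is the memory-for-speed trade mentioned above.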

Unicode Demystified covers the standard method for doing this sort of lookup as well as how to do the case conversion correctly. https://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522
September 03

On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:

> Let's see:
>
> O(n) search for alphabet index

I don't think speed is a big issue: an old text of about a thousand pages, using possibly 47 different letters (kutadgu-bilig-fergana-holograph.txt, ~2 MB), is converted in under 1 second. That time includes reading from the file, finding the counterparts, and writing to the file.

For example:

enum abece
{
  b = "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
  k = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
  ele = "gusiocCOISUG".to!(wchar[])
}

void main()
{
  alias MSbyk = MixString!(wchar, abece.b);
  enum bütünSözlük = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; // abece.k.to!string;
  auto büyük = MSbyk(bütünSözlük);

  // Source: https://archive.org/download/kutadgu-bilig-fergana-nushasi/681053_djvu.txt
  auto dosya = File("KutadguBilig.txt", "r");
  while (!dosya.eof)
  {
    foreach(wchar c; dosya.readln)
    {
      if(auto result = büyük.nextIndexOf(c))
      {
        wchar lookup = büyük.dict[result - 1] >> 16;
        lookup.write;
      } else {
        c.write;
      }
    }
    writeln;
  }
} /*
pico@enpi:~/Projeler/NewLookup$ time ./newLookupTable > result.txt

real  0m0,875s
user  0m0,859s
sys   0m0,016s
*/

On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:

> Unicode Demystified covers the standard method for doing this sort of lookup as well as how to do the case conversion correctly. https://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522

Thank you, I will read the book you mentioned.

SDB@79

September 03

On Sunday, 3 September 2023 at 10:36:58 UTC, Salih Dincer wrote:

> For example:
>
> enum abece
> {
>   b = "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
>   k = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
>   ele = "gusiocCOISUG".to!(wchar[])
> }
>
> void main()
> {
>   alias MSbyk = MixString!(wchar, abece.b);
>   enum bütünSözlük = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; // abece.k.to!string;

I wonder why I can't use abece.k directly. The error it gives is as follows:

> core.exception.ArrayIndexError@newLookupTable.d(39):
> index [1] is out of bounds for array of length 1

SDB@79
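
A possible explanation, which the thread itself does not confirm: `std.conv.to` converts a value of a named enum type to the member's name, not to the stored value. If that is what happens here, `abece.k.to!string` yields "k", the dictionary gets length 1, and the loop over the 47 counterparts fails at index 1. The sketch below uses a hypothetical `Abece` enum with a plain string base type to show the behavior.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical stand-in for `abece`, with a plain string base type.
enum Abece : string
{
    k = "aeı",
}

void main()
{
    // A value of a named enum type is stringized as the member NAME.
    string name = Abece.k.to!string;
    writeln(name);        // k
    writeln(name.length); // 1

    // Casting to the base type first exposes the stored characters.
    string value = cast(string) Abece.k;
    writeln(value);       // aeı
}
```

Under that assumption, casting to the base type before converting (or keeping the alphabet in a plain `enum` constant, as `bütünSözlük` does) would avoid the out-of-bounds error.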


September 04
On 03/09/2023 10:36 PM, Salih Dincer wrote:
> I don't think speed is a big issue because a thousand pages and possibly 47 letters old text (kutadgu-bilig-fergana-holograph.txt: ~ 2 MB.) is completed in under 1 second.

Yeah your lookup table is small enough that it won't matter.

The problem is that it won't scale. Unicode's code space as a whole runs up to 0x10FFFF, with the first plane (the BMP) being 64k. Imagine trying to throw hardware at those sorts of numbers.
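
To put rough numbers on this, the sketch below works out the worst case for a linear scan. It assumes a hypothetical table covering every code point and a text of about two million characters (roughly one per byte of the 2 MB file); neither figure comes from a measurement in the thread.

```d
import std.stdio : writefln;

enum ulong codeSpace  = 0x110000;   // code points U+0000 .. U+10FFFF
enum ulong bmpSize    = 0x10000;    // the first plane (BMP)
enum ulong textLength = 2_000_000;  // assumed: about one code point per byte of a 2 MB file

void main()
{
    writefln("code points in Unicode: %s", codeSpace);              // 1114112
    writefln("code points in the BMP: %s", bmpSize);                // 65536
    // worst case for a linear scan: every character walks the whole table
    writefln("worst-case comparisons: %s", textLength * codeSpace); // 2228224000000
}
```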