New Lookup Table (MixString)

Sep 02, 2023

Salih Dincer

Sep 02, 2023

Richard (Rikki) Andrew Cattermole

Sep 03, 2023

Salih Dincer

Sep 03, 2023

Salih Dincer

Sep 03, 2023

Richard (Rikki) Andrew Cattermole

import std.stdio, std.algorithm;
import std.range, std.conv;

enum alphabets {
  u = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ",
  l = "âabcçdefghıjklmnopqrstuvğiwxyzöşüçîû",
  ASCII65_95 = "AABCCDEFGHIJKLMNOPQRSTUVGIWXYZOSUCIU",
  ASCII96_127 = "aabccdefghijklmnopqrstuvgiwxyzosuciu",
  ASCII65_127 = ASCII65_95 ~ ASCII96_127
}

//enum dictU = alphabets.u.to!(wchar[]);
enum dictU = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ".to!(wchar[]);
enum dictL = alphabets.l.to!(wchar[]);

struct MixString(T, T[] leftLiterals)
{
  size_t index;
  dchar[] dict;

  this(string d)
  {
    // load dictionary
    foreach(dchar c; d) dict ~= c;

    // place counterparts
    foreach(i, wchar c; leftLiterals)
    {
      dict[i] |= c << 16;
    }
  }

  // input range functions
  bool empty() { return index == dict.length; }
  T front() { return dict[index] & 0x0000_FFFF; }
  void popFront() { ++index; }

  // search elements
  auto nextIndexOf(wchar key)
  {
    scope(exit) index = 0;
    size_t i = 1;
    while(!empty)
    {
      if(front == key)
      {
        return i;
      } else i++;
      popFront();
    }
    return 0;
  }
}
//

alias ConvUpper = MixString!(wchar, dictU);
alias ConvLower = MixString!(wchar, dictL);

void main()
{
  auto test = ConvUpper(alphabets.l);/*
  foreach(wchar c; test)
  {
    c.writefln!"%4X: %s"(c);
  }//*/

  string text = "fıstıkçı şâhap bir insandır!";
  foreach(wchar c; text)
  {
    if(auto result = test.nextIndexOf(c))
    {
      wchar lookup = test.dict[result - 1] >> 16;
      lookup.write;
    } else c.write;
  }
  writeln;
} /*
FISTIKÇI ŞÂHAP BİR İNSANDIR!
*/
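The table in the code above stores each pair in a single 32-bit slot: the search key sits in the low 16 bits and its counterpart in the high 16 bits. The following is a minimal sketch of that packing in isolation; the helper names `pack`, `keyOf` and `counterpartOf` are hypothetical and are not part of the posted code.

```d
import std.stdio : writefln;

// Hypothetical helper: key in the low 16 bits, counterpart in the high 16 bits.
uint pack(wchar key, wchar counterpart)
{
    return key | (cast(uint) counterpart << 16);
}

// Recover the two halves of a packed slot.
wchar keyOf(uint slot)         { return cast(wchar)(slot & 0xFFFF); }
wchar counterpartOf(uint slot) { return cast(wchar)(slot >> 16); }

void main()
{
    immutable slot = pack('ı', 'I');
    writefln("slot = %08X, key = %04X, counterpart = %04X",
             slot, cast(uint) keyOf(slot), cast(uint) counterpartOf(slot));
}
```

Because each half is only 16 bits wide, this layout can only hold BMP characters, and a packed slot is generally not a valid code point, so it should be unpacked before it is printed as text.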

Let's see:

O(n) search for the alphabet index.

Limited tables that do not scale to other languages.

Tables limited to the BMP.

Not particularly useful generally speaking, but with some improvements it may be useful in a limited capacity.

The search can be replaced with either a binary search (where the probability of a particular character is unknown) or a Fibonacci search (if the probability is known, with a preference towards the start of the ranges).

Typically such tables would be implemented using a multi-level trie, with the lookup being O(1). That costs more ROM, but is well worth it for the speed.

Unicode Demystified covers the standard method for doing this sort of lookup as well as how to do the case conversion correctly. https://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522
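The binary search suggestion can be sketched roughly as follows. This is an illustration only, not the code from the thread: the `CasePair` type and the `buildTable` and `convert` functions are assumed names, and the sample alphabet is a small subset chosen for the example.

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;

// Hypothetical pair type: a character and its counterpart.
struct CasePair
{
    wchar key;
    wchar value;
}

// Build a table sorted by key so that it can be binary searched.
CasePair[] buildTable(wstring keys, wstring values)
{
    assert(keys.length == values.length);
    CasePair[] table;
    foreach (i, wchar k; keys)
        table ~= CasePair(k, values[i]);
    table.sort!((a, b) => a.key < b.key);
    return table;
}

// O(log n) lookup; returns the character unchanged when it has no entry.
wchar convert(const CasePair[] table, wchar c)
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (table[mid].key < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < table.length && table[lo].key == c) ? table[lo].value : c;
}

void main()
{
    auto table = buildTable("abcçğıiöşü"w, "ABCÇĞIİÖŞÜ"w);
    wstring result;
    foreach (wchar c; "çığ, şu işi bit"w)
        result ~= convert(table, c);
    writeln(result);
}
```

Sorting is done once when the table is built, so each lookup afterwards costs O(log n) comparisons instead of a linear scan.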

September 03, 2023

Re: New Lookup Table (MixString)

Posted by Salih Dincer
in reply to Richard (Rikki) Andrew Cattermole

On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:

Let's see:

O(n) search for alphabet index

I don't think speed is a big issue: an old text of about a thousand pages that uses some 47 letters (kutadgu-bilig-fergana-holograph.txt, ~2 MB) is converted in under 1 second. That time includes reading from the file, finding the counterparts, and writing to the file.

For example:

import std.stdio, std.conv; // MixString is the struct from the first post

enum abece
{
  b = "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
  k = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
  ele = "gusiocCOISUG".to!(wchar[])
}

void main()
{
  alias MSbyk = MixString!(wchar, abece.b);
  enum bütünSözlük = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; // abece.k.to!string;
  auto büyük = MSbyk(bütünSözlük);

  // Source: https://archive.org/download/kutadgu-bilig-fergana-nushasi/681053_djvu.txt
  auto dosya = File("KutadguBilig.txt", "r");
  while (!dosya.eof)
  {
    foreach(wchar c; dosya.readln)
    {
      if(auto result = büyük.nextIndexOf(c))
      {
        wchar lookup = büyük.dict[result - 1] >> 16;
        lookup.write;
      } else {
        c.write;
      }
    }
    writeln;
  }
} /*
pico@enpi:~/Projeler/NewLookup$ time ./newLookupTable > result.txt

real  0m0,875s
user  0m0,859s
sys   0m0,016s
*/

On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:

Unicode Demystified covers the standard method for doing this sort of lookup as well as how to do the case conversion correctly. https://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522

Thank you, I will read the book you mentioned.

SDB@79

On 03/09/2023 10:36 PM, Salih Dincer wrote:

> I don't think speed is a big issue because a thousand pages and possibly 47 letters old text (kutadgu-bilig-fergana-holograph.txt: ~ 2 MB.) is completed in under 1 second.

Yeah, your lookup table is small enough that it won't matter.

The problem is that it won't scale. Unicode's code space as a whole runs up to 0x10FFFF, with the first plane (the BMP) alone holding 64k code points. Imagine trying to throw hardware at those sorts of numbers.
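To make the scaling point concrete, here is a rough sketch of a two-stage table, the simplest form of the multi-level trie mentioned earlier in the thread. It is an assumption-laden illustration rather than the method from Unicode Demystified or from any library: the `TwoStageTable` type, the 256-entry block size, and the sample alphabet are all made up for the example.

```d
import std.stdio : writeln;

enum blockBits    = 8;                                // 256 code points per block
enum blockSize    = 1 << blockBits;
enum maxCodePoint = 0x10FFFF;
enum blockCount   = (maxCodePoint >> blockBits) + 1;  // 0x1100 blocks

// Hypothetical two-stage table: stage1 picks a block, stage2 holds the values.
struct TwoStageTable
{
    ushort[] stage1;  // block number -> index of a block inside stage2
    uint[]   stage2;  // mapped code points; 0 means "no mapping"

    this(const dchar[] from, const dchar[] to)
    {
        assert(from.length == to.length);
        stage1 = new ushort[blockCount];  // every entry starts at shared empty block 0
        stage2 = new uint[blockSize];     // block 0 stays all zero
        foreach (i, dchar c; from)
        {
            immutable block = c >> blockBits;
            if (stage1[block] == 0)
            {
                // First mapping that falls in this block: give it its own storage.
                stage1[block] = cast(ushort)(stage2.length / blockSize);
                stage2.length += blockSize;
            }
            stage2[stage1[block] * blockSize + (c & (blockSize - 1))] = to[i];
        }
    }

    // O(1) lookup: two array reads, independent of the table size.
    dchar opIndex(dchar c) const
    {
        if (c > maxCodePoint)
            return c;
        immutable mapped =
            stage2[stage1[c >> blockBits] * blockSize + (c & (blockSize - 1))];
        return mapped ? cast(dchar) mapped : c;
    }
}

void main()
{
    auto upper = TwoStageTable("abcçğıiöşü"d, "ABCÇĞIİÖŞÜ"d);
    dstring result;
    foreach (dchar c; "çığ, şu işi bit"d)
        result ~= upper[c];
    writeln(result);
}
```

Blocks with no mappings all share the single empty block, so the memory cost grows with the number of 256-code-point blocks that actually contain mappings rather than with the full 0x10FFFF range, while every lookup stays at two array reads.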
