January 16
On Tue, Jan 16, 2024 at 09:15:19PM +0000, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 20:34:48 UTC, H. S. Teoh wrote:
> > On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...]
> > > Anyway, I've fixed the problem, now my program produces the exact same output as Renato's repo. Code is posted below.
> > [...]
> > 
> 
> Great, I ran the benchmarks for you :)
> 
> I had to change how you accept arguments, even though you did "the right thing" using `getopt`, the solutions should just take a `count` or `print` argument first...

Oops, haha :-P


> Anyway, here's your result:
> 
> ```
> ===> ./rust
> ./rust,24133632,25
> ./rust,24739840,130
> ./rust,24477696,536
> ./rust,25247744,1064
> ./rust,8175616,6148
> ./rust,8306688,8315
> ===> src/d/dencoder
> src/d/dencoder,46055424,43
> src/d/dencoder,96337920,146
> src/d/dencoder,102350848,542
> src/d/dencoder,102268928,1032
> src/d/dencoder,40206336,99936
> ^C
> ```
> 
> It took too long with the `count` option, so I had to abort before the last run ended... there's probably some bug there, otherwise the Trie runs very fast, as I had expected.
[...]

Do you have the problematic data file handy?  I'd like to look into any potential bugs.

Also, the profiler revealed that a lot of time was spent in the GC and in small allocations.  The cause is in all likelihood the .format() call for each found match, and array append being used for the recursive calls. Getting rid of the .format ought to speed it up a bit. Will try that now...


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
January 16

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

>

For the record (I already posted this on GitHub)... here's my current fastest solution time using the same algorithm as Rust ...

[...]

>

... what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"...

It's a GC allocations fest. Things like this make it slow:

     {
-        string digit = [digits[0]];
+        string digit = digits[0 .. 1];
         words.insertBack(digit);

And at the top is the associative array lookup (when profiling the handling of the "phones_1_000_000_with_empty.txt" input file):

    36.85%  dencoder  dencoder              [.] _aaInX
    12.38%  dencoder  dencoder              [.] void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)
     6.43%  dencoder  dencoder              [.] nothrow @nogc scope void core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange)
     4.53%  dencoder  dencoder              [.] pure @safe void std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult.popFront()
     4.08%  dencoder  dencoder              [.] _d_array_slice_copy
     2.43%  dencoder  dencoder              [.] _d_newarrayU
     2.21%  dencoder  dencoder              [.] shared nothrow @nogc @trusted void core.internal.spinlock.SpinLock.lock()
     2.00%  dencoder  dencoder              [.] pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(dchar).put(dchar)
     1.91%  dencoder  dencoder              [.] nothrow void* core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo))
     1.67%  dencoder  dencoder              [.] _DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_
     1.66%  dencoder  dencoder              [.] pure nothrow @nogc scope @trusted ulong core.internal.gc.bits.GCBits.setLocked(ulong)
     1.60%  dencoder  dencoder              [.] const pure nothrow @trusted ulong object.TypeInfo_Struct.getHash(scope const(void*))
     1.53%  dencoder  dencoder              [.] nothrow @safe void core.internal.util.array.enforceRawArraysConformable(const(char[]), const(ulong), const(void[]), const(void[]), const(bool))
     1.49%  dencoder  dencoder              [.] nothrow void* core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo))
     1.27%  dencoder  dencoder              [.] pure nothrow @safe void std.array.Appender!(immutable(char)[]).Appender.ensureAddable(ulong)
     0.81%  dencoder  dencoder              [.] pure @property @safe dchar std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult.front()
     0.79%  dencoder  dencoder              [.] nothrow @safe void core.internal.util.array._enforceNoOverlap(const(char[]), ulong, ulong, const(ulong))
     0.74%  dencoder  dencoder              [.] nothrow void core.internal.gc.impl.conservative.gc.Pool.setBits(ulong, uint)
     0.73%  dencoder  dencoder              [.] pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult).put(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult)
     0.70%  dencoder  dencoder              [.] pure @safe ulong std.range.primitives.walkLength!(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult).walkLength(std.algorithm.iteration.FilterResult!(dencoder.main(immutable(char)[][]).__lambda12, immutable(char)[]).FilterResult)
     0.60%  dencoder  dencoder              [.] bool dencoder.lastItemIsDigit(std.container.array.Array!(immutable(char)[]).Array)
     0.54%  dencoder  dencoder              [.] _d_newarrayT
[...]
January 16
On Tue, Jan 16, 2024 at 10:15:04PM +0000, Siarhei Siamashka via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:
[...]
> > ... what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"...
> 
> It's a GC allocations fest.

Indeed.

I have just completed 2 rounds of optimizations of my version of the code, and both times the profiler also showed the problem to be excessive allocations in the inner loop.  So, I did the following optimizations:

1) Get rid of .format in the inner loop. Not only does .format cause a
lot of allocations, it is also a known performance hog.  So instead
of constructing the output string in the search function, I changed it
to take a delegate instead, and the delegate either counts the result or
prints it directly (bypassing the construction of an intermediate
string).  This improved performance quite a bit for the count-only runs,
but also wins some performance even when output is generated.  Overall,
however, this optimization only gave me some minor savings.

2) Changed the `path` parameter from string[] to string, since I didn't really need it to be an array of strings anyway. This in itself only improved performance marginally, barely noticeable, but it led to (3), which gave a huge performance boost.

3) Basically, in the earlier version of the code, the `path` parameter was appended to every time I recursed, and furthermore the same initial segment gets appended to many times with different trailers as the algorithm walks the trie. As a result, this triggers a lot of array reallocations to store the new strings.  Most of these allocations are unnecessary, because we already know that the initial segment of the string will stay constant, only the tail end changes. Furthermore, we only ever have a single instance of .path at any point in time in the algorithm.  So we could use a single buffer to hold all of these instances of .path, and simply return slices to it as we go along, overwriting the tail end each time we need to append something.

This significantly cut down on the number of allocations, and along
with (1) and (2), performance improved by about 3x (!).  It didn't
completely remove all allocations, but I'm reasonably happy with the
performance now that I probably won't try to optimize it more unless
it's still losing out to another language. ;-)

(I'm especially curious to see if this beats the Rust version. :-P)

Optimized version of the code:

-----------------------------------snip------------------------------------
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }

        if (suffix.empty)
        {
            words ~= word;
            return;
        }

        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the Trie.
     * However, the original form of the word will be retained as-is in the
     * leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R & is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je",
        "105513: jemand",
        "107: neu",
        "1550: Name",
        "253302: Wasser",
        "35: da",
        "38: so",
        "400: Fee",
        "4021: fern",
        "4034: Fest",
        "482: Tor",
        "4824: fort",
        "4824: Torf",
        "51: an",
        "562: mir",
        "562: Mix",
        "56202: Mixer",
        "78: Bo\"",
        "783: bo\"s",
        "7857: blau",
        "7884: Boot",
        "824: Ort",
        "83: o\"d"
    ]);
}

alias MatchCallback = void delegate(const(char)[] phone, const(char)[] match);

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(Trie dict, const(char)[] phoneNumber, MatchCallback cb)
{
    /*
     * Optimization: use a common buffer for constructing the output string,
     * instead of using string append, which allocates many new strings.
     *
     * The `path` parameter passed to .impl is always a slice of .buffer,
     * either its current instance or a previous instance left over from a
     * buffer reallocation. As such, its contents are always an initial segment
     * of whatever's in .buffer, so appending is just a matter of writing to
     * the tail end of the buffer and returning a slice of the new buffer.
     *
     * The fact that .path higher up the call tree may be pointing to old
     * versions of .buffer is not a problem; contentwise they are always an
     * initial segment of the current .buffer, so any subsequent appends will
     * copy the new word to the right place in the current .buffer and return a
     * slice to it, and the contents will always be consistent.
     */
    static char[] buffer;
    const(char)[] appendPath(const(char)[] path, const(char)[] word)
    {
        // Assumption: path is an initial segment of buffer (either its current
        // incarnation or a pre-reallocated initial copy). So we don't need to
        // copy the initial segment, just whatever needs to be appended.
        if (path.length == 0)
        {
            auto newlen = word.length;
            if (buffer.length < newlen)
                buffer.length = newlen;
            buffer[0 .. newlen] = word[];
            return buffer[0 .. newlen];
        }
        else
        {
            auto newlen = path.length + 1 + word.length;
            if (buffer.length < newlen)
                buffer.length = newlen;
            buffer[path.length] = ' ';
            buffer[path.length + 1 .. newlen] = word[];
            return buffer[0 .. newlen];
        }
    }

    bool impl(Trie node, const(char)[] suffix, const(char)[] path,
              bool allowDigit)
    {
        if (node is null)
            return false;

        // Ignore non-digit characters in phone number
        while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
            suffix = suffix[1 .. $];

        if (suffix.empty)
        {
            // Found a match, print result
            foreach (word; node.words)
            {
                cb(phoneNumber, appendPath(path, word));
            }
            return !node.words.empty;
        }

        bool ret;
        foreach (word; node.words)
        {
            // Found a matching word, try to match the rest of the phone
            // number.
            ret = true;
            if (impl(dict, suffix, appendPath(path, word), true))
                allowDigit = false;
        }

        if (impl(node.edges[suffix[0] - '0'], suffix[1 .. $], path, false))
        {
            allowDigit = false;
            ret = true;
        }

        if (allowDigit)
        {
            // If we got here, it means that if we take the current node as the
            // next word choice, then the following digit will have no further
            // matches, and we may encode it as a single digit.
            auto nextSuffix = suffix[1 .. $];
            if (nextSuffix.empty)
            {
                cb(phoneNumber, appendPath(path, suffix[0 .. 1]));
                ret = true;
            }
            else
            {
                if (impl(dict, suffix[1 .. $],
                         appendPath(path, suffix[0 .. 1]), false))
                    ret = true;
            }
        }
        return ret;
    }

    // Trim trailing non-digits from phone number
    auto suffix = phoneNumber[];
    while (!suffix.empty && (suffix[$-1] < '0' || suffix[$-1] > '9'))
    {
        suffix = suffix[0 .. $-1];
    }

    impl(dict, suffix, buffer[0 .. 0], true);
}

/**
 * Encode the given input range of phone numbers according to the given
 * dictionary, writing the output to the given sink.
 */
void encodePhoneNumbers(R)(R input, Trie dict, MatchCallback cb)
    if (isInputRange!R & is(ElementType!R : const(char)[]))
{
    foreach (line; input)
    {
        findMatches(dict, line, cb);
    }
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto input = [
        "112",
        "5624-82",
        "4824",
        "0721/608-4067",
        "10/783--5",
        "1078-913-5",
        "381482",
        "04824",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (phone, match) {
        app.put(format("%s: %s", phone, match));
    });

    //writefln("\n%-(%s\n%)", app.data);
    assert(app.data.sort.release == [
        "04824: 0 Tor 4",
        "04824: 0 Torf",
        "04824: 0 fort",
        "10/783--5: je Bo\" da",
        "10/783--5: je bo\"s 5",
        "10/783--5: neu o\"d 5",
        "381482: so 1 Tor",
        "4824: Tor 4",
        "4824: Torf",
        "4824: fort",
        "5624-82: Mix Tor",
        "5624-82: mir Tor",
    ]);
}

unittest
{
    auto dict = loadDictionary(q"ENDDICT
Bias
ja
Mai
Reck
Weib
USA
ENDDICT".splitLines);

    auto input = [
        "/7-357653152/0677-",
        "/7-3576-",
        "/8-",
        "1556/0",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (phone, match) {
        app.put(format("%s: %s", phone, match));
    });

    //writefln("\n%-(%s\n%)", app.data);
    assert(app.data.sort.release == [
        "/7-357653152/0677-: USA Bias ja Reck 7",
        "/7-357653152/0677-: USA Bias ja Weib 7",
        "/8-: 8",

        /* Note: 1556/0 should NOT encode as "1 Mai 0" because the initial "15"
         * matches "ja", thus excluding a digit in that position. */
    ]);
}

/**
 * Program entry point.
 */
int main(string[] args)
{
    File input = stdin;
    bool countOnly;

    auto info = getopt(args,
        "c|count", "Count solutions only", &countOnly,
    );

    int showHelp()
    {
        stderr.writefln("Usage: %s [options] <dictfile> [<inputfile>]",
                        args[0]);
        defaultGetoptPrinter("", info.options);
        return 1;
    }

    if (info.helpWanted || args.length < 2)
        return showHelp();

    auto dictfile = args[1];
    if (args.length > 2)
        input = File(args[2]);

    Trie dict = loadDictionary(File(dictfile).byLine);

    if (countOnly)
    {
        size_t count;
        encodePhoneNumbers(input.byLine, dict, (phone, match) { count++; });
        writefln("Number of solutions: %d", count);
    }
    else
    {
        encodePhoneNumbers(input.byLine, dict, (phone, match) {
            writefln("%s: %s", phone, match);
        });
    }

    return 0;
}
-----------------------------------snip------------------------------------


T

-- 
People tell me that I'm paranoid, but they're just out to get me.
January 17

On Tuesday, 16 January 2024 at 22:15:04 UTC, Siarhei Siamashka wrote:

>

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

>

For the record (I already posted this on GitHub)... here's my current fastest solution time using the same algorithm as Rust ...

[...]

>

... what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"...

It's a GC allocations fest. Things like this make it slow:

     {
-        string digit = [digits[0]];
+        string digit = digits[0 .. 1];
         words.insertBack(digit);

I was under the impression that [digits[0]] would just use a stack allocation??

The profiler does not show any GC anymore, are you sure it's a "GC allocations fest"???

>

And at the top is the associative array lookup (when profiling the handling of the "phones_1_000_000_with_empty.txt" input file):

    36.85%  dencoder  dencoder              [.] _aaInX
    12.38%  dencoder  dencoder              [.] void

Well, I know, but I did everything to make the hash "cheap" to compute so I still don't see a way to improve it.

January 17

On Tuesday, 16 January 2024 at 23:43:16 UTC, H. S. Teoh wrote:

>

(I'm especially curious to see if this beats the Rust version. :-P)

That's just irrelevant when you're using a completely different algorithm.

My conclusion in the study was exactly that: much more important than the language is the algorithm. The language does matter, but much, much less.

If you want to check your performance, you know you can run the ./benchmark.sh yourself?

January 17

On Tuesday, 16 January 2024 at 22:13:55 UTC, H. S. Teoh wrote:

>

used for the recursive calls. Getting rid of the .format ought to speed it up a bit. Will try that now...

That will make no difference for the count option which is where your solution was very slow. To run the slow test manually use the words_quarter.txt dictionary (the phone numbers file doesn't matter much - it's all in the dictionary).

But pls run the benchmarks yourself as I am not going to keep running it for you, and would be nice if you posted your solution on a Gist for example, pasting lots of code in the forum makes it difficult to follow.

January 17

On Wednesday, 17 January 2024 at 07:11:02 UTC, Renato wrote:

>

If you want to check your performance, you know you can run the ./benchmark.sh yourself?

Out of curiosity I've tried to manually run this on Windows and it seems that Java generator for these numbers files is "broken", the resulting count or print runs fine for both Java and D versions provided in your D branch, but fails with generated files.

D version complains about bad utf8 encoding.
I've opened the generated file in text editor and it is UTF-16 (little-endian with BOM).

Tried with adoptium jdk 17 and 21 (former openjdk), but I guess it doesn't matter since UTF-16 is default on Windows.

January 17

On Wednesday, 17 January 2024 at 07:06:25 UTC, Renato wrote:

>

On Tuesday, 16 January 2024 at 22:15:04 UTC, Siarhei Siamashka wrote:

>

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

It's a GC allocations fest. Things like this make it slow:

     {
-        string digit = [digits[0]];
+        string digit = digits[0 .. 1];
         words.insertBack(digit);

I was under the impression that [digits[0]] would just use a stack allocation??

The profiler does not show any GC anymore, are you sure it's a "GC allocations fest"???

nah, you are allocating new array out of single digit while the latter is just takes a slice.

there is 'scope' storage specifier for when you know your variable won't escape the scope to place it on stack (works for classes too), but I'm not sure if it will work with array.

scope string digit = [digits[0]];

January 17

On Wednesday, 17 January 2024 at 09:15:02 UTC, evilrat wrote:

>

On Wednesday, 17 January 2024 at 07:11:02 UTC, Renato wrote:

>

If you want to check your performance, you know you can run the ./benchmark.sh yourself?

Out of curiosity I've tried to manually run this on Windows and it seems that Java generator for these numbers files is "broken", the resulting count or print runs fine for both Java and D versions provided in your D branch, but fails with generated files.

D version complains about bad utf8 encoding.
I've opened the generated file in text editor and it is UTF-16 (little-endian with BOM).

Tried with adoptium jdk 17 and 21 (former openjdk), but I guess it doesn't matter since UTF-16 is default on Windows.

It's not Java writing the file, it's the bash script benchmark.sh:

java -cp "build/util" util.GeneratePhoneNumbers 1000 > phones_1000.txt

Java is just printing to stdout. I wasn't aware that piping like this would use the OS default encoding. Unfortunately I don't have Windows here, but try to change the encoding used by the bash script maybe?

January 17

On Wednesday, 17 January 2024 at 10:24:31 UTC, Renato wrote:

>

It's not Java writing the file, it's the bash script benchmark.sh:

java -cp "build/util" util.GeneratePhoneNumbers 1000 > phones_1000.txt

Perhaps using this option when running Java will help:

java -DFile.Encoding=UTF-8 ...