Help optimize D solution to phone encoding problem: extremely slow performace. (page 2)

Settings

Help

Index » Learn » Help optimize D solution to phone encoding problem: extremely slow performace. (page 2)

January 15

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by Renato
in reply to Sergey

Permalink

Renato

Posted in reply to Sergey

Permalink

On Monday, 15 January 2024 at 01:10:14 UTC, Sergey wrote:

On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:

If anyone can find any flaw in my methodology or optmise my code so that it can still get a couple of times faster, approaching Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in betterC style?!

I've added port from Rust in the PR comment. Can you please check this solution?
Most probably it need to be optimized with profiler. Just interesting how close-enough port will work.

As discussed on GitHub, the line-by-line port of the Rust code is 5x slower than my latest solution using int128, which is itself 3 to 4x slower than the Rust implementation (at around the same order of magnitude as algorithm-equivalent Java and Common Lisp implementations, D is perhaps 15% faster).

I did the best I could to make D run faster, but we hit a limit that's a bit hard to get past now. Happy to be given suggestions (see profiling information in previous messages), but I've run out of ideas myself.

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by H. S. Teoh
in reply to Renato

Permalink

H. S. Teoh

Posted in reply to Renato

Permalink

On Mon, Jan 15, 2024 at 08:10:55PM +0000, Renato via Digitalmars-d-learn wrote:
> On Monday, 15 January 2024 at 01:10:14 UTC, Sergey wrote:
> > On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:
> > > If anyone can find any flaw in my methodology or optmise my code so that it can still get a couple of times faster, approaching Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in `betterC` style?!
> > 
> > I've added port from Rust in the PR comment. Can you please check
> > this solution?
> > Most probably it need to be optimized with profiler. Just
> > interesting how close-enough port will work.
> 
> As discussed on GitHub, the line-by-line port of the Rust code is 5x
> slower than [my latest solution using
> int128](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d),
> which is itself 3 to 4x slower than the Rust implementation (at around
> the same order of magnitude as algorithm-equivalent Java and Common
> Lisp implementations, D is perhaps 15% faster).
> 
> I did the best I could to make D run faster, but we hit a limit that's a bit hard to get past now. Happy to be given suggestions (see profiling information in previous messages), but I've run out of ideas myself.

This problem piqued my interest, so yesterday and today I worked on it and came up with my own solution (I did not look at existing solutions in order to prevent bias).  I have not profiled it or anything, but the runtime seems quite promising.

Here it is:

---------------------------------snip----------------------------------
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }

        if (suffix.empty)
        {
            words ~= word;
            return;
        }

        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the Trie.
     * However, the original form of the word will be retained as-is in the
     * leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R & is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je",
        "105513: jemand",
        "107: neu",
        "1550: Name",
        "253302: Wasser",
        "35: da",
        "38: so",
        "400: Fee",
        "4021: fern",
        "4034: Fest",
        "482: Tor",
        "4824: fort",
        "4824: Torf",
        "51: an",
        "562: mir",
        "562: Mix",
        "56202: Mixer",
        "78: Bo\"",
        "783: bo\"s",
        "7857: blau",
        "7884: Boot",
        "824: Ort",
        "83: o\"d"
    ]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
    if (isOutputRange!(W, string))
{
    bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
    {
        if (node is null)
            return false;

        // Ignore non-digit characters in phone number
        while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
            suffix = suffix[1 .. $];

        if (suffix.empty)
        {
            // Found a match, print result
            foreach (word; node.words)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(only(word))));
            }
            return !node.words.empty;
        }

        bool ret;
        foreach (word; node.words)
        {
            // Found a matching word, try to match the rest of the phone
            // number.
            if (impl(dict, suffix, path ~ word, true))
            {
                allowDigit = false;
                ret = true;
            }
        }

        if (impl(node.edges[suffix[0] - '0'], suffix[1 .. $], path, false))
        {
            allowDigit = false;
            ret = true;
        }

        if (allowDigit)
        {
            // If we got here, it means that if we take the current node as the
            // next word choice, then the following digit will have no further
            // matches, and we may encode it as a single digit.
            auto nextSuffix = suffix[1 .. $];
            if (nextSuffix.empty)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(suffix[0 .. 1].only)));
                ret = true;
            }
            else
            {
                if (impl(dict, suffix[1 .. $], path ~ [ suffix[0] ], false))
                    ret = true;
            }
        }
        return ret;
    }

    impl(dict, phoneNumber[], [], true);
}

/**
 * Encode the given input range of phone numbers according to the given
 * dictionary, writing the output to the given sink.
 */
void encodePhoneNumbers(R,W)(R input, Trie dict, W sink)
    if (isInputRange!R & is(ElementType!R : const(char)[]) &&
        isOutputRange!(W, string))
{
    foreach (line; input)
    {
        findMatches(dict, line, sink);
    }
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto input = [
        "112",
        "5624-82",
        "4824",
        "0721/608-4067",
        "10/783--5",
        "1078-913-5",
        "381482",
        "04824",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (string match) { app.put(match); });

    //writefln("%-(%s\n%)", app.data);
    assert(app.data.sort.release == [
        "04824: 0 Tor 4",
        "04824: 0 Torf",
        "04824: 0 fort",
        "10/783--5: je Bo\" da",
        "10/783--5: je bo\"s 5",
        "10/783--5: neu o\"d 5",
        "381482: so 1 Tor",
        "4824: Tor 4",
        "4824: Torf",
        "4824: fort",
        "5624-82: Mix Tor",
        "5624-82: mir Tor",
    ]);
}

/**
 * Program entry point.
 */
int main(string[] args)
{
    File input = stdin;
    bool countOnly;

    auto info = getopt(args,
        "c|count", "Count solutions only", &countOnly,
    );

    int showHelp()
    {
        stderr.writefln("Usage: %s [options] <dictfile> [<inputfile>]",
                        args[0]);
        defaultGetoptPrinter("", info.options);
        return 1;
    }

    if (info.helpWanted || args.length < 2)
        return showHelp();

    auto dictfile = args[1];
    if (args.length > 2)
        input = File(args[2]);

    Trie dict = loadDictionary(File(dictfile).byLine);

    if (countOnly)
    {
        size_t count;
        encodePhoneNumbers(input.byLine, dict, (string match) { count++; });
        writefln("Number of solutions: %d", count);
    }
    else
    {
        encodePhoneNumbers(input.byLine, dict, (string match) {
            writeln(match);
        });
    }

    return 0;
}
---------------------------------snip----------------------------------

The underlying idea is to store the dictionary entries in a trie (radix tree) keyed by the prescribed digit mapping. Once this is done, encoding a phone number is as simple as following each digit down the trie and enumerating words along the path.  This means that once the dictionary is encoded, we no longer need to do any digit mappings; the digits of the input phone number is enough to determine the path down the trie.

Of course, pursuant to the original specifications of the problem, the original word forms are stored as-is along each node corresponding with the end of the word, so once a match is found the program simply enumerates the words found at that node.

//

Unfortunately there seems to be some discrepancy between the output I got and the prescribed output in your repository. For example, in your output the number 1556/0 does not have an encoding, but isn't "1 Mai 0" a valid encoding according to your dictionary and the original problem description?


T

-- 
Engineering almost by definition should break guidelines. -- Ali Çehreli

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by H. S. Teoh

Permalink

H. S. Teoh

Permalink

On Tue, Jan 16, 2024 at 07:50:35AM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...]
> Unfortunately there seems to be some discrepancy between the output I got and the prescribed output in your repository. For example, in your output the number 1556/0 does not have an encoding, but isn't "1 Mai 0" a valid encoding according to your dictionary and the original problem description?
[...]

Also, found a bug in my program that misses some solutions when the phone number has trailing non-digits. Here's the updated code.  It still finds extra encodings from the output in your repo, though.  Maybe I misunderstood part of the requirements?


------------------------------snip--------------------------------------
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }

        if (suffix.empty)
        {
            words ~= word;
            return;
        }

        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the Trie.
     * However, the original form of the word will be retained as-is in the
     * leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R & is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je",
        "105513: jemand",
        "107: neu",
        "1550: Name",
        "253302: Wasser",
        "35: da",
        "38: so",
        "400: Fee",
        "4021: fern",
        "4034: Fest",
        "482: Tor",
        "4824: fort",
        "4824: Torf",
        "51: an",
        "562: mir",
        "562: Mix",
        "56202: Mixer",
        "78: Bo\"",
        "783: bo\"s",
        "7857: blau",
        "7884: Boot",
        "824: Ort",
        "83: o\"d"
    ]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
    if (isOutputRange!(W, string))
{
    bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
    {
        if (node is null)
            return false;

        // Ignore non-digit characters in phone number
        while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
            suffix = suffix[1 .. $];

        if (suffix.empty)
        {
            // Found a match, print result
            foreach (word; node.words)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(only(word))));
            }
            return !node.words.empty;
        }

        bool ret;
        foreach (word; node.words)
        {
            // Found a matching word, try to match the rest of the phone
            // number.
            if (impl(dict, suffix, path ~ word, true))
            {
                allowDigit = false;
                ret = true;
            }
        }

        if (impl(node.edges[suffix[0] - '0'], suffix[1 .. $], path, false))
        {
            allowDigit = false;
            ret = true;
        }

        if (allowDigit)
        {
            // If we got here, it means that if we take the current node as the
            // next word choice, then the following digit will have no further
            // matches, and we may encode it as a single digit.
            auto nextSuffix = suffix[1 .. $];
            if (nextSuffix.empty)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(suffix[0 .. 1].only)));
                ret = true;
            }
            else
            {
                if (impl(dict, suffix[1 .. $], path ~ [ suffix[0] ], false))
                    ret = true;
            }
        }
        return ret;
    }

    // Trim trailing non-digits from phone number
    auto suffix = phoneNumber[];
    while (!suffix.empty && (suffix[$-1] < '0' || suffix[$-1] > '9'))
    {
        suffix = suffix[0 .. $-1];
    }

    impl(dict, suffix, [], true);
}

/**
 * Encode the given input range of phone numbers according to the given
 * dictionary, writing the output to the given sink.
 */
void encodePhoneNumbers(R,W)(R input, Trie dict, W sink)
    if (isInputRange!R & is(ElementType!R : const(char)[]) &&
        isOutputRange!(W, string))
{
    foreach (line; input)
    {
        findMatches(dict, line, sink);
    }
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto input = [
        "112",
        "5624-82",
        "4824",
        "0721/608-4067",
        "10/783--5",
        "1078-913-5",
        "381482",
        "04824",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (string match) { app.put(match); });

    //writefln("%-(%s\n%)", app.data);
    assert(app.data.sort.release == [
        "04824: 0 Tor 4",
        "04824: 0 Torf",
        "04824: 0 fort",
        "10/783--5: je Bo\" da",
        "10/783--5: je bo\"s 5",
        "10/783--5: neu o\"d 5",
        "381482: so 1 Tor",
        "4824: Tor 4",
        "4824: Torf",
        "4824: fort",
        "5624-82: Mix Tor",
        "5624-82: mir Tor",
    ]);
}

unittest
{
    auto dict = loadDictionary(q"ENDDICT
Bias
ja
Reck
Weib
USA
ENDDICT".splitLines);

    auto input = [
        "/7-357653152/0677-",
        "/8-",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (string match) { app.put(match); });

    assert(app.data.sort.release == [
        "/7-357653152/0677-: USA Bias ja Reck 7",
        "/7-357653152/0677-: USA Bias ja Weib 7",
        "/8-: 8",
    ]);
}

/**
 * Program entry point.
 */
int main(string[] args)
{
    File input = stdin;
    bool countOnly;

    auto info = getopt(args,
        "c|count", "Count solutions only", &countOnly,
    );

    int showHelp()
    {
        stderr.writefln("Usage: %s [options] <dictfile> [<inputfile>]",
                        args[0]);
        defaultGetoptPrinter("", info.options);
        return 1;
    }

    if (info.helpWanted || args.length < 2)
        return showHelp();

    auto dictfile = args[1];
    if (args.length > 2)
        input = File(args[2]);

    Trie dict = loadDictionary(File(dictfile).byLine);

    if (countOnly)
    {
        size_t count;
        encodePhoneNumbers(input.byLine, dict, (string match) { count++; });
        writefln("Number of solutions: %d", count);
    }
    else
    {
        encodePhoneNumbers(input.byLine, dict, (string match) {
            writeln(match);
        });
    }

    return 0;
}
------------------------------snip--------------------------------------


T

-- 
"You are a very disagreeable person." "NO."

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by H. S. Teoh

Permalink

H. S. Teoh

Permalink

P.S. Compiling my program with `ldc -O2`, it runs so fast that I couldn't measure any meaningful running time that's greater than startup overhead.  So I wrote a helper program to generate random phone numbers up to 50 characters long, and found that it could encode 1 million phone numbers in 2.2 seconds (using the 75,000 entry dictionary from your repository).  Counting vs. printing the results made no significant difference to this.


T

-- 
People tell me that I'm skeptical, but I don't believe them.

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by Siarhei Siamashka
in reply to H. S. Teoh

Permalink

Siarhei Siamashka

Posted in reply to H. S. Teoh

Permalink

On Tuesday, 16 January 2024 at 15:50:35 UTC, H. S. Teoh wrote:
> Unfortunately there seems to be some discrepancy between the output I got and the prescribed output in your repository. For example, in your output the number 1556/0 does not have an encoding, but isn't "1 Mai 0" a valid encoding according to your dictionary and the original problem description?

You are not allowed to emit "1" as the first token in the output as long as there are any dictionary word matches at that position. The relevant paragraph from the problem statement:

==snip==
Encodings of phone numbers can consist of a single word or of multiple
words separated by spaces. The encodings are built word by word from
left to right. If and only if at a particular point no word at all from
the dictionary can be inserted, a single digit from the phone number can
be copied to the encoding instead. Two subsequent digits are never
allowed, though. To put it differently: In a partial encoding that
currently covers k digits, digit k+1 is encoded by itself if and only if,
first, digit k was not encoded by a digit and, second, there is no word
in the dictionary that can be used in the encoding starting at digit k+1.
==snip==

I also spent a bit of time trying to figure out this nuance when implementing my solution. It doesn't make much sense visually (no back-to-back digits in the output either way), but that's how it is.

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by Renato
in reply to Siarhei Siamashka

Permalink

Renato

Posted in reply to Siarhei Siamashka

Permalink

On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote:
> On Tuesday, 16 January 2024 at 15:50:35 UTC, H. S. Teoh wrote:
>> Unfortunately there seems to be some discrepancy between the output I got and the prescribed output in your repository. For example, in your output the number 1556/0 does not have an encoding, but isn't "1 Mai 0" a valid encoding according to your dictionary and the original problem description?
>
> You are not allowed to emit "1" as the first token in the output as long as there are any dictionary word matches at that position. The relevant paragraph from the problem statement:
>
> ==snip==
> Encodings of phone numbers can consist of a single word or of multiple
> words separated by spaces. The encodings are built word by word from
> left to right. If and only if at a particular point no word at all from
> the dictionary can be inserted, a single digit from the phone number can
> be copied to the encoding instead. Two subsequent digits are never
> allowed, though. To put it differently: In a partial encoding that
> currently covers k digits, digit k+1 is encoded by itself if and only if,
> first, digit k was not encoded by a digit and, second, there is no word
> in the dictionary that can be used in the encoding starting at digit k+1.
> ==snip==
>
> I also spent a bit of time trying to figure out this nuance when implementing my solution. It doesn't make much sense visually (no back-to-back digits in the output either way), but that's how it is.

Exactly, this is one of the things that make this problem a bit annoying to solve :)

@"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's also what I did when I "participated" in the study. Here's [my Trie solution in Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).

These are basically the two common approaches to the problem: a Trie or a numeric-based table. According to the study, people who use scripting languages almost always go with the numeric approach, while people coming from lower level languages tend to use a data structure like Trie (if they don't know Trie, they come up with something similar which is fascinating), which is harder to implement but more efficient in general.

Can I ask you why didn't you use the [D stdlib Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure that would've worked, but did you consider that?

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by H. S. Teoh
in reply to Renato

Permalink

H. S. Teoh

Posted in reply to Renato

Permalink

On Tue, Jan 16, 2024 at 06:54:56PM +0000, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote:
[...]
> > You are not allowed to emit "1" as the first token in the output as long as there are any dictionary word matches at that position. The relevant paragraph from the problem statement:

Ohhh now I get it.  Initially I misunderstood that as saying that if the rest of the phone number has at least one match, then a digit is not allowed.  Now I see that what it's actually saying is that even if some random dictionary word matches at that position, even if it does not lead to any full matches, then a digit is excluded.

[...]
> > I also spent a bit of time trying to figure out this nuance when implementing my solution. It doesn't make much sense visually (no back-to-back digits in the output either way), but that's how it is.
> 
> Exactly, this is one of the things that make this problem a bit annoying to solve :)

It's a strange requirement, for sure, but I don't think it's annoying.
It makes the problem more Interesting(tm). ;-)

Anyway, I've fixed the problem, now my program produces the exact same output as Renato's repo. Code is posted below. Interestingly enough, the running time has now halved to about 0.9 seconds for 1 million phone numbers. I guess that's caused by the more stringent requirement excluding many more matching possibilities, effectively pruning away large parts of the search tree.

> @"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's
> also what I did when I "participated" in the study. Here's [my Trie
> solution in
> Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).
> 
> These are basically the two common approaches to the problem: a Trie or a numeric-based table. According to the study, people who use scripting languages almost always go with the numeric approach, while people coming from lower level languages tend to use a data structure like Trie (if they don't know Trie, they come up with something similar which is fascinating), which is harder to implement but more efficient in general.

Interesting.  I guess my C/C++ background is showing. ;-)

I'm not sure what exactly motivated me to go this route; I guess it was just my default preference of choosing the path of least work as far as the algorithm is concerned: I chose the algorithm that did the least amount of work needed to produce the right answer.  Scanning through sections of the dictionary to find a match was therefore excluded; so my first thought was an AA. But then how much of the initial prefix to look up an in AA?  Since it can't be known beforehand, I'd have to gradually lengthen the prefix to search for, which does a lot of repetitive work (we keep looking up the first few digits repeatedly each time we search for a longer prefix). Plus, multiple consecutive AA lookups is not cache-friendly.  So my next thought was, what should I do such that I don't have to look at the initial digits anymore once I already processed it?  This line of thought naturally led to a trie structure.

Once I arrived at a trie structure, the next question was how exactly dictionary entries would be stored in it.  Again, in the vein of doing the least amount of work I could get away with, I thought, if I stored words in the trie directly, with each edge encoding a letter, then during the search I'd have to repeatedly convert letters to the corresponding phone number digit and vice versa.  So why not do this conversion beforehand, and store only phone digits in the trie?  This also had the additional benefit of letting me effectively search multiple letters simultaneously, since multiple letters map to the same digit, so scanning a digit is equivalent to searching multiple letters at the same time.  The output, of course, required the original form of the words -- so the obvious solution was to attach the original words as a list of words attached to the trie node representing the end of that word.

Once this was all decided, the only remaining question was the search algorithm. This turned out to take the most time in solving this problem, due to the recursive nature of the search, I had to grapple with where and how to make the recursive calls, and how to propagate return values correctly.  The initial implementation only found word matches, and did not allow the single digits.  Happily, the recursive algorithm turned out to have enough structure to encode the single digit requirements as well, although it took a bit of trial and error to find the correct implementation.

> Can I ask you why didn't you use the [D stdlib Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure that would've worked, but did you consider that?

Haha, I didn't even think of that. :-D  I wouldn't have wanted to use it anyway, because it was optimized for single codepoint lookups, and wouldn't have matched what I wanted to do. Besides, I also understood it as an internal helper structure used by std.uni, so it isn't really designed as a trie implementation for general use.

Also, since performance was an issue in your OP, I wanted to avoid off-the-bat any hidden costs that may be present in a pre-existing trie implementation that I did not fully understand.  Maybe I'm just affected with NIH syndrome, but IME, when performance is important you usually want to write a custom data structure for your inner loop, one that's fine-tuned to your specific algorithm.  Otherwise the impedance mismatch may give you suboptimal results.

T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by H. S. Teoh

Permalink

H. S. Teoh

Permalink

On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...]
> Anyway, I've fixed the problem, now my program produces the exact same output as Renato's repo. Code is posted below.
[...]

Oops, forgot to actually paste the code. Here it is:

------------------------------------snip------------------------------------
/**
 * Encoding phone numbers according to a dictionary.
 */
import std;

/**
 * Table of digit mappings.
 */
static immutable ubyte[dchar] digitOf;
shared static this()
{
    digitOf = [
        'E': 0,
        'J': 1, 'N': 1, 'Q': 1,
        'R': 2, 'W': 2, 'X': 2,
        'D': 3, 'S': 3, 'Y': 3,
        'F': 4, 'T': 4,
        'A': 5, 'M': 5,
        'C': 6, 'I': 6, 'V': 6,
        'B': 7, 'K': 7, 'U': 7,
        'L': 8, 'O': 8, 'P': 8,
        'G': 9, 'H': 9, 'Z': 9,
    ];
}

/**
 * Trie for storing dictionary words according to the phone number mapping.
 */
class Trie
{
    Trie[10] edges;
    string[] words;

    private void insert(string word, string suffix)
    {
        const(ubyte)* dig;
        while (!suffix.empty &&
               (dig = std.ascii.toUpper(suffix[0]) in digitOf) is null)
        {
            suffix = suffix[1 .. $];
        }

        if (suffix.empty)
        {
            words ~= word;
            return;
        }

        auto node = new Trie;
        auto idx = *dig;
        if (edges[idx] is null)
        {
            edges[idx] = new Trie;
        }
        edges[idx].insert(word, suffix[1 .. $]);
    }

    /**
     * Insert a word into the Trie.
     *
     * Characters that don't map to any digit are ignored in building the Trie.
     * However, the original form of the word will be retained as-is in the
     * leaf node.
     */
    void insert(string word)
    {
        insert(word, word[]);
    }

    /**
     * Iterate over all words stored in this Trie.
     */
    void foreachEntry(void delegate(string path, string word) cb)
    {
        void impl(Trie node, string path = "")
        {
            if (node is null) return;
            foreach (word; node.words)
            {
                cb(path, word);
            }
            foreach (i, child; node.edges)
            {
                impl(child, path ~ cast(char)('0' + i));
            }
        }
        impl(this);
    }
}

/**
 * Loads the given dictionary into a Trie.
 */
Trie loadDictionary(R)(R lines)
    if (isInputRange!R & is(ElementType!R : const(char)[]))
{
    Trie result = new Trie;
    foreach (line; lines)
    {
        result.insert(line.idup);
    }
    return result;
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto app = appender!(string[]);
    dict.foreachEntry((path, word) { app ~= format("%s: %s", path, word); });
    assert(app.data == [
        "10: je",
        "105513: jemand",
        "107: neu",
        "1550: Name",
        "253302: Wasser",
        "35: da",
        "38: so",
        "400: Fee",
        "4021: fern",
        "4034: Fest",
        "482: Tor",
        "4824: fort",
        "4824: Torf",
        "51: an",
        "562: mir",
        "562: Mix",
        "56202: Mixer",
        "78: Bo\"",
        "783: bo\"s",
        "7857: blau",
        "7884: Boot",
        "824: Ort",
        "83: o\"d"
    ]);
}

/**
 * Find all encodings of the given phoneNumber according to the given
 * dictionary, and write each encoding to the given sink.
 */
void findMatches(W)(Trie dict, const(char)[] phoneNumber, W sink)
    if (isOutputRange!(W, string))
{
    bool impl(Trie node, const(char)[] suffix, string[] path, bool allowDigit)
    {
        if (node is null)
            return false;

        // Ignore non-digit characters in phone number
        while (!suffix.empty && (suffix[0] < '0' || suffix[0] > '9'))
            suffix = suffix[1 .. $];

        if (suffix.empty)
        {
            // Found a match, print result
            foreach (word; node.words)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(only(word))));
            }
            return !node.words.empty;
        }

        bool ret;
        foreach (word; node.words)
        {
            // Found a matching word, try to match the rest of the phone
            // number.
            ret = true;
            if (impl(dict, suffix, path ~ word, true))
                allowDigit = false;
        }

        if (impl(node.edges[suffix[0] - '0'], suffix[1 .. $], path, false))
        {
            allowDigit = false;
            ret = true;
        }

        if (allowDigit)
        {
            // If we got here, it means that if we take the current node as the
            // next word choice, then the following digit will have no further
            // matches, and we may encode it as a single digit.
            auto nextSuffix = suffix[1 .. $];
            if (nextSuffix.empty)
            {
                put(sink, format("%s: %-(%s %)", phoneNumber,
                                 path.chain(suffix[0 .. 1].only)));
                ret = true;
            }
            else
            {
                if (impl(dict, suffix[1 .. $], path ~ [ suffix[0] ], false))
                    ret = true;
            }
        }
        return ret;
    }

    // Trim trailing non-digits from phone number
    auto suffix = phoneNumber[];
    while (!suffix.empty && (suffix[$-1] < '0' || suffix[$-1] > '9'))
    {
        suffix = suffix[0 .. $-1];
    }

    impl(dict, suffix, [], true);
}

/**
 * Encode the given input range of phone numbers according to the given
 * dictionary, writing the output to the given sink.
 */
void encodePhoneNumbers(R,W)(R input, Trie dict, W sink)
    if (isInputRange!R & is(ElementType!R : const(char)[]) &&
        isOutputRange!(W, string))
{
    foreach (line; input)
    {
        findMatches(dict, line, sink);
    }
}

///
unittest
{
    auto dict = loadDictionary(q"ENDDICT
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
ENDDICT".splitLines);

    auto input = [
        "112",
        "5624-82",
        "4824",
        "0721/608-4067",
        "10/783--5",
        "1078-913-5",
        "381482",
        "04824",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (string match) { app.put(match); });

    //writefln("\n%-(%s\n%)", app.data);
    assert(app.data.sort.release == [
        "04824: 0 Tor 4",
        "04824: 0 Torf",
        "04824: 0 fort",
        "10/783--5: je Bo\" da",
        "10/783--5: je bo\"s 5",
        "10/783--5: neu o\"d 5",
        "381482: so 1 Tor",
        "4824: Tor 4",
        "4824: Torf",
        "4824: fort",
        "5624-82: Mix Tor",
        "5624-82: mir Tor",
    ]);
}

unittest
{
    auto dict = loadDictionary(q"ENDDICT
Bias
ja
Mai
Reck
Weib
USA
ENDDICT".splitLines);

    auto input = [
        "/7-357653152/0677-",
        "/7-3576-",
        "/8-",
        "1556/0",
    ];

    auto app = appender!(string[]);
    encodePhoneNumbers(input, dict, (string match) { app.put(match); });

    //writefln("\n%-(%s\n%)", app.data);
    assert(app.data.sort.release == [
        "/7-357653152/0677-: USA Bias ja Reck 7",
        "/7-357653152/0677-: USA Bias ja Weib 7",
        "/8-: 8",

        /* Note: 1556/0 should NOT encode as "1 Mai 0" because the initial "15"
         * matches "ja", thus excluding a digit in that position. */
    ]);
}

/**
 * Program entry point.
 */
int main(string[] args)
{
    File input = stdin;
    bool countOnly;

    auto info = getopt(args,
        "c|count", "Count solutions only", &countOnly,
    );

    int showHelp()
    {
        stderr.writefln("Usage: %s [options] <dictfile> [<inputfile>]",
                        args[0]);
        defaultGetoptPrinter("", info.options);
        return 1;
    }

    if (info.helpWanted || args.length < 2)
        return showHelp();

    auto dictfile = args[1];
    if (args.length > 2)
        input = File(args[2]);

    Trie dict = loadDictionary(File(dictfile).byLine);

    if (countOnly)
    {
        size_t count;
        encodePhoneNumbers(input.byLine, dict, (string match) { count++; });
        writefln("Number of solutions: %d", count);
    }
    else
    {
        encodePhoneNumbers(input.byLine, dict, (string match) {
            writeln(match);
        });
    }

    return 0;
}
------------------------------------snip------------------------------------


T

-- 
The early bird gets the worm. Moral: ewww...

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by Renato
in reply to H. S. Teoh

Permalink

Renato

Posted in reply to H. S. Teoh

Permalink

On Tuesday, 16 January 2024 at 20:34:48 UTC, H. S. Teoh wrote:

On Tue, Jan 16, 2024 at 12:28:49PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...]

Anyway, I've fixed the problem, now my program produces the exact same output as Renato's repo. Code is posted below.
[...]

Great, I ran the benchmarks for you :)

I had to change how you accept arguments, even though you did "the right thing" using getopt, the solutions should just take a count or print argument first...

Anyway, here's your result:

===> ./rust
./rust,24133632,25
./rust,24739840,130
./rust,24477696,536
./rust,25247744,1064
./rust,8175616,6148
./rust,8306688,8315
===> src/d/dencoder
src/d/dencoder,46055424,43
src/d/dencoder,96337920,146
src/d/dencoder,102350848,542
src/d/dencoder,102268928,1032
src/d/dencoder,40206336,99936
^C

It took too long with the count option, so I had to abort before the last run ended... there's probably some bug there, otherwise the Trie runs very fast, as I had expected.

For the record (I already posted this on GitHub)... here's my current fastest solution time using the same algorithm as Rust (after I fixed my solution that was using int128, which was not valid, as @ssvb pointed out):

Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23986176,24
./rust,24346624,118
./rust,24543232,525
./rust,24346624,1027
./rust,8011776,3830
./rust,8486912,18403
===> src/d/dencoder
src/d/dencoder,13615104,30
src/d/dencoder,24723456,374
src/d/dencoder,24592384,1750
src/d/dencoder,24788992,3472
src/d/dencoder,11517952,10235
src/d/dencoder,11517952,51211

Not great :(

But @ssvb came up with a really fast implementation, though totally different algorithm (so dont' take this as evidence of D being faster than Rust or anything like that):

Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23986176,24
./rust,24461312,135
./rust,24494080,514
./rust,24526848,981
./rust,8175616,3647
./rust,8011776,15404
===> src/d/dencoder
src/d/dencoder,16433152,30
src/d/dencoder,16613376,125
src/d/dencoder,16613376,468
src/d/dencoder,16613376,941
src/d/dencoder,5570560,12
src/d/dencoder,6701056,18

I can't explain why it's so incredibly fast, specially for the count case. I tried using the same hashing function on my solution, but that didn't really help me!

Pretty cool to see different solutions to the problem, but I'm still curious to know where the solution I wrote is being slow compared to Rust which is using identical, to my understanding, code! I'm sure people could write incredibly fast code to solve this problem, but what I am really curious about is what the code I wrote is doing wrong that causes it to run 4x slower than Rust despite doing "the same thing"...

January 16

Re: Help optimize D solution to phone encoding problem: extremely slow performace.

Posted by Siarhei Siamashka
in reply to Renato

Permalink

Siarhei Siamashka

Posted in reply to Renato

Permalink

On Tuesday, 16 January 2024 at 21:15:19 UTC, Renato wrote:

I can't explain why it's so incredibly fast, specially for the count case. I tried using the same hashing function on my solution, but that didn't really help me!

That's dynamic programming with memoization. Basically caching the already calculated results (in the dp array) and avoiding recalculations when the recursive function revisits the same state again.

Top | Forum index | About this forum

Forums