Help optimize D solution to phone encoding problem: extremely slow performance.
January 13

I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages, because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language.

You can check my initial blog post about this, which was an analysis of the original study by Prechelt about the productivity differences between programmers using different languages.

The original problem had a flaw when used on modern computers: the programs find solutions so quickly that most of the benchmark time was spent actually printing solutions to stdout. So I modified the problem to add an option to either print the solutions or just count them - the programs still need to do all the work, but are not required to print anything other than how many solutions were found.

Anyway, I ported the Common Lisp solution to D because, like CL, D has built-in data structures like associative arrays and BigInt (at least it's in the stdlib)... I thought this would actually give D an edge! But it turns out D performs very badly for larger input sets. It starts quite well on smaller inputs, but scales very poorly.

My initial Rust solution also performed very poorly, so I'm afraid the same is happening with my initial D solution, despite my best effort to write something "decent".

You can find my D solution in this Pull Request.

The solution is almost identical, to the extent possible, to the Common Lisp solution, which can be found here.

The secret to high performance for the algorithm being used is having a very efficient BigInt implementation and a fast hash function for the hash table (associative array in D, with BigInt as keys). Hence, I suppose D's hash function or BigInt is not that fast (Rust's default hash is also not great due to security concerns, but it's very easy to swap in a much faster custom one by changing a single line of code).
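To illustrate the data structure I mean, here is a minimal sketch (made-up values, not the actual solution): a D associative array keyed by BigInt, where every membership check and index has to hash the big-integer key first.

```d
import std.bigint : BigInt;

void main()
{
    // Words grouped by the number they encode to; every `in` check and
    // index operation hashes the BigInt key (mapping below is made up).
    string[][BigInt] wordsByEncoding;
    wordsByEncoding[BigInt("5624")] ~= "jam";
    wordsByEncoding[BigInt("5624")] ~= "koala";
    assert(BigInt("5624") in wordsByEncoding);
    assert(wordsByEncoding[BigInt("5624")].length == 2);
}
```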

Anyway, here's the current relative performance (the other languages are pretty heavily optimised, the Java solution uses a different algorithm so it's not directly comparable, the Rust solution uses approx. the same algorithm as used in CL and D, but instead of BigInt, it uses a Vec<u8> as key as that turned out to be faster - I may try that technique in D as well - but notice that even using Rust's slower BigInt, it was orders of magnitude faster than D).

Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,109895680,377
java-Main,179634176,1025
java-Main,167149568,1621
java-Main,180203520,2493
java-Main,96481280,6112
java-Main,95526912,7989
===> sbcl --script src/lisp/main.fasl
sbcl,31780864,74
sbcl,79437824,888
sbcl,79613952,3991
sbcl,80654336,7622
sbcl,80420864,18623
sbcl,83402752,29503
===> ./rust
./rust,23257088,58
./rust,23437312,260
./rust,23433216,1077
./rust,23416832,2017
./rust,7106560,6757
./rust,7110656,10165
===> src/d/dencoder
src/d/dencoder,38748160,223
src/d/dencoder,75800576,3154
src/d/dencoder,75788288,14905
src/d/dencoder,75751424,30271

I had to abort D as it was taking too long.

The above is with the print option (that's normally slow when stdout is not buffered, but I did buffer D's writes and it's still very slow).

With the count option, which does not print anything except a number at the end, D took so long I couldn't even collect its results (I waited several minutes)... the other languages finished in much less than a minute:

Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,124112896,7883
java-Main,107487232,9273
===> sbcl --script src/lisp/main.fasl
sbcl,82669568,25638
sbcl,83759104,33501
===> ./rust
./rust,7061504,9488
./rust,7127040,11441
===> src/d/dencoder
^C
(ldc-1.36.0)

I tried to profile the D code but the profiler seems to be broken on my OS (Mac):

▶ dub build -b profile
    Starting Performing "profile" build using /Users/renato/dlang/ldc-1.36.0/bin/ldc2 for x86_64.
    Building prechelt ~master: building configuration [application]
     Linking dencoder
(ldc-1.36.0)
prechelt-phone-number-encoding/src/d  dlang ✗                                                                                 9m ◒
▶ cd ../..
(ldc-1.36.0)
programming/projects/prechelt-phone-number-encoding  dlang ✗                                                                  9m ◒
▶ src/d/dencoder
[1]    17877 illegal hardware instruction  src/d/dencoder
(ldc-1.36.0)

It's a bit hard to do any optimisation while "blind".

So, I decided to post it here to collect some hints about what to try and whether I hit some common performance bottlenecks in my current solution.

Would appreciate anything you can suggest, as long as you respect the fact that I'm still learning D so you can't expect me to know everything about it.

January 13

On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:

>

I tried to profile the D code but the profiler seems to be broken on my OS (Mac):

I profiled it on Linux and stored the trace.log file on a public Gist.

I tried using the HTML viewer recommended in the wiki but that doesn't work... first try:

Running ../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/d-profile-viewer
Corrupt trace.log (can't compute ticks per second), please re-profile and try again

Second try with a longer trace (the one I saved in the gist):

Running ../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/d-profile-viewer
std.conv.ConvException@/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d(2533): Unexpected '-' when converting from type char[] to type ulong
----------------
??:?
 [0x55c4be9bc99e]
??:?
 [0x55c4be9bc612]
??:?
 [0x55c4be9de81e]
??:?
 [0x55c4be9c4fbf]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:2533 [0x55c4be98ba2f]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:2002 [0x55c4be98b6fc]
/home/renato/dlang/ldc-1.36.0/bin/../import/std/conv.d:210 [0x55c4be9665ec]
../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/source/app.d:1095 [0x55c4be965e21]
../../../.dub/packages/d-profile-viewer/~master/d-profile-viewer/source/app.d:1138 [0x55c4be96698a]
??:?
 [0x55c4be9c4c9c]

Not a great profiling experience :). Anyone has a better suggestion to "parse" the trace file?

January 13

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

>

[...]
Not a great profiling experience :). Anyone has a better suggestion to "parse" the trace file?

As a drive-by suggestion and I hope it doesn't derail anything, but if you have the opportunity to run it on linux, have you tried profiling with callgrind instead, with {Q,K}Cachegrind to visualise things? Your repositories probably have them. (callgrind is a part of valgrind.)

The wiki only mentions callgrind in passing, but it has worked well for me. (example)

January 13

On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:

>

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

>

[...]
Not a great profiling experience :). Anyone has a better suggestion to "parse" the trace file?

As a drive-by suggestion and I hope it doesn't derail anything, but if you have the opportunity to run it on linux, have you tried profiling with callgrind instead, with {Q,K}Cachegrind to visualise things? Your repositories probably have them. (callgrind is a part of valgrind.)

The wiki only mentions callgrind in passing, but it has worked well for me. (example)

Thanks for the suggestion, this looks promising as I do have a Linux laptop (just not my main one).

I will have to try it... I thought that BigInt was to blame for the slowness (from what I could read from the trace logs), but after replacing that with basically a byte array key (see commit here) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.

January 13

On Saturday, 13 January 2024 at 19:35:57 UTC, Renato wrote:

>

On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:

>

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

>

[...]
I will have to try it... I thought that BigInt was to blame for the slowness (from what I could read from the trace logs), but after replacing that with basically a byte array key (see commit here) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.

In the repo is hard to find the proper version.
I've checked the Rust from master branch and it looks a bit different from D implementation..

I would suggest to rewrite in the same way as Rust implemented.
Probably you would like to try:

  • do not use BigInt from std. It could be quite slow. Try to use GMP library from Dub instead
  • don't do "idup" every time
  • instead of byLine, try byLineCopy
  • instead of "arr ~= data" try to use Appender (https://dlang.org/library/std/array/appender.html)
  • also you could try to use splitter (https://dlang.org/library/std/algorithm/iteration/splitter.html) to lazily process each part of the data
  • isLastDigit function has many checks, but I think it could be implemented easier in a Rust way
  • also consider to use functions from Range (filter, map) as you use it in Rust, instead of using for loops
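For example, the Appender and splitter ideas could look roughly like this (an illustrative sketch, not your actual parsing code):

```d
import std.array : appender;
import std.algorithm.iteration : splitter;

void main()
{
    // Appender amortises reallocation across many appends, and splitter
    // lazily yields slices of the input instead of allocating an array.
    auto words = appender!(string[]);
    foreach (part; "one two three".splitter(' '))
        words ~= part;
    assert(words[] == ["one", "two", "three"]);
}
```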
January 14

On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:

>

I would suggest to rewrite in the same way as Rust implemented.
Probably you would like to try:
[...]

I would strongly argue for profiling first instead of optimising based on conjecture. If you profile you have solid evidence on what is actually slow. If you're very good at analysing D, well-educated hypotheses may be enough, until they suddenly aren't and you will have spent a lot of time on the wrong problem.

January 14

On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:

>

I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages, because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language.

[...]

Hello Renato,

This seems to be quite a lot of calls:

======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

19204964        3761        3756           0     pure nothrow ref @trusted immutable(char)[][] core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)

19204924        8957        3474           0     @safe void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])

This is when using the words-quarter.txt input (the dictionary.txt input seems to finish much faster, although still slower than java/rust).

I also used only 100 phone numbers as input.

My final observation is that words-quarter.txt contains some 1-letter inputs, (for example, i or m)...this may result in a large number of encoding permutations, which may explain the high number of recursion calls?

Jordan

January 14

On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:

>

On Saturday, 13 January 2024 at 19:35:57 UTC, Renato wrote:

>

On Saturday, 13 January 2024 at 17:00:58 UTC, Anonymouse wrote:

>

On Saturday, 13 January 2024 at 12:55:27 UTC, Renato wrote:

>

[...]
I will have to try it... I thought that BigInt was to blame for the slowness (from what I could read from the trace logs), but after replacing that with basically a byte array key (see commit here) it barely improved. It's still much slower than Common Lisp and very, very far from Java and Rust.

In the repo is hard to find the proper version.
I've checked the Rust from master branch and it looks a bit different from D implementation..

I explicitly linked to the Rust implementation in my question:

>

the Rust solution uses approx. the same algorithm as used in CL and D, but instead of BigInt, it uses a Vec<u8> as key

Can you be more specific about which part of the Rust implementation is not roughly equivalent to the D implementation?? There are obvious differences in style and syntax, but I hope that it's mostly equivalent algorithm-wise.

But to clarify: the branch where the fastest solutions are is called fastest-implementations-print-or-count.

The D branches with my alternative implementations are all called dlang-*.

>

I would suggest to rewrite in the same way as Rust implemented.
Probably you would like to try:

  • do not use BigInt from std. It could be quite slow. Try to use GMP library from Dub instead

Right, but as I posted above, even using a byte-array just like in Rust, the performance was still very bad (but around 2x faster than using BigInt).

Also, using GMP is always suggested to me, but I can't accept that because my goal is not to use C libraries to achieve the fastest speed (which I could do by using C or FFI in any other language), it's to use D (and the other languages) to solve my problem in an idiomatic way.

I ended up using Int128 (the problem description requires more than i64, but int128 is enough), which greatly improved performance over both my D baseline AND the byte-array solution.
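The key construction is essentially this (a sketch with an illustrative name, not the exact committed code):

```d
import std.int128 : Int128;

// Accumulate a phone number's digits into a 128-bit key; non-digit
// characters such as '-' and '/' are skipped.
Int128 keyOf(string phone)
{
    Int128 n = Int128(0);
    foreach (c; phone)
        if (c >= '0' && c <= '9')
        {
            n *= 10;
            n += c - '0';
        }
    return n;
}

void main()
{
    assert(keyOf("562-482") == Int128(562_482));
}
```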

>
  • don't do "idup" every time
  • instead of byLine, try byLineCopy

idup is necessary because the strings are stored in a Map after the iteration ends. Anyway, I replaced that with byLineCopy and it became slightly slower.

>

I was worried about ~= allocating too much, though IIUC it shouldn't be a problem the way I used it... in any case, I replaced it with Appender and the performance was completely unchanged - which is good as I think it means I used ~= correctly to avoid overallocation (or I messed up using Appender :D - pls have a look).

>

This is not necessary because that would only affect the time spent loading the dictionary, which is the faster part of the problem... nearly all of the time is spent looking for solutions instead.

>
  • isLastDigit function has many checks, but I think it could be implemented easier in a Rust way

The Rust solution uses sum types for that, but that had a very minor effect on the overall performance (though in Rust, that's "cleaner" to use anyway)... I know D does have SumType in the stdlib, but I thought it is not as "zero cost" as in Rust, and because both variants wrap a String, it's awkward to use that... so I instead tried using a struct like this:

struct WordOrDigit {
    string value;
    bool isDigit;
}

You can check the commit here.

This change made the code slightly slower.
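For completeness, here's roughly what the stdlib SumType version would look like (a sketch, not what I committed; both variants wrap a string, mirroring the Rust enum):

```d
import std.sumtype : SumType, match;

// Illustrative names: a dictionary word vs. a literal digit kept as a string.
struct Word  { string value; }
struct Digit { string value; }
alias Token = SumType!(Word, Digit);

bool isDigitToken(Token t)
{
    // match dispatches on the active variant
    return t.match!((Digit d) => true, (Word w) => false);
}

void main()
{
    assert(isDigitToken(Token(Digit("1"))));
    assert(!isDigitToken(Token(Word("mir"))));
}
```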

>
  • also consider to use functions from Range (filter, map) as you use it in Rust, instead of using for loops

Why would for loops be slower? Or you're just saying I should make the D code nicer?

Anyway, thanks for the suggestions!

On Sunday, 14 January 2024 at 08:33:24 UTC, Anonymouse wrote:

>

On Saturday, 13 January 2024 at 23:20:32 UTC, Sergey wrote:

>

I would suggest to rewrite in the same way as Rust implemented.
Probably you would like to try:
[...]

I would strongly argue for profiling first instead of optimising based on conjecture. If you profile you have solid evidence on what is actually slow. If you're very good at analysing D, well-educated hypotheses may be enough, until they suddenly aren't and you will have spent a lot of time on the wrong problem.

Totally agree. I will spend some time on my Linux machine to see if I can get more useful profiling data.

Current Performance

For now, with the int128 change from BigInt, the code is about 4x faster than before, but on larger problems, it's still scaling very badly compared to other languages (this suggests there's still some "exponential growth" going on, which should not happen as the problem is not exponential).

Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,23252992,59
./rust,23420928,263
./rust,23412736,1025
./rust,7069696,8661
===> src/d/dencoder
src/d/dencoder,11268096,72
src/d/dencoder,23781376,938
src/d/dencoder,23818240,4443
src/d/dencoder,10723328,134703

On the bright side: D is using almost as little memory as Rust, which is orders of magnitude better than the other, usual GC languages.

January 14

On Sunday, 14 January 2024 at 10:02:58 UTC, Jordan Wilson wrote:

>

On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:

>

I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages, because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language.

[...]

Hello Renato,

This seems to be quite a lot of calls:

======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

19204964        3761        3756           0     pure nothrow ref @trusted immutable(char)[][] core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)

19204924        8957        3474           0     @safe void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])

This is when using the words-quarter.txt input (the dictionary.txt input seems to finish much faster, although still slower than java/rust).

I also used only 100 phone numbers as input.

My final observation is that words-quarter.txt contains some 1-letter inputs, (for example, i or m)...this may result in a large number of encoding permutations, which may explain the high number of recursion calls?

Jordan

The characteristics of the dictionary impact the number of solutions greatly. I explored this in my blog post about Common Lisp, Part 2.

The fact that there's a ridiculous number of function calls is why this problem can take minutes even without allocating much memory or printing anything.

**I've managed to improve the D code enough that it is now faster than Common Lisp and the equivalent algorithm in Java.**

It took some profiling to do that, though... thanks to @Anonymouse for the suggestion to use Valgrind... with that, I was able to profile the code nicely (Valgrind works nicely with D, it doesn't even show mangled names!).

Here's what I did.

First: the solution using int128 was much faster than the BigInt solution, as I had already mentioned. But when profiling that, it was clear that for the problems with a very large number of solutions, GC became a problem:

--------------------------------------------------------------------------------
23,044,096,944 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
7,079,878,197 (30.72%)  ???:core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
2,375,100,857 (10.31%)  ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,971,210,820 ( 8.55%)  ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,922,961,924 ( 8.34%)  ???:_d_arraysetlengthT [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,298,747,622 ( 5.64%)  ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,134,644,706 ( 4.92%)  ???:core.internal.gc.bits.GCBits.setLocked(ulong) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  849,587,834 ( 3.69%)  ???:core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  827,407,988 ( 3.59%)  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
  688,845,027 ( 2.99%)  ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
  575,415,884 ( 2.50%)  ???:_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  562,146,592 ( 2.44%)  ???:core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
  526,067,586 ( 2.28%)  ???:core.internal.spinlock.SpinLock.lock() shared [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]

So, I decided to use std.container.Array (I had tried Appender before, but that didn't really work) to make sure no allocation was happening when adding/removing words from the word list that is carried through the printTranslations calls (that's the hot path).

As you can see in the commit, the code is just slightly less convenient...
this made the code run considerably faster for the very long runs!
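The pattern in the hot path is essentially push/pop on a reused Array, sketched here (walk is an illustrative stand-in for printTranslations):

```d
import std.container.array : Array;

// Reuse one Array as a stack of candidate words along the recursion:
// insertBack/removeBack reuse the Array's capacity instead of touching
// the GC heap on every call.
void walk(ref Array!string words, int depth, ref int leaves)
{
    if (depth == 0)
    {
        leaves++;
        return;
    }
    words.insertBack("word"); // push a candidate
    walk(words, depth - 1, leaves);
    words.removeBack();       // pop on the way back up
}

void main()
{
    Array!string words;
    int leaves;
    walk(words, 3, leaves);
    assert(leaves == 1 && words.empty);
}
```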

Here's the Callgrind profiling data AFTER this change:

6,547,316,944 (32.42%)  ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
5,229,076,596 (25.90%)  ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
3,871,644,402 (19.17%)  ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  677,533,800 ( 3.36%)  ???:std.int128.Int128.toHash() const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  543,388,688 ( 2.69%)  ???:std.int128.Int128.this(core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  543,388,688 ( 2.69%)  ???:std.int128.Int128.this(long, long) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  542,027,045 ( 2.68%)  ???:object.TypeInfo_Struct.getHash(scope const(void*)) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  424,849,937 ( 2.10%)  ???:object.TypeInfo_Struct.equals(in void, in void) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]

No more GC! It seems that now, as expected, most time is spent calculating hashes for each possible solution (as it should be). I am not entirely sure what the _aaInX call is, but I believe that's the associative array's membership check?

The D profiler seems to show similar data:

======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

104001600         158         158           0     const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("*").opBinary(std.int128.Int128)
402933936          63          63           0     const pure nothrow @property @nogc @safe bool std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
183744166          87          59           0     inout pure nothrow ref @property @nogc return @safe inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
103784880          68          38           0     const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+", int).opBinary(const(int))
103784880          99          31           0     pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", int).opOpAssign(int)
104001600          30          30           0     const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+").opBinary(std.int128.Int128)
61167377          67          29           0     pure nothrow @nogc void std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
43444575          61          27           0     pure nothrow @nogc @safe void core.internal.lifetime.emplaceRef!(immutable(char)[], immutable(char)[], immutable(char)[]).emplaceRef(ref immutable(char)[], ref immutable(char)[])
48427530          79          27           0     const pure nothrow @property @nogc @safe bool std.container.array.Array!(immutable(char)[]).Array.empty()
61167377          38          27           0     pure nothrow @nogc void std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__dtor()
43444575         130          24           0     pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
61167376          32          23           0     pure nothrow @nogc @safe void std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__postblit()

Anyway, after this change, the code now scaled much better.

But there's not much left to be optimised... except the arithmetic (notice how the opBinary!("*") call is near the top).

I know how to make that faster using a similar strategy that I used in Rust (you can check my journey to optimise the Rust code on my blog post about that - specifically, check the Using packed bytes for more efficient storage section)... knowing how the encoding works, I knew that multiplication is not really needed. So, instead of doing this:

n *= 10;
n += c - '0';

I could do this:

n <<= 4;
n += c;

This changes the actual number, but that doesn't matter: the code is still unique for each number, which is all that we need.
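In D, the same trick looks like this (a sketch; packedKey is an illustrative name):

```d
import std.int128 : Int128;

// Shift-and-add instead of multiply-by-10: the value no longer equals the
// decimal number, but it's still unique per digit string, which is all the
// hash-table key needs.
Int128 packedKey(string digits)
{
    Int128 n = Int128(0);
    foreach (c; digits)
    {
        n <<= 4;
        n += c;
    }
    return n;
}

void main()
{
    // distinct digit strings yield distinct keys
    assert(packedKey("12") != packedKey("21"));
    assert(packedKey("1") != packedKey("01"));
}
```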

You can see the full commit here.

This improved the performance for sure, even if not by much.

The profiling data after this arithmetic trick looks like this:

Valgrind:

4,821,217,799 (35.75%)  ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
4,241,404,850 (31.45%)  ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
1,038,733,782 ( 7.70%)  ???:core.int128.shl(core.int128.Cent, uint) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  575,372,270 ( 4.27%)  ???:std.int128.Int128.toHash() const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  461,659,456 ( 3.42%)  ???:std.int128.Int128.this(core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  460,297,816 ( 3.41%)  ???:object.TypeInfo_Struct.getHash(scope const(void*)) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  347,198,989 ( 2.57%)  ???:object.TypeInfo_Struct.equals(in void, in void) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
  288,537,160 ( 2.14%)  ???:core.int128.add(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]

D Profiler:

======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

29879388       44037       11267           0     void dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)
126827950        4306        3599           0     inout pure nothrow ref @property @nogc return @safe inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
29879268        7293        3100           0     pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
33534720        4896        2780           0     const pure nothrow @property @nogc @safe bool std.container.array.Array!(immutable(char)[]).Array.empty()
29879268       11834        2479           0     pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
29879268        8702        2279           0     pure nothrow @nogc @safe void std.container.array.Array!(immutable(char)[]).Array.removeBack()
282919518        2045        2045           0     const pure nothrow @property @nogc @safe bool std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
29879268        2534        1859           0     pure nothrow @nogc @safe void core.internal.lifetime.emplaceRef!(immutable(char)[], immutable(char)[], immutable(char)[]).emplaceRef(ref immutable(char)[], ref immutable(char)[])
44511077        3264        1505           0     pure nothrow @nogc void std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
44511076        2588        1254           0     pure nothrow @nogc scope void std.container.array.Array!(immutable(char)[]).Array.__fieldPostblit()
49758574        1684        1243           0     const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+", char).opBinary(const(char))
49758574        2907        1223           0     pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", immutable(char)).opOpAssign(ref immutable(char))
49975294        1796        1208           0     pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("<<", int).opOpAssign(int)

As far as I can tell, the only two bottlenecks now became:

  • std.typecons.RefCounted!(Array.Payload, 0).RefCounted.refCountedPayload()
  • the Int128 implementation!

To fix the former, I would need to implement my own Array, I suppose... using ref-counting here seems unnecessary??

But that is a bit beyond what I am willing to do!

The latter could probably be fixed by not using a numeric key at all, just bytes - but my previous attempt at doing that didn't really give better results. Or perhaps by overriding the hashing for Int128 (I didn't try that because I assume the default should be pretty optimal already)??
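If anyone wants to try the custom-hash idea, I imagine it would look something like this (a completely untested sketch; FastKey and the mixing constant are made up):

```d
import std.int128 : Int128;

// Wrap Int128 and provide a cheap toHash, so AA lookups can skip the
// generic TypeInfo_Struct.getHash path (the hash mix here is arbitrary).
struct FastKey
{
    Int128 value;

    size_t toHash() const pure nothrow @safe
    {
        // mix the two 64-bit halves of the Int128
        return cast(size_t)(value.data.lo ^ (value.data.hi * 0x9E3779B97F4A7C15UL));
    }

    bool opEquals(const FastKey other) const pure nothrow @safe
    {
        return value == other.value;
    }
}

void main()
{
    int[FastKey] counts;
    counts[FastKey(Int128(42))] = 1;
    assert(FastKey(Int128(42)) in counts);
}
```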

So, I ran out of options and am going to call it done.

Here's my final D solution.

The code is still readable (I think everyone can agree D is one of the most readable languages of all)!

You can see how it performs in the last comparison run I did (all data, including a nice chart, in this gist):

Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,3190730752,441
java-Main,3194789888,1440
java-Main,3193692160,2316
java-Main,3194720256,3604
java-Main,3192963072,12861
java-Main,3261128704,31282
===> sbcl --script src/lisp/main.fasl
sbcl,1275994112,83
sbcl,1275998208,1396
sbcl,1275998208,5925
sbcl,1275998208,9752
sbcl,1275998208,64811
sbcl,1276006400,153799
===> ./rust
./rust,23138304,56
./rust,23138304,234
./rust,23138304,1288
./rust,23138304,2444
./rust,9027584,15867
./rust,9027584,36985
===> src/d/dencoder
src/d/dencoder,219041792,67
src/d/dencoder,229904384,1114
src/d/dencoder,229904384,4421
src/d/dencoder,229904384,9087
src/d/dencoder,219041792,51315
src/d/dencoder,219041792,120818

Conclusion

Using Int128 as a key instead of BigInt made the code much faster (around 4x faster).

Using Array to avoid too many reallocations with primitive dynamic arrays made the code run much faster for very large numbers of calls (very little difference when running for less than 10s, but at least 2x faster for runs around 1 minute, where the tiny overhead adds up).

As you can see in the chart I posted, D's speed performance is close to that of Common Lisp for all runs, though between 10 and 20% faster. It's nowhere near Rust, unfortunately, with Rust being almost 4x faster (pls ignore the Java solution as that's using a Trie instead of the "numeric" solution - so it's not a fair comparison). I had expected to get very close to Rust, but that didn't happen... I just can't see in the profiling data what's causing D to fall so far behind!

On the memory data: D uses much less memory than Java and Common Lisp, but still a lot higher than Rust.

If anyone can find any flaw in my methodology or optimise my code so that it can still get a couple of times faster, approaching Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in betterC style?!

January 15

On Sunday, 14 January 2024 at 17:11:27 UTC, Renato wrote:

>

If anyone can find any flaw in my methodology or optmise my code so that it can still get a couple of times faster, approaching Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in betterC style?!

I've added a port from Rust in the PR comment. Can you please check this solution?
Most probably it needs to be optimized with a profiler. Just interested in how a close-enough port performs.
