January 19

On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote:

> On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via Digitalmars-d-learn wrote: [...]
>
> > Ok, last time I'm running this for someone else :D
> >
> > Proc,Run,Memory(bytes),Time(ms)
> > ===> ./rust
> > ./rust,23920640,30
> > ./rust,24018944,147
> > ./rust,24068096,592
> > ./rust,24150016,1187
> > ./rust,7766016,4972
> > ./rust,8011776,46101
> > ===> src/d/dencoder
> > src/d/dencoder,44154880,42
> > src/d/dencoder,51347456,87
> > src/d/dencoder,51380224,273
> > src/d/dencoder,51462144,441
> > src/d/dencoder,18644992,4414
> > src/d/dencoder,18710528,43548

> OK, this piqued my interest enough that I decided to install rust using rustup instead of my distro's package manager. Here are the numbers I got for my machine:
>
> ===> ./rust
> ./rust,22896640,35
> ./rust,22896640,137
> ./rust,22384640,542
> ./rust,22896640,1034
> ./rust,8785920,2489
> ./rust,8785920,12157
> ===> src/d/dencoder
> src/d/dencoder,1066799104,36
> src/d/dencoder,1066799104,72
> src/d/dencoder,1066799104,198
> src/d/dencoder,1066799104,344
> src/d/dencoder,1035292672,2372
> src/d/dencoder,1035292672,13867
>
> Looks like we lost out to Rust for larger inputs. :-D Probably due to environmental factors (and the fact that std.stdio is slow). I re-ran it again and got this:
>
> ===> ./rust
> ./rust,22896640,30
> ./rust,22896640,131
> ./rust,22896640,511
> ./rust,22896640,983
> ./rust,8785920,3102
> ./rust,8785920,9912
> ===> src/d/dencoder
> src/d/dencoder,1066799104,36
> src/d/dencoder,1066799104,71
> src/d/dencoder,1066799104,197
> src/d/dencoder,1066799104,355
> src/d/dencoder,1035292672,3441
> src/d/dencoder,1035292672,9471
>
> Notice the significant discrepancy between the two runs; this seems to show that the benchmark is only accurate up to about ±1.5 seconds.

That's not correct. The discrepancy is due to the benchmark generating different input on each run - and the characteristics of the input affect the runs significantly. This affects the last two runs much more because they use the more challenging dictionary. The important thing is that the relative performance between languages is reliable.

If you stop re-generating the phone numbers and just run the benchmark multiple times using the same input, you'll see it's very close between runs.

> Anyway, oddly enough, Java seems to beat Rust on larger inputs. Maybe my Java compiler has a better JIT implementation? :-P

Again, in case I haven't made it clear by repeating it multiple times: the Java code is using a Trie, the Rust code is using a numeric solution. They are completely different algorithms. The algorithm matters far more than which language is being used, as has been shown again and again.
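Since the thread keeps coming back to this distinction, here is roughly what the Trie side of the comparison looks like: a minimal, hypothetical Rust sketch of a digit-keyed trie (not the code from the repository; the digit encodings used below are purely illustrative):

```rust
use std::collections::HashMap;

// Hypothetical sketch of the trie idea: nodes are keyed by digits, and each
// node stores the dictionary words whose encoding ends exactly there.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    words: Vec<String>,
}

impl TrieNode {
    // Walk/extend the trie along a digit sequence, storing the word at the end.
    fn insert(&mut self, digits: &[u8], word: &str) {
        let mut node = self;
        for &d in digits {
            node = node.children.entry(d).or_default();
        }
        node.words.push(word.to_string());
    }

    // Return the words stored for an exact digit sequence, if any.
    fn lookup(&self, digits: &[u8]) -> Option<&[String]> {
        let mut node = self;
        for &d in digits {
            node = node.children.get(&d)?;
        }
        if node.words.is_empty() { None } else { Some(&node.words) }
    }
}
```

The point of the structure is that the search walks the trie in lockstep with the phone number's digits, so candidate words fall out of the traversal itself, with no hashing of whole keys at each step.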

> > Congratulations on beating Rust :D but remember: you're using a much more efficient algorithm! I must conclude that the Rust translation of the Trie algorithm would be much faster still, unfortunately (you may have noticed that I am on D's side here!).
>
> At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code.
>
> Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal.

Java has been very fast for over a decade now; this is not new at all.

> Surprisingly, at least on my machine, Lisp actually performed the worst. I'd have thought it would at least beat Java, but I was quite wrong. :-D Perhaps the Lisp implementation I'm using is suboptimal, I don't know. Or perhaps modern JVMs have really overtaken Lisp.

Please don't compare different algorithms in different languages and make conclusions about each language's speed.

> Now I'm really curious how a Rust version of the trie algorithm would perform. Unfortunately I don't know Rust so I wouldn't be able to write it myself. (Hint, hint, nudge, nudge ;-)).
>
> As far as the performance of my D version is concerned, I still haven't squeezed out all the performance I could yet. Going into this, my intention was to take the lazy way of optimizing only what the profiler points out to me, with the slight ulterior motive of proving that a relatively small amount of targeted optimizations can go a long way at making the GC a non-problem in your typical D code. ;-) I haven't pulled out all the optimization guns at my disposal yet.

You don't really have to... @ssvb's solution is incredibly fast at the "count" problem and I really don't think anyone can beat that implementation. The only problem is that the implementation is very C-like, nothing like the D I would write.

> If I were to go the next step, I'd split up the impl() function so that I get a better profile of where it's spending most of its time, and then optimize that. My current suspicion is that the traversal of the trie could be improved by caching intermediate results to eliminate a good proportion of recursive calls in impl().
>
> Also, the print mode of operation is quite slow, probably because writefln() allocates. (It allocates less than if I had used .format like I did before, but it nevertheless still allocates.) To alleviate this cost, I'd allocate an output buffer and write to that, flushing only once it filled up.
>
> Another thing I could do is to use std.parallelism.parallel to run searches on batches of phone numbers in parallel. This is kinda cheating, though, since it's the same algorithm with the same cost, we're just putting more CPU cores to work. :-P But in D this is quite easy to do, often as easy as simply adding .parallel to your outer foreach loop. In this particular case it will need some additional refactoring due to the fact that the input is being read line by line. But it's relatively easy to load the input into a buffer by chunks instead, and just run the searches on all the numbers found in the buffer in parallel.
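For what it's worth, the chunk-and-parallelize idea sketched above maps to Rust too. A hypothetical shape using only std::thread (the search function here is a placeholder, not the real encoder; in idiomatic Rust one would likely reach for rayon instead):

```rust
use std::thread;

// Placeholder for the per-number search; the real work would walk the
// dictionary structure for each phone number.
fn count_encodings(number: &str) -> usize {
    number.chars().filter(|c| c.is_ascii_digit()).count()
}

// Run the searches for one buffered chunk of input lines across `workers`
// threads, the same effect as adding .parallel to a foreach loop in D.
fn process_chunk(numbers: &[&str], workers: usize) -> usize {
    // Divide the chunk as evenly as possible; guard against empty input.
    let per = ((numbers.len() + workers - 1) / workers.max(1)).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = numbers
            .chunks(per)
            .map(|batch| s.spawn(move || batch.iter().map(|n| count_encodings(n)).sum::<usize>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

std::thread::scope lets the workers borrow the chunk directly without 'static lifetimes, which is what makes the batch-per-thread split this short.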

I have no interest in doing that because what I was trying to accomplish was to compare the exact same algorithm across languages to the extent possible. That's why I ported the Common Lisp code to D, then tried to optimize from there, just like I did for Rust. I find the numeric implementation much more "valuable" for comparing performance than the Trie solution because it exercises quite a few things in the language that are common to most programs, like hashing, recursion and "search" in general.

Do you know why the whole thread seems to have disappeared?? There's a lot of good stuff in the thread, it would be a huge shame to lose all that!

January 19
> Looks like we lost out to Rust for larger inputs. :-D Probably due to environmental factors (and the fact that std.stdio is slow). I re-ran it again and got this:
>
> ===> ./rust
> ./rust,22896640,30
> ./rust,22896640,131
> ./rust,22896640,511
> ./rust,22896640,983
> ./rust,8785920,3102
> ./rust,8785920,9912
> ===> src/d/dencoder
> src/d/dencoder,1066799104,36
> src/d/dencoder,1066799104,71
> src/d/dencoder,1066799104,197
> src/d/dencoder,1066799104,355
> src/d/dencoder,1035292672,3441
> src/d/dencoder,1035292672,9471

I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.

This is what I would like to be discussing in this thread: why is D running at Java speeds and not at D speeds when using the same algorithm? I know there are small differences in the implementations, they are different languages after all, but not enough IMO to justify anything like a 3x difference from Rust.

If you really want to show me that D is as fast as Rust, please spend time on my solution (because that's a direct comparison, algorithmically, to the Rust implementation) and remove any unnecessary overhead from it (without changing the overall algorithm, of course - unless that's to make it closer to the Rust implementation).

> Do you know why the whole thread seems to have disappeared?? There's a lot of good stuff in the thread, it would be a huge shame to lose all that!

Never mind - it looks like I was unable to open any other answers in this thread while writing my own answer!?

January 19

On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:

> I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.
>
> This is what I would like to be discussing in this thread: why is D running at Java speeds and not at D speeds when using the same algorithm? I know there's small differences in the implementations, they are different languages after all, but not enough IMO to justify anything like 3x difference from Rust.

My guess is that int128 is not that well optimized, since it's not a very popular type; to answer what's actually wrong would require looking at the assembly code produced for both D and Rust.

Additionally, if you're comparing D by measuring DMD performance - don't. It's valuable for fast iteration during development, but it lacks many modern optimization techniques; for that we have LDC and GDC.

January 19

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

> On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
>
> > I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.
> >
> > This is what I would like to be discussing in this thread: why is D running at Java speeds and not at D speeds when using the same algorithm? I know there's small differences in the implementations, they are different languages after all, but not enough IMO to justify anything like 3x difference from Rust.
>
> My guess is that int128 is not that well optimized, since it's not a very popular type; to answer what's actually wrong would require looking at the assembly code produced for both D and Rust.
>
> Additionally, if you're comparing D by measuring DMD performance - don't. It's valuable for fast iteration during development, but it lacks many modern optimization techniques; for that we have LDC and GDC.

I am not using int128 anymore; I explained why a few posts back. I am using a byte array and computing the hash incrementally when trying different inputs, so that partially computed hashes are re-used on each try (this is cheating a bit, as Rust is not doing that, but I consider it acceptable as it's still computing hashes and looking up entries in the associative array).
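To make the incremental-hashing trick concrete: the point is that extending a key by one byte only costs one hash step, because the hash of the prefix is already known. A sketch of the idea, using FNV-1a purely for illustration (the actual D code may well use a different hash function):

```rust
// FNV-1a constants (64-bit).
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

// Extend a running hash by one byte: the hash of "prefix + byte" is derived
// from the already-computed hash of "prefix", so nothing is rehashed.
fn hash_extend(prefix_hash: u64, byte: u8) -> u64 {
    (prefix_hash ^ byte as u64).wrapping_mul(FNV_PRIME)
}

// Hashing a whole key from scratch is just the fold of hash_extend.
fn hash_all(key: &[u8]) -> u64 {
    key.iter().fold(FNV_OFFSET, |h, &b| hash_extend(h, b))
}
```

So when the search backtracks and tries a different next byte, only the final hash_extend step has to be redone, not the whole key.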

I used all D compilers and picked the fastest one (GDC in the case of int128, but LDC2 in the current case).

January 19

On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:

> On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
>
> > I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.
>
> Additionally, if you're comparing D by measuring DMD performance - don't. It's valuable for fast iteration during development, but it lacks many modern optimization techniques; for that we have LDC and GDC.

I tried with DMD again, and yeah, it's much slower.

Here's the current implementation in D, and the roughly equivalent Rust implementation.

The only "significant" difference is that in Rust, an enum WordOrDigit is used to represent the currently known "words"... I did try using that in D, but it made the algorithm slower.

If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know.

Notice that almost all of the time is spent in the for-loop inside printTranslations (which is misnamed, as it doesn't necessarily "print" anything anymore, like it did earlier) - the rest of the code almost doesn't matter.

Current performance comparison:

Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
January 19
On Fri, Jan 19, 2024 at 01:40:39PM +0000, Renato via Digitalmars-d-learn wrote:
> On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
[...]
> > Additionally if you comparing D by measuring DMD performance - don't.  It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
> 
> I tried with DMD again, and yeah, it's much slower.

For anything where performance is even remotely important, I wouldn't even consider DMD.  It's a well-known fact that it produces suboptimal executables.  Its only redeeming quality is its fast turnaround time.  If fast turnaround is not important, I would always use LDC or GDC instead.


> Here's the [current implementation in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d),
> and the roughly [equivalent Rust
> implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).

Taking a look at this code:

One of the most important things I found is that every call to printTranslations allocates a new array (`auto keyValue = new ubyte[...];`).  Given the large number of recursions involved in this algorithm, this will add up to quite a lot.  If I were optimizing this code, I'd look into ways of reducing, if not eliminating, this allocation.  Observe that this allocation is needed each time printTranslations recurses, so instead of making separate allocations, you could put it on a stack. Either with alloca, or with my appendPath() trick in my version of the code: preallocate a reasonably large buffer and take slices of it each time you need a new keyValue array.
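The preallocate-and-slice idea, shown here as a Rust-flavoured sketch since that's the other language in the benchmark (a hypothetical shape, not the actual dencoder code): one reusable buffer acts as the key stack; each recursion level appends its bytes and truncates back on the way out, so no per-call allocation happens.

```rust
// Each recursion level pushes onto the shared buffer, recurses, and then
// truncates back to its saved mark, so the only allocations left are for
// the final results themselves. The two-symbol alphabet is a stand-in for
// whatever candidate bytes the real search would try at each position.
fn visit(buf: &mut Vec<u8>, depth: usize, out: &mut Vec<Vec<u8>>) {
    if depth == 0 {
        out.push(buf.clone());
        return;
    }
    let mark = buf.len();
    for b in [b'0', b'1'] {
        buf.push(b);
        visit(buf, depth - 1, out);
        buf.truncate(mark); // undo this level's extension; buffer is reused
    }
}
```

Truncating a Vec never frees or reallocates, so the buffer's capacity is paid for once and reused for the entire traversal.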

Secondly, your hash function looks suspicious. Why are you chaining your hash per digit?  That's a lot of hash computations.  Shouldn't you just hash the entire key each time?  That would eliminate the need to store a custom hash in your key, you could just lookup the entire key at once.

Next, what stood out is ISolutionHandler.  If I were to write this, I wouldn't use OO for this at all, and especially not interfaces, because they involve a double indirection.  I'd just return a delegate instead (single indirection, no object lookup).  This is a relatively small matter, but when it's being used inside a hot inner loop, it could be important.

Then a minor point: I wouldn't use Array in printTranslations. It's overkill for what you need to do; a built-in array would work just fine. Take a look at the implementation of Array and you'll see lots of function calls and locks and GC root-adding and all that stuff. Most of it doesn't apply here, of course, and is compiled out. Nevertheless, it uses wrapped integer operations and range checks, etc.. Again, these are all minor issues, but in a hot inner loop they do add up. Built-in arrays let you literally just bump the pointer when adding an element. Just a couple of instructions as opposed to several function calls. Important difference when you're on the hot path.  Now, as I mentioned earlier w.r.t. my own code, appending to built-in arrays comes with a cost. So here's where you'd optimize by creating your own buffer and custom push/pop operations. Something like appendPath() in my version of the code would do the job.

Finally, a very small point: in loadDictionary, you do an AA lookup with `n in result`, and then if that returns null, you create a new entry. This does two AA lookups, one unsuccessful, and a second to insert the missing key.  You could use the .require operation with a delegate instead of `in` followed by `if (... is null)`, which requires only a single lookup.  Probably not an important point, but for a large dictionary it might make a difference.
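For comparison, the analogous single-lookup pattern in Rust is the HashMap entry API (sketch with hypothetical names; the real loader's key and value types may differ):

```rust
use std::collections::HashMap;

// Find-or-create the slot for `key` in one probe, then append the word;
// the same effect as D's AA .require with a delegate.
fn add_word<'a>(dict: &mut HashMap<u64, Vec<&'a str>>, key: u64, word: &'a str) {
    dict.entry(key).or_insert_with(Vec::new).push(word);
}
```

`entry` returns a handle to the slot whether or not it was occupied, so the "check, then insert" dance collapses into one hash-and-probe.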


> The only "significant" difference is that in Rust, an enum
> `WordOrDigit` is used to represent currently known "words"... I [did
> try using that in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d),
> but it made the algorithm slower.
> 
> If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing , please let me know.

Couldn't tell you, I don't know Rust. :-D


> Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed as it doesn't necessarily "print" anything, like it did earlier) - the rest of the code almost doesn't matter.
[...]

Of course, that's where your hot path is.  And that loop makes recursive calls to printTranslations, so the entire body of the function could use some optimization. ;-)

Try addressing the points I wrote above and see if it makes a difference.


T

-- 
The two rules of success: 1. Don't tell everything you know. -- YHL
January 19

On Friday, 19 January 2024 at 13:40:39 UTC, Renato wrote:

> On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
>
> > On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
> >
> > > I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.
> >
> > Additionally, if you're comparing D by measuring DMD performance - don't. It's valuable for fast iteration during development, but it lacks many modern optimization techniques; for that we have LDC and GDC.
>
> I tried with DMD again, and yeah, it's much slower.
>
> Here's the current implementation in D, and the roughly equivalent Rust implementation.
>
> The only "significant" difference is that in Rust, an enum WordOrDigit is used to represent the currently known "words"... I did try using that in D, but it made the algorithm slower.
>
> If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know.
>
> Notice that almost all of the time is spent in the for-loop inside printTranslations (which is misnamed, as it doesn't necessarily "print" anything anymore, like it did earlier) - the rest of the code almost doesn't matter.
>
> Current performance comparison:
>
> [...]
>
> So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:
>
> [...]

You do a hash map lookup for every character in D, and that's slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching.

Yet another reason to advocate for pattern matching in D, and switch as an expression.

January 19

On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:

> You do a hash map lookup for every character in D, and that's slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching.
>
> Yet another reason to advocate for pattern matching in D, and switch as an expression.

There is another important difference: I quickly looked up the D associative array implementation, and the search looks like n·log(n) complexity, with a plain loop iterating over hashes, whereas Rust's implementation uses the "swisstable" algorithm, which has packed SIMD-optimized lookups. This is likely where the 3x speed difference comes from.

I tried to look up the Rust implementation, and it is SOOO generic that I was unable to decipher it to find the actual key search and stores.

Anyway, here is an interesting article about the Rust implementation:
https://faultlore.com/blah/hashbrown-tldr/

January 19

On Friday, 19 January 2024 at 08:57:40 UTC, Renato wrote:

> Do you know why the whole thread seems to have disappeared?? There's a lot of good stuff in the thread, it would be a huge shame to lose all that!

I agree! Thanks for posting your benchmarks. I thought your whole benchmarking setup was pretty good, and I learned a lot from your code and the resulting contributions in this thread and others.

Jordan

January 19

On Friday, 19 January 2024 at 17:18:36 UTC, evilrat wrote:

> On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:
>
> > You do a hash map lookup for every character in D, and that's slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching.
> >
> > Yet another reason to advocate for pattern matching in D, and switch as an expression.
>
> There is another important difference: I quickly looked up the D associative array implementation, and the search looks like n·log(n) complexity, with a plain loop iterating over hashes, whereas Rust's implementation uses the "swisstable" algorithm, which has packed SIMD-optimized lookups. This is likely where the 3x speed difference comes from.
>
> I tried to look up the Rust implementation, and it is SOOO generic that I was unable to decipher it to find the actual key search and stores.
>
> Anyway, here is an interesting article about the Rust implementation:
> https://faultlore.com/blah/hashbrown-tldr/

I'm not talking about the difference between the hash map implementations, but the difference between the algorithms used to look up the characters:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/9b1d7f026943841638a2729922cf000b1b3ce655/src/java/Main2.java#L106-L134

vs

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/d/src/dencoder.d#L10-L49

If D had pattern matching and switch as an expression, the faster method would be:

1. the most obvious choice
2. the fastest by default
3. the cleanest

To save himself from writing an old-school verbose switch, I suspect he went with a hashmap, which is slower in that case; hence why I keep advocating for that feature, or I should say, that upgrade to switch, which Java has adopted, as well as Rust:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/rust/phone_encoder/src/main.rs#L146-L168
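For readers following along, the linked pattern-matching code boils down to something like this simplified sketch (using the Prechelt digit table as I recall it; check the repository for the exact mapping and the edge cases it handles):

```rust
// Letter-to-digit mapping as a match expression: the compiler can lower
// this to a jump table or lookup table, with no hashing and no
// heap-allocated map involved at all.
fn char_to_digit(c: char) -> Option<u8> {
    Some(match c.to_ascii_lowercase() {
        'e' => 0,
        'j' | 'n' | 'q' => 1,
        'r' | 'w' | 'x' => 2,
        'd' | 's' | 'y' => 3,
        'f' | 't' => 4,
        'a' | 'm' => 5,
        'c' | 'i' | 'v' => 6,
        'b' | 'k' | 'u' => 7,
        'l' | 'o' | 'p' => 8,
        'g' | 'h' | 'z' => 9,
        _ => return None, // non-letters (digits, quotes, dashes) carry no encoding
    })
}
```

Compared to a per-character hash lookup, there is no hash computation, no probing, and no pointer chasing on this path, which is the crux of the argument above.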