January 20
On Fri, Jan 19, 2024 at 4:44 PM H. S. Teoh via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:

> Taking a look at this code:
> ...


> Try addressing the points I wrote above and see if it makes a difference.
>
>
I have tried it (all of it) even before you wrote it here, because I had
exactly the same ideas, but to be fair it has almost zero effect on
speed.
Here is my version (it still uses OOP, but I have tried it with Printer and
Counter as structs and it has no effect at all):
https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY

The only difference in speed in the end is caused by the hash implementation of D's associative arrays versus Rust's HashMap; in fact, if you modify the Rust code to not use ahash, it has almost the same speed as D.


January 19
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn wrote: [...]
>    > Try addressing the points I wrote above and see if it makes a
>    > difference.
> 
>    I have tried it (all of it) even before you wrote it here, because
>    I have completely the same ideas, but to be fair it has almost zero
>    effect on speed.
>    There is my version (It still use OOP, but I have try it wit
>    Printer and Counter to be structs and it has no effect at
>    all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
>    The only difference in speed in the end is caused by hash
>    implementation of dlang associative arrays and rust HashMap,
>    actually if you modify rust to not used ahash it has almost same
>    speed as D
[...]

I'm confused by the chained hashing of the digits. Why is that necessary?  I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.

I looked up Rust's ahash algorithm. Apparently they leverage the CPU's hardware AES instruction to compute a collision-resistant hash very quickly.

Somebody should file a bug on druntime to implement this where the hardware supports it, instead of the current hashOf. For relatively small keys this would be a big performance boost.


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
January 20

On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:

> On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn wrote: [...]
>
> > > Try addressing the points I wrote above and see if it makes a
> > > difference.
> >
> > I have tried it (all of it) even before you wrote it here, because
> > I have completely the same ideas, but to be fair it has almost zero
> > effect on speed.
> > There is my version (It still use OOP, but I have try it wit
> > Printer and Counter to be structs and it has no effect at
> > all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
> > The only difference in speed in the end is caused by hash
> > implementation of dlang associative arrays and rust HashMap,
> > actually if you modify rust to not used ahash it has almost same
> > speed as D
> [...]
>
> I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.

Hash which key? The problem of the naive hashtable-based algorithm is that we don't know the length of the key. So all lengths are tried, starting from 1 and going up to the length of the remaining part of the phone number. An additional optimization would be to limit this search and terminate it early. For example, stop after we exceed the length of the longest word from the dictionary. Or stop the search upon encountering certain "leaf" words, but this effectively morphs the algorithm into something that partially resembles a Trie. And by evolving it further we may end up with a full-fledged Trie.
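To make that concrete, the core of such a naive length-bounded search looks roughly like this (an illustrative sketch only, with made-up names, keyed by the digit string; it omits the single-digit fallback rule of the original problem):

import std.algorithm.comparison : min;

// Try every candidate key length at the current position, but never longer
// than the longest word in the dictionary, since longer keys cannot match.
void search(string digits, size_t pos, string[] found,
            string[][string] dict, size_t maxWordLength,
            void delegate(string[]) emit)
{
    if (pos == digits.length)
    {
        emit(found);            // one complete encoding of the whole number
        return;
    }
    immutable maxLen = min(digits.length - pos, maxWordLength);
    foreach (len; 1 .. maxLen + 1)
    {
        if (auto words = digits[pos .. pos + len] in dict)
            foreach (w; *words)
                search(digits, pos + len, found ~ w, dict, maxWordLength, emit);
    }
}

Capping the loop at maxWordLength is exactly the early-termination idea mentioned above.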

The only practical value of this algorithm is that it's easy to implement and has a low LOC count. My initial non-optimized naive version was also similar https://gist.github.com/ssvb/abe821b3cdba7fcb7f3c3cecde864153 (66 lines including blank lines and some comments). This is somewhat comparable to the Lisp results https://flownet.com/ron/papers/lisp-java/raw-results.html or to Peter Norvig's solution http://www.norvig.com/java-lisp.html

The purpose of the initial implementation "prototype" is just to have some sort of a starting point. And if the runtime performance doesn't matter, then we are already done. But if the runtime performance does matter, then a lot of things can be rapidly improved by spending a bit of time on it. Eventually one runs out of ideas and further optimization efforts yield only diminishing returns. That's when it's a good time to clean up the code, add comments, etc. and label it as the "final" version.

I think that a good study that intends to compare different programming languages against each other could take all these factors into consideration: time to implement the "prototype", runtime performance of the prototype, time to implement the "final" version, runtime performance of the final version, the number of lines of code, and its readability/maintainability.

January 20

Wow, fantastic feedback from lots of people, thanks so much!

I will try to answer all points raised in the several answers I've got here since yesterday.

On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:

> I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.
>
> I looked up Rust's ahash algorithm. Apparently they leverage the CPU's hardware AES instruction to compute a collision-resistant hash very quickly.
>
> Somebody should file a bug on druntime to implement this where the hardware supports it, instead of the current hashOf. For relatively small keys this would be a big performance boost.
>
> T

This is the commit where I added "incremental hashing". What you're suggesting is something I also tried, but it was much slower.

The reason incremental hashing is much faster than computing the hash of the whole key on each iteration is that you're doing a lot less work. To compute the hash of an array fully, you need to do something like this (it's just how hashing works):

auto currentHash = <something>;
foreach (item; array) {
  currentHash = hashOf(item, currentHash); // fold each item into the running hash
}

But by the structure of the problem, we know that we keep adding items to the array and re-computing the hash (in the printTranslations function, which is the hot path... so ignore the dictionary loading for now). What you're suggesting is to do the above for the full array on each iteration... what I did was optimise it so that we only compute the hash of the newly added item on each iteration - which is the only work that is required.
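In other words (an illustrative sketch, not the actual code from the repo), the incremental version only has to fold the newly appended digit into the hash carried through the recursion, instead of re-running the loop above over the whole key:

// Cost per recursion step: one hashOf call instead of O(key length).
size_t extendHash(size_t previousHash, ubyte nextDigit)
{
    // druntime's hashOf accepts a seed, so the previous hash is chained in.
    return hashOf(nextDigit, previousHash);
}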

As Siarhei Siamashka mentioned: you could keep optimising this using heuristics and knowledge about the problem, at which point you might converge to a Trie implementation! But for the purpose of this comparison, I'm happy to stop at a good enough, kind of general purpose algorithm which is comparable in multiple languages... I might compare the D Trie impl with a Rust Trie impl in the future, but for now, I've had enough of that.

Hope that makes sense now... try to undo it and I believe it will run around 30% slower.

> One of the most important things I found is that every call to printTranslations allocates a new array (auto keyValue = new ubyte[...];). Given the large number of recursions involved in this algorithm, this will add up to quite a lot.

Bingo! I found that before reading your suggestion :) You can see the commit where I changed that here. This one-line change was the most effective one, making the D implementation not only consume a lot less memory, but become quite a bit faster (I will double-check later, but I believe this made the code run almost 50% faster in the longest runs)!
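The general idea (just an illustration of the principle; the actual diff is in the linked commit and may differ) is to allocate the scratch buffer once, up front, instead of on every recursive call:

// Illustrative only: hoist the per-call `new ubyte[...]` out of the recursion
// and reuse a single buffer across calls.
struct Scratch
{
    ubyte[] keyValue;

    this(size_t maxPhoneLength)
    {
        keyValue = new ubyte[maxPhoneLength];   // one allocation for the whole run
    }

    ubyte[] slice(size_t len) { return keyValue[0 .. len]; }
}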

> Secondly, your hash function looks suspicious. Why are you chaining your hash per digit?

I've answered that above... but I went further: instead of using the built-in hashOf function, I decided to use my own hashing function which takes advantage of the input characteristics... I mentioned before the trick I did to make the int128 solution faster (by the way, the int128 solution was not acceptable, as @ssvb pointed out, which is why I had to remove it). This is a very simple hashing function that works well enough for this particular problem, and it also boosted performance noticeably.
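For reference, the shape of that digit-aware hash is roughly this (a sketch, not the exact function from the commit; the real constants may differ):

// Every element is a digit 0..9, so shifting 4 bits per digit packs the key
// into a size_t. Keys only start colliding once the shifts wrap around, i.e.
// for very long inputs.
size_t digitHash(const(ubyte)[] digits)
{
    size_t hash = 0;
    foreach (d; digits)
        hash = (hash << 4) | d;
    return hash;
}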

> Then a minor point: I wouldn't use Array in printTranslations. It's overkill for what you need to do; a built-in array would work just fine.

Hm, sorry to break it to you, but changing that to an Array was the single greatest improvement I had. IIRC it made the code 2x faster.

Maybe I was using native arrays wrongly before... but it's unclear to me how I could've made that better (and the docs themselves suggest using Array if performance matters, which I seem to have confirmed is good advice).

> Next, what stood out is ISolutionHandler. If I were to write this, I wouldn't use OO for this at all, and especially not interfaces, because they involve a double indirection.

As mentioned by Daniel Kozak, this doesn't matter. I tried rewriting it so that there was no indirection to the extent possible (I used delegate fields within a final class, so the functions to use are selected at startup and there's no overhead choosing which functions to call at runtime), and it was actually slightly slower (but too close to call, really). My solution with interfaces had two final classes, so I think there was very little overhead as no vtable was needed, right? You can check my change here; let me know if you can find a flaw in it.
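Roughly, the shape I mean is something like this (illustrative names, not the repository's code):

import std.stdio : writefln, writeln;

// Delegates are bound once at startup, so the hot path just calls through
// two already-chosen function pointers; no interface dispatch is involved.
final class SolutionHandler
{
    void delegate(string phoneNumber, string[] words) onSolution;
    void delegate() onFinished;

    this(bool countOnly)
    {
        size_t count;
        if (countOnly)
        {
            onSolution = (string phoneNumber, string[] words) { ++count; };
            onFinished = () { writeln(count); };
        }
        else
        {
            onSolution = (string phoneNumber, string[] words) {
                writefln("%s: %-(%s %)", phoneNumber, words);
            };
            onFinished = () {};
        }
    }
}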

> Finally, a very small point: in loadDictionary, you do an AA lookup with n in result, and then if that returns null, you create a new entry. This does two AA lookups: once unsuccessfully, and a second time to insert the missing key. You could use the .require operation with a delegate instead of in followed by if (... is null), which only requires a single lookup.

Thanks for pointing that out. I changed that to use require, but it made no perceptible difference... still, I kept this in my "final" implementation because it was much cleaner!
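For anyone following along, the two patterns look like this (illustrative names; `n` and `word` stand for whatever is being inserted):

string[][string] result;    // hypothetical dictionary: digit key -> words

// Two AA lookups: a failed `in` probe followed by a separate insertion.
void addWordTwoLookups(string n, string word)
{
    if (auto entry = n in result)
        *entry ~= word;
    else
        result[n] = [word];
}

// A single lookup: require returns a ref to the entry, creating it from the
// lazy default only when the key is missing.
void addWordOneLookup(string n, string word)
{
    result.require(n, null) ~= word;
}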

> ryuukk_:
> You do a hash map lookup for every character in D, which is slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching.

I rewrote that to use a switch. The speed seems to have increased very slightly for the very small inputs; it's difficult to tell.
But I kept this change anyway because it was simpler and closer to the Rust implementation.

Notice that this table lookup is not in the hot path, so as expected, it only seems to speed up loading the dictionary slightly (which is good because D was behind for small inputs - and even this tiny improvement could help D catch up).
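For context, the switch version of the letter-to-digit mapping looks roughly like this (the table below is the problem's mapping written from memory, so double-check it against the real one):

// Maps a letter to its digit per the phone-number-encoding problem's table.
ubyte charToDigit(char c)
{
    switch (c)
    {
        case 'e', 'E': return 0;
        case 'j', 'n', 'q', 'J', 'N', 'Q': return 1;
        case 'r', 'w', 'x', 'R', 'W', 'X': return 2;
        case 'd', 's', 'y', 'D', 'S', 'Y': return 3;
        case 'f', 't', 'F', 'T': return 4;
        case 'a', 'm', 'A', 'M': return 5;
        case 'c', 'i', 'v', 'C', 'I', 'V': return 6;
        case 'b', 'k', 'u', 'B', 'K', 'U': return 7;
        case 'l', 'o', 'p', 'L', 'O', 'P': return 8;
        case 'g', 'h', 'z', 'G', 'H', 'Z': return 9;
        default: assert(0, "not a letter");
    }
}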

> To save himself from having to write an old-school verbose switch, I suspect he went with a hashmap, which is slower in that case; hence why I keep advocating for that feature, or I should say, that upgrade to switch, which Java has adopted, as well as Rust:

The reason I used a hash table was that I believed it would take advantage of D's compile-time programming and would be even faster than a switch. To be completely sure that's not the case, I would need to benchmark that function individually, but I'm not so interested in knowing that, as it made almost no difference in the implementation's performance anyway.
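For illustration, a compile-time table in D could look something like this (an illustrative sketch; CTFE fills the array at compile time, so the runtime cost is a single index, and the letter groups are again the problem's mapping written from memory):

static immutable byte[128] letterToDigit = () {
    byte[128] t = -1;                         // -1 marks "not a letter"
    immutable string[10] groups = ["e", "jnq", "rwx", "dsy", "ft",
                                   "am", "civ", "bku", "lop", "ghz"];
    foreach (digit, letters; groups)
        foreach (c; letters)
        {
            t[c] = cast(byte) digit;
            t[c - 'a' + 'A'] = cast(byte) digit;  // accept upper case too
        }
    return t;
}();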

> Yet another reason to advocate for pattern matching in D and switch as expression

Hm... not sure your argument holds here, given the above :/

> evilrat:
> There is another important difference, i quickly looked up D associative array implementation and the search looks like nlog(n) complexity with plain loop iteration of hashes, whereas rust's implementation is using "swisstable" algorithm implementation that has packed SIMD optimized lookups, this is likely where the 3x speed difference comes from.

I am not using the default hash implementation in Rust (which is hashbrown, as you've found out). That's too slow (because it's DDoS-secure - then again, Java's hash also is, and it's still faster). I had to replace that with a library called ahash.

If you're interested in knowing more about that, please read my blog post about optimising the Rust code.

So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D.

You can see my custom hashing function here. This function is not good for the general purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.

One more thing I did was to avoid allocations in the print function, so that D would have fewer memory allocations (which had already proved to be a big slowdown) and perform better on the print problem.

This change did make the code faster and it actually caught up with Rust for the smallest input.
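The gist of it (again just an illustration of the idea, not the actual commit) is to build each output line in a reusable buffer instead of concatenating new strings per solution:

import std.array : Appender;
import std.stdio : stdout;

// One growable buffer is reused for every printed solution, so steady-state
// printing does not allocate at all.
void printSolution(string phoneNumber, const(string)[] words)
{
    static Appender!(char[]) line;
    line.clear();
    line ~= phoneNumber;
    line ~= ':';
    foreach (w; words)
    {
        line ~= ' ';
        line ~= w;
    }
    line ~= '\n';
    stdout.rawWrite(line[]);
}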

> Jordan Wilson:
> I agree! Thanks for posting your benchmarks. I thought your whole benchmarking setup was pretty good, and I learnt a lot from your code and the resulting contributions in this thread and others.

Thanks for that :) I am used to being mercilessly roasted in threads like this, because no one likes to be told their favourite language is not so fast :D so any support is highly appreciated!

Anyway, here's the fastest implementation I came up with based on all the feedback here that still resembles the Common Lisp algorithm and the Rust implementation:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/1ee3d858466904a74468eb139675ec7cf2c690d0/src/d/src/dencoder.d

(branch name is dlang-key-hash-incremental-avoid-gc)

Performance on Mac M1:

Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,36
./rust,24100864,144
./rust,23986176,581
./rust,24100864,1152
./rust,7815168,6644
./rust,7766016,11752
===> src/d/dencoder
src/d/dencoder,13385728,31
src/d/dencoder,24477696,262
src/d/dencoder,24477696,1173
src/d/dencoder,24395776,2313
src/d/dencoder,5701632,7108
src/d/dencoder,5767168,13692

Performance on Linux x86_64 (compiled with LDC 1.36.0 and dub -b release-nobounds):

Proc,Memory(bytes),Time(ms)

===> ./rust
./rust,23138304,62
./rust,23138304,236
./rust,23138304,932
./rust,23138304,1828
./rust,9027584,6108
./rust,9027584,13759
===> src/d/dencoder
src/d/dencoder,229851136,62
src/d/dencoder,229851136,403
src/d/dencoder,229851136,1789
src/d/dencoder,229851136,3573
src/d/dencoder,218996736,8254
src/d/dencoder,218996736,21412

Performance on Linux x86_64 using GDC:

Proc,Memory(bytes),Time(ms)

===> ./rust
./rust,25178112,57
./rust,25178112,262
./rust,25178112,1082
./rust,25178112,2084
./rust,11067392,6328
./rust,11067392,35849
===> src/d/dencoder
src/d/dencoder,231968768,72
src/d/dencoder,231968768,635
src/d/dencoder,231968768,2875
src/d/dencoder,231968768,5739
src/d/dencoder,221110272,18195
src/d/dencoder,221110272,128987

D overhead with the fastest compiler, LDC, compared with Rust:

1.0
1.707627119
1.919527897
1.954595186
1.351342502
1.556217748

If anyone ever asks me, I will say that D can be as fast as Rust/C, but only if you're very careful to use the D features that perform well... if you just write clean D code (use the GC, native arrays, etc.), you're likely to get code that is at least 50% slower.

Thanks everyone for all the input.

January 20

On Saturday, 20 January 2024 at 13:07:39 UTC, Renato wrote:

> D overhead with the fastest compiler, LDC, compared with Rust:
>
> 1.0
> 1.707627119
> 1.919527897
> 1.954595186
> 1.351342502
> 1.556217748

Oh sorry, I only posted the ratios for the Linux benchmark...

On Mac M1, for some reason, D was a bit closer to Rust in some of the runs:

0.8611111111
1.819444444
2.018932874
2.0078125
1.069837447
1.165078285
January 20
On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:

> Wow, fantastic feedback from lots of people, thanks so much! ...
>
> > evilrat:
> > There is another important difference, i quickly looked up D
> > associative array implementation and the search looks like
> > nlog(n) complexity with plain loop iteration of hashes, whereas
> > rust's implementation is using "swisstable" algorithm
> > implementation that has packed SIMD optimized lookups, this is
> > likely where the 3x speed difference comes from.
>
> I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`.
>
> If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code.
>
> So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D.
>
> You can [see my custom hashing function
> here](
> https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
> This function is not good for the general purpose case, but it's probably
> as good as it gets for this problem (see my int128 trick in a previous post
> on this thread to understand why). I did try a few variations of hashing
> before settling on this one... Rust, if anything, is at a disadvantage here.
>

And here you are wrong: it is the hashing where the slowdown comes from. It is not only about the speed of the hashing function; it is about the quality of the hashing function for this particular problem, and it is about how the hash table (associative array) is implemented.


January 20

On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:

> On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:
>
> > Wow, fantastic feedback from lots of people, thanks so much! ...
> >
> > > evilrat:
> > > There is another important difference, i quickly looked up D
> > > associative array implementation and the search looks like
> > > nlog(n) complexity with plain loop iteration of hashes, whereas
> > > rust's implementation is using "swisstable" algorithm
> > > implementation that has packed SIMD optimized lookups, this is
> > > likely where the 3x speed difference comes from.
> >
> > I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called ahash.
> >
> > If you're interested in knowing more about that, please read my blog post about optimising the Rust code.
> >
> > So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D.
> >
> > You can see my custom hashing function here. This function is not good for the general purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.
>
> And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented.


I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions except for very large inputs, where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferably you should provide a faster implementation. I will even accept a theoretical explanation if you can give one. What I will not accept is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for in a technical discussion.

January 21
On Sat, Jan 20, 2024 at 9:21 PM Renato via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:

> On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
> > On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:
> >
> >> Wow, fantastic feedback from lots of people, thanks so much! ...
> >>
> >> > evilrat:
> >> > There is another important difference, i quickly looked up D
> >> > associative array implementation and the search looks like
> >> > nlog(n) complexity with plain loop iteration of hashes,
> >> > whereas
> >> > rust's implementation is using "swisstable" algorithm
> >> > implementation that has packed SIMD optimized lookups, this
> >> > is
> >> > likely where the 3x speed difference comes from.
> >>
> >> I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`.
> >>
> >> If you're interested in knowing more about that, please [read
> >> my blog
> >> post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code)
> about optimising the Rust code.
> >>
> >> So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D.
> >>
> >> You can [see my custom hashing function
> >> here](
> >>
> https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3 ).
> >> This function is not good for the general purpose case, but
> >> it's probably
> >> as good as it gets for this problem (see my int128 trick in a
> >> previous post
> >> on this thread to understand why). I did try a few variations
> >> of hashing
> >> before settling on this one... Rust, if anything, is at a
> >> disadvantage here.
> >>
> >
> > And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented.
>
> I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions except for very large inputs where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferrably you should provide a faster implementation. I will even accept a theoretical explanation if you can give one. What I will not accept, is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for on a technical discussion.
>


If you provided info that the hash is perfect, then I am sorry; I have probably missed that in this thread, my fault. Then I will need to look into this more carefully and do much more than just make assumptions.


