June 12, 2020 How to squeeze more performance out of gcd? | ||||
---|---|---|---|---|
| ||||
So this week I finished a major feature in one of my projects, which involves adding a custom number type that can do exact arithmetic with quadratic irrationals (numbers of the form (a+b*√r)/c with integer a,b,c, and fixed r). The custom number type, unsurprisingly, performs about 5x worse than using `double`, but with the plus side that arithmetic is exact so I never have to worry about round-off errors.
Profiling a typical use case reveals that the most prominent bottleneck (50+% of total running time) is std.numeric.gcd. Inspecting the disassembly shows that it's using the fast binary GCD algorithm (Stein's algorithm) with the bsf and cmovg instructions (this is with `ldc2 -mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK.
Is there a way to squeeze more juice out of gcd? Like is there some novel algorithm or hack that could trim that 50% figure a little?
One of the reasons gcd is a bottleneck is because the quadratic irrationals are normalized after every operation. I contemplated suppressing some of these normalizations in order to boost performance, but just today I had to fix a major bug caused by coefficient overflow, so I'm not keen on postponing normalization unless there's no other way around it.
T
--
For every argument for something, there is always an equal and opposite argument against it. Debates don't give answers, only wounded or inflated egos.
|
June 13, 2020 Re: How to squeeze more performance out of gcd? | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Friday, 12 June 2020 at 23:27:35 UTC, H. S. Teoh wrote: > So this week I finished a major feature in one of my projects, which involves adding a custom number type that can do exact arithmetic with quadratic irrationals (numbers of the form (a+b*√r)/c with integer a,b,c, and fixed r). The custom number type, unsurprisingly, performs about 5x worse than using `double`, but with the plus side that arithmetic is exact so I never have to worry about round-off errors. > > Profiling a typical use case reveals that the most prominent bottleneck (50+% of total running time) is std.numeric.gcd. Inspecting the disassembly shows that it's using the fast binary GCD algorithm (Stein's algorithm) with the bsf and cmovg instructions (this is with `ldc2 -mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK. > > Is there a way to squeeze more juice out of gcd? Like is there some novel algorithm or hack that could trim that 50% figure a little? > [...] > T I was doing some competitive programming last month and fast-gcd came up, this was the fastest I found back then: https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Of course it turned out the actual fast approach to my problem was to skip the gcd calculations entirely, because they were irrelevant, but that blog post may help you. |
Copyright © 1999-2021 by the D Language Foundation