June 12, 2020
So this week I finished a major feature in one of my projects, which involves adding a custom number type that can do exact arithmetic with quadratic irrationals (numbers of the form (a+b*√r)/c with integer a,b,c, and fixed r). The custom number type, unsurprisingly, performs about 5x worse than using `double`, but with the plus side that arithmetic is exact so I never have to worry about round-off errors.

Profiling a typical use case reveals that the most prominent bottleneck (50+% of total running time) is std.numeric.gcd.  Inspecting the disassembly shows that it's using the fast binary GCD algorithm (Stein's algorithm) with the bsf and cmovg instructions (this is with `ldc2 -mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK.

Is there a way to squeeze more juice out of gcd?  Like is there some novel algorithm or hack that could trim that 50% figure a little?

One of the reasons gcd is a bottleneck is because the quadratic irrationals are normalized after every operation. I contemplated suppressing some of these normalizations in order to boost performance, but just today I had to fix a major bug caused by coefficient overflow, so I'm not keen on postponing normalization unless there's no other way around it.


T

-- 
For every argument for something, there is always an equal and opposite argument against it. Debates don't give answers, only wounded or inflated egos.
June 13, 2020
On Friday, 12 June 2020 at 23:27:35 UTC, H. S. Teoh wrote:
> So this week I finished a major feature in one of my projects, which involves adding a custom number type that can do exact arithmetic with quadratic irrationals (numbers of the form (a+b*√r)/c with integer a,b,c, and fixed r). The custom number type, unsurprisingly, performs about 5x worse than using `double`, but with the plus side that arithmetic is exact so I never have to worry about round-off errors.
>
> Profiling a typical use case reveals that the most prominent bottleneck (50+% of total running time) is std.numeric.gcd.  Inspecting the disassembly shows that it's using the fast binary GCD algorithm (Stein's algorithm) with the bsf and cmovg instructions (this is with `ldc2 -mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK.
>
> Is there a way to squeeze more juice out of gcd?  Like is there some novel algorithm or hack that could trim that 50% figure a little?
> [...]
> T

I was doing some competitive programming last month and fast-gcd came up, this was the fastest I found back then: https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/

Of course it turned out the actual fast approach to my problem was to skip the gcd calculations entirely, because they were irrelevant, but that blog post may help you.