Thread overview
Libdivide ported to D
May 14, 2017
Tomer Filiba
May 14, 2017
Walter Bright
May 14, 2017
David Nadlinger
May 15, 2017
Jonathan M Davis
May 15, 2017
Walter Bright
May 14, 2017
https://code.dlang.org/packages/divide

Libdivide (http://libdivide.com/) allows converting the DIV instruction (in runtime) to a series of shifts and MULs, which is much more efficient in execution time. It works by taking a number (the divisor or "denominator") and doing some preprocessing to it, after which dividing by it can be ~8 times faster (my own measurements). I use it to divide CPU cycles by the CPU frequency (i.e., two big ugly numbers) to obtain wall time from it.

Of course it only applies to runtime division -- the compiler can do the same if the divisor is known in compile time.

Notes:
* It's a header-only library so I ported the code itself
* I tried to keep my port as mechanical as possible; I can't really say I know what's going on there
* I only ported the POSIX x86-64 code because that's what I needed
* Signes-ness is a big issue, be sure you use the right variant


-tomer
May 14, 2017
On 5/14/2017 3:39 AM, Tomer Filiba wrote:
> https://code.dlang.org/packages/divide
>
> Libdivide (http://libdivide.com/) allows converting the DIV instruction (in
> runtime) to a series of shifts and MULs, which is much more efficient in
> execution time. It works by taking a number (the divisor or "denominator") and
> doing some preprocessing to it, after which dividing by it can be ~8 times
> faster (my own measurements). I use it to divide CPU cycles by the CPU frequency
> (i.e., two big ugly numbers) to obtain wall time from it.
>
> Of course it only applies to runtime division -- the compiler can do the same if
> the divisor is known in compile time.

I hate to say this, but modern compilers already do this for generated runtime code when dividing by a constant. Here's dmd's version:

  https://github.com/dlang/dmd/blob/master/src/ddmd/backend/divcoeff.c
May 14, 2017
On Sunday, 14 May 2017 at 15:30:19 UTC, Walter Bright wrote:
> On 5/14/2017 3:39 AM, Tomer Filiba wrote:
>> Of course it only applies to runtime division -- the compiler can do the same if
>> the divisor is known in compile time.
>
> I hate to say this, but modern compilers already do this for generated runtime code when dividing by a constant. […]

As Tomer points out, a runtime implementation can still be useful if the divisor is only known dynamically (but constant across number of operations).

 — David
May 15, 2017
On Sunday, May 14, 2017 16:20:21 David Nadlinger via Digitalmars-d-announce wrote:
> On Sunday, 14 May 2017 at 15:30:19 UTC, Walter Bright wrote:
> > On 5/14/2017 3:39 AM, Tomer Filiba wrote:
> >> Of course it only applies to runtime division -- the compiler
> >> can do the same if
> >> the divisor is known in compile time.
> >
> > I hate to say this, but modern compilers already do this for generated runtime code when dividing by a constant. […]
>
> As Tomer points out, a runtime implementation can still be useful if the divisor is only known dynamically (but constant across number of operations).

Liran was telling me last year about how the folks at Weka had used this to speed up the stuff in core.time and std.datetime in their local branch and wanted me to look into updating the official implementation to use it (unfortunately, I haven't had the time to spend on D that I would have liked and haven't managed to look into that yet - though that would require putting at least some of this in druntime). I confess though that I was highly confused about the whole thing, because it sounded like this was an optimization that the compiler already did, and yet the Weka guys were having to use libdivide some portion of the time. I suppose that it makes sense though if the issue is that the divisor is not known until runtime. But unfortunately, my understanding of compiler optimizations like this is fairly poor.

- Jonathan M Davis


May 15, 2017
On 5/15/2017 3:51 AM, Jonathan M Davis via Digitalmars-d-announce wrote:
> Liran was telling me last year about how the folks at Weka had used this to
> speed up the stuff in core.time and std.datetime in their local branch and
> wanted me to look into updating the official implementation to use it
> (unfortunately, I haven't had the time to spend on D that I would have liked
> and haven't managed to look into that yet - though that would require
> putting at least some of this in druntime). I confess though that I was
> highly confused about the whole thing, because it sounded like this was an
> optimization that the compiler already did, and yet the Weka guys were
> having to use libdivide some portion of the time. I suppose that it makes
> sense though if the issue is that the divisor is not known until runtime.
> But unfortunately, my understanding of compiler optimizations like this is
> fairly poor.


One can do things like this:

    if (divisor == 10)
        foreach (i; 1..1000)
            result += i / 10;  // compiler generates faster code here
    else
         foreach (i; 1..1000)
            result += i / divisor;

if one knows in advance popular values of divisor. This sort of thing is already done in Phobos when converting numbers <==> strings (optimizing for radix==10).