Thread overview
Looking for more library optimization patterns
Jan 30, 2014
Kai Nacke
Jan 30, 2014
bearophile
Jan 31, 2014
Stanislav Blinov
Feb 10, 2014
Kai Nacke
Feb 11, 2014
Andrea Fontana
Feb 11, 2014
bearophile
Feb 15, 2014
Ivan Kazmenko
Feb 15, 2014
bearophile
Apr 18, 2014
Temtaime
Jul 30, 2014
safety0ff
January 30, 2014
Hi all!

LDC includes the SimplifyDRuntimeCalls pass which should replace D library calls with more efficient ones. This is a clone of a former LLVM pass which did the same for C runtime calls.

An example is that printf("foo\n") is replaced by puts("foo").

Do you know of similar patterns in D which are worth to be implemented?
E.g. replacing std.stdio.writefln("foo") with std.stdio.writeln("foo")

Regards,
Kai
January 30, 2014
Kai Nacke:

> LDC includes the SimplifyDRuntimeCalls pass which should replace D library calls with more efficient ones. This is a clone of a former LLVM pass which did the same for C runtime calls.
>
> An example is that printf("foo\n") is replaced by puts("foo").
>
> Do you know of similar patterns in D which are worth to be implemented?
> E.g. replacing std.stdio.writefln("foo") with std.stdio.writeln("foo")

A de-optimization I'd really like in LDC (and dmd) is to not call the run-time functions when you perform a verctor operation on arrays statically known to be very short:

void main() {
    int[3] a, b;
    a[] += b[];
}

And just rewrite that with a for loop (as done for cases where the run time function is not available), and let ldc2 compile it on its own.

See a thread in D.learn:
http://forum.dlang.org/thread/lqmqsnucadaqlkxkoffc@forum.dlang.org

I'd like to replace code like:


size_t i = 0;
foreach (immutable j, const ref bj; bodies) {
    foreach (const ref bk; bodies[j + 1 .. $]) {
        foreach (immutable m; TypeTuple!(0, 1, 2))
            r[i][m] = bj.x[m] - bk.x[m];
        i++;
    }
}


With:

size_t i = 0;
foreach (immutable j, const ref bj; bodies)
    foreach (const ref bk; bodies[j + 1 .. $])
        r[i++][] = bj.x[] - bk.x[];


And keep the same performance, instead of seeing a significant slowdown.

Bye,
bearophile
January 31, 2014
Being somewhat involved in aforementioned thread, I second that. There seems to be no logical reason to insert a call dSliceOpAssignOrWhateverItIsCalled when all the data is actually there. At least, in -release -O3 builds :)

Granted, this may be tricky for arbitrarily-typed arrays, but for builtins with statically known array sizes? That's a definitive win.
February 10, 2014
Thanks! That are good hints!
I am not sure if I can implement them soon. But I put them on my list for the pass.

If you have more of those hints: please post!

Regards,
Kai
February 11, 2014
On Monday, 10 February 2014 at 19:51:33 UTC, Kai Nacke wrote:
> Thanks! That are good hints!
> I am not sure if I can implement them soon. But I put them on my list for the pass.
>
> If you have more of those hints: please post!
>
> Regards,
> Kai

Don't know if it is the case but what about using appender instead of concatenation (for example for concatenations inside loops...)?

Andrea
February 11, 2014
Kai Nacke:

> If you have more of those hints: please post!

There are plenty of them to find. But I think you need a strategy like this: http://en.wikipedia.org/wiki/Brainstorming  to find them. D conferences should be good for this purpose.

Bye,
bearophile
February 15, 2014
On Thursday, 30 January 2014 at 18:36:09 UTC, bearophile wrote:
> A de-optimization I'd really like in LDC (and dmd) is to not call the run-time functions when you perform a verctor operation on arrays statically known to be very short:
>
> void main() {
>     int[3] a, b;
>     a[] += b[];
> }
>
> And just rewrite that with a for loop (as done for cases where the run time function is not available), and let ldc2 compile it on its own.

For short loops, an unrolled version like
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
may well be faster than a simple loop as the following one:
    foreach (immutable i; 0..3) {
        a[i] += b[i];
    }
At least on x86/64.
Will that optimization happen too, at a later stage?

Ivan Kazmenko.
February 15, 2014
Ivan Kazmenko:

> For short loops, an unrolled version like
>     a[0] += b[0];
>     a[1] += b[1];
>     a[2] += b[2];
> may well be faster than a simple loop as the following one:
>     foreach (immutable i; 0..3) {
>         a[i] += b[i];
>     }
> At least on x86/64.

Yes, but ldc is plenty able to unroll small loops with length known at compile time.

Bye,
bearophile
April 18, 2014
What about SSE?

Can

float[16] a, b;
a[] += b[];

Be optimized by using only 4 addps ?
July 30, 2014
On Thursday, 30 January 2014 at 18:36:09 UTC, bearophile wrote:
>
> A de-optimization I'd really like in LDC (and dmd) is to not call the run-time functions when you perform a verctor operation on arrays statically known to be very short:
>

I just learned about this issue while looking at http://rosettacode.org/wiki/Hamming_numbers#D
(Alternative version 2, opEquals)

It makes me sad to learn that slice / array operations get turned into monstrosities by the compiler.
There are some huge wins to be had in this area.