Looking for more library optimization patterns

Hi all!

LDC includes the SimplifyDRuntimeCalls pass, which should replace D library calls with more efficient ones. This is a clone of a former LLVM pass which did the same for C runtime calls.

An example is that printf("foo\n") is replaced by puts("foo").

Do you know of similar patterns in D that are worth implementing?
E.g. replacing std.stdio.writefln("foo") with std.stdio.writeln("foo")

Regards,
Kai
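To make the writefln/writeln pattern concrete, here is a minimal sketch of the condition under which such a rewrite looks safe: the format string contains no '%', so formatting is a no-op. The helper name isPlainFormatString is hypothetical and is not part of LDC or of the SimplifyDRuntimeCalls pass; a real pass would work on the compiler's intermediate representation rather than on D source.

```d
import std.algorithm.searching : canFind;
import std.format : format;
import std.stdio : writefln, writeln;

// Hypothetical helper (not part of LDC): a format string without '%'
// contains no format specifiers, so writefln prints it unchanged.
bool isPlainFormatString(string fmt)
{
    return !fmt.canFind('%');
}

void main()
{
    enum msg = "foo";
    assert(isPlainFormatString(msg));
    assert(format(msg) == msg); // formatting leaves the text untouched

    writefln(msg); // scans the format string at run time
    writeln(msg);  // prints the same text without format parsing
}
```

A string such as "%d items" fails the check and would have to keep the writefln call.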

January 30, 2014

Re: Looking for more library optimization patterns

Posted by bearophile
in reply to Kai Nacke

Kai Nacke:

> LDC includes the SimplifyDRuntimeCalls pass which should replace D library calls with more efficient ones. This is a clone of a former LLVM pass which did the same for C runtime calls.
>
> An example is that printf("foo\n") is replaced by puts("foo").
>
> Do you know of similar patterns in D that are worth implementing?
> E.g. replacing std.stdio.writefln("foo") with std.stdio.writeln("foo")

A de-optimization I'd really like in LDC (and dmd) is to not call the run-time functions when you perform a vector operation on arrays statically known to be very short:

void main() {
    int[3] a, b;
    a[] += b[];
}

And just rewrite that with a for loop (as done for cases where the run time function is not available), and let ldc2 compile it on its own.
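As an illustration of the rewrite being asked for, the sketch below puts the array operation and the equivalent explicit loop side by side and checks that they agree. The function names addWithVectorOp and addWithLoop are made up for this example, and the loop is written by hand here; it is not a description of how LDC actually lowers the expression.

```d
// Hypothetical helper: the array (vector) operation as written in the post.
int[3] addWithVectorOp(int[3] a, const int[3] b)
{
    a[] += b[];
    return a;
}

// Hypothetical helper: the same computation as a plain loop over a
// length known at compile time, left for the optimizer to handle.
int[3] addWithLoop(int[3] a, const int[3] b)
{
    foreach (i; 0 .. a.length)
        a[i] += b[i];
    return a;
}

void main()
{
    int[3] a = [1, 2, 3];
    int[3] b = [10, 20, 30];

    assert(addWithVectorOp(a, b) == [11, 22, 33]);
    assert(addWithLoop(a, b) == addWithVectorOp(a, b));
}
```

Static arrays are passed by value in D, so neither helper modifies the caller's array.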

See a thread in D.learn:
http://forum.dlang.org/thread/lqmqsnucadaqlkxkoffc@forum.dlang.org

I'd like to replace code like:


size_t i = 0;
foreach (immutable j, const ref bj; bodies) {
    foreach (const ref bk; bodies[j + 1 .. $]) {
        foreach (immutable m; TypeTuple!(0, 1, 2))
            r[i][m] = bj.x[m] - bk.x[m];
        i++;
    }
}


with:

size_t i = 0;
foreach (immutable j, const ref bj; bodies)
    foreach (const ref bk; bodies[j + 1 .. $])
        r[i++][] = bj.x[] - bk.x[];


And keep the same performance, instead of seeing a significant slowdown.
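For readers who want to try the two forms, here is a self-contained sketch that wraps each one in a function and checks that the results match. The Body struct, the function names, and the sample data are assumptions made for this example; the thread does not show the real definitions. std.meta.AliasSeq stands in for the older TypeTuple. The sketch only checks equivalence and makes no claim about timing.

```d
import std.meta : AliasSeq;

// Hypothetical stand-in for the body type used in the thread.
struct Body
{
    double[3] x;
}

// First form: per-component assignments; the AliasSeq foreach is
// expanded at compile time, so no run-time array-op call is involved.
double[3][] diffsUnrolled(const Body[] bodies)
{
    auto r = new double[3][](bodies.length * (bodies.length - 1) / 2);
    size_t i = 0;
    foreach (immutable j, const ref bj; bodies)
    {
        foreach (const ref bk; bodies[j + 1 .. $])
        {
            foreach (m; AliasSeq!(0, 1, 2))
                r[i][m] = bj.x[m] - bk.x[m];
            i++;
        }
    }
    return r;
}

// Second form: the same differences written as an array operation.
double[3][] diffsVectorOp(const Body[] bodies)
{
    auto r = new double[3][](bodies.length * (bodies.length - 1) / 2);
    size_t i = 0;
    foreach (immutable j, const ref bj; bodies)
        foreach (const ref bk; bodies[j + 1 .. $])
            r[i++][] = bj.x[] - bk.x[];
    return r;
}

void main()
{
    auto bodies = [Body([0.0, 0.0, 0.0]),
                   Body([1.0, 2.0, 3.0]),
                   Body([4.0, 6.0, 8.0])];

    auto r = diffsUnrolled(bodies);
    assert(r.length == 3);
    assert(r[0] == [-1.0, -2.0, -3.0]);
    assert(r == diffsVectorOp(bodies));
}
```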

Bye,
bearophile

Being somewhat involved in the aforementioned thread, I second that. There seems to be no logical reason to insert a call to dSliceOpAssignOrWhateverItIsCalled when all the data is actually there. At least in -release -O3 builds :)

Granted, this may be tricky for arbitrarily-typed arrays, but for builtins with statically known array sizes? That's a definite win.

On Monday, 10 February 2014 at 19:51:33 UTC, Kai Nacke wrote:
> Thanks! Those are good hints!
> I am not sure if I can implement them soon. But I put them on my list for the pass.
>
> If you have more of those hints: please post!
>
> Regards,
> Kai

I don't know if it applies here, but what about using appender instead of concatenation (for example, for concatenations inside loops)?

Andrea
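The appender suggestion can be sketched as a manual source-level rewrite. The two functions below are hypothetical examples, not anything LDC does today; they build the same string, once with ~= inside the loop and once with std.array.appender. Whether a compiler pass could perform this transformation automatically is left open here.

```d
import std.array : appender;
import std.conv : to;

// Hypothetical example: repeated concatenation inside a loop.
// Each ~= may have to reallocate as the string grows.
string joinWithConcat(int n)
{
    string s;
    foreach (i; 0 .. n)
        s ~= i.to!string;
    return s;
}

// The same result built with an Appender, which manages its own
// buffer capacity while the pieces are added.
string joinWithAppender(int n)
{
    auto app = appender!string();
    foreach (i; 0 .. n)
        app.put(i.to!string);
    return app.data;
}

void main()
{
    assert(joinWithConcat(5) == "01234");
    assert(joinWithAppender(5) == joinWithConcat(5));
}
```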

Kai Nacke:

> If you have more of those hints: please post!

There are plenty of them to find. But I think you need a strategy like this:
http://en.wikipedia.org/wiki/Brainstorming
to find them. D conferences should be good for this purpose.

Bye,
bearophile

On Thursday, 30 January 2014 at 18:36:09 UTC, bearophile wrote:
> A de-optimization I'd really like in LDC (and dmd) is to not call the run-time functions when you perform a vector operation on arrays statically known to be very short:
>
> void main() {
>     int[3] a, b;
>     a[] += b[];
> }
>
> And just rewrite that with a for loop (as done for cases where the run time function is not available), and let ldc2 compile it on its own.

For short loops, an unrolled version like

a[0] += b[0];
a[1] += b[1];
a[2] += b[2];

may well be faster than a simple loop like the following one:

foreach (immutable i; 0..3) {
    a[i] += b[i];
}

At least on x86/64. Will that optimization happen too, at a later stage?

Ivan Kazmenko.

Ivan Kazmenko:

> For short loops, an unrolled version like
> a[0] += b[0];
> a[1] += b[1];
> a[2] += b[2];
> may well be faster than a simple loop as the following one:
> foreach (immutable i; 0..3) {
>     a[i] += b[i];
> }
> At least on x86/64.

Yes, but ldc is plenty able to unroll small loops with length known at compile time.

Bye,
bearophile

On Thursday, 30 January 2014 at 18:36:09 UTC, bearophile wrote:
> A de-optimization I'd really like in LDC (and dmd) is to not call the run-time functions when you perform a vector operation on arrays statically known to be very short:

I just learned about this issue while looking at http://rosettacode.org/wiki/Hamming_numbers#D (Alternative version 2, opEquals).

It makes me sad to learn that slice / array operations get turned into monstrosities by the compiler. There are some huge wins to be had in this area.
