Reducing the cost of autodecoding
October 12, 2016
So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions also keep the instruction-cache impact low in larger applications.

Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:

=====
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
    import std.algorithm, std.array, std.stdio;
    char[] line = "0123456789".dup.repeat(50_000_000).join;
    ulong checksum;
    if (args.length == 1)
    {
        while (line.length) {
            version(autodecode)
            {
                checksum += line.myFront;
                line.myPopFront;
            }
            else
            {
                checksum += line[0];
                line = line[1 .. $];
            }
        }
        version(autodecode)
            writeln("autodecode ", checksum);
        else
            writeln("bytes ", checksum);
    }
    else
        writeln("overhead");
}
=====

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding.

Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.


Andrei
October 12, 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. [...]
>
> Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.

This will only work really efficiently with some state on the stack, if we are to support Unicode.
October 12, 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> [...]
>     char[] line = "0123456789".dup.repeat(50_000_000).join;

For fair test line should be feet into L1 cache. --Ilya
October 12, 2016
On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko wrote:
> On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
>> [...]
>
> For fair test line should be feet into L1 cache. --Ilya

EDIT: For a fair test the line should be in the L1 cache. --Ilya

October 12, 2016
On 10/12/2016 12:03 PM, Stefan Koch wrote:
> This will only work really efficiently with some state on the stack.

Remember the ASCII part is the bothersome one. There are only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
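One way to arrange that is a combined front+popFront with the ASCII test up front; a minimal sketch (not Phobos code — the name `frontThenPop` is hypothetical, and the slow path just falls back to std.utf):

```d
import std.utf : decode;

// Sketch: decode the front code point and advance `s` in place.
// The ASCII case costs one predictable compare plus one slice.
dchar frontThenPop(ref const(char)[] s)
{
    immutable char c = s[0];
    if (c < 0x80)               // ASCII fast path
    {
        s = s[1 .. $];
        return c;
    }
    size_t i = 0;               // multibyte: fall back to std.utf.decode
    immutable dchar d = decode(s, i);
    s = s[i .. $];
    return d;
}
```

In the benchmark loop this would replace the separate myFront/myPopFront calls, so the hot path does the same work as the raw-bytes version plus a single compare.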
October 12, 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> 
> On my machine, with "ldc2 -release -O3 -enable-inlining"

"-O3 -enable-inlining" is synonymous with "-O3"  :-)

With LDC 1.1.0-beta3, you can try with "-enable-cross-module-inlining". It won't cross-module inline everything (notably: nested functions), but it's a start. If you see large performance differences, let me know.

cheers,
  Johan

October 12, 2016
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>
> Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei

My measurements:
ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes    0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body.
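For reference, the hint idiom on LDC looks roughly like this (a sketch — `llvm_expect` comes from `ldc.intrinsics` and is LDC-specific; the actual pastebin code differs in detail):

```d
import ldc.intrinsics : llvm_expect;

// Tell the optimizer the ASCII test is almost always true, so the
// multibyte decode path gets laid out away from the hot loop.
bool likelyAscii(char c)
{
    return llvm_expect(c < 0x80, true);
}

// Hypothetical use in the benchmark loop:
//     if (likelyAscii(line[0])) { checksum += line[0]; line = line[1 .. $]; }
//     else { /* full UTF-8 decode */ }
```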
October 12, 2016
On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
> [...]
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body.

Where did you apply the branch hints?
October 12, 2016
On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:
>
> where did you apply the branch hints ?

Code: http://pastebin.com/CFCpUftW
October 12, 2016
On 10/12/2016 04:02 PM, safety0ff wrote:
> On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>> [...]
>
> My measurements:
> ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
> ldc version 1.0.0
>
> overhead 0.350s
> bytes    0.385s

Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?

> current autodecoding 0.915s (with new LUT popFront)
> copy-pasting std.utf decoding functions into current file 0.840s
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-Ascii code outside of the loop
> and creates a really tight loop body.

I think we should define two aliases "likely" and "unlikely" with default implementations:

bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into their intrinsics.
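For LDC the hook could look like this (a sketch — the `version (LDC)` split and the `pragma(inline, true)` are assumptions about how druntime would wire it up, not existing druntime code):

```d
version (LDC)
{
    import ldc.intrinsics : llvm_expect;

    // Forward the programmer's hint to LLVM's expect intrinsic.
    pragma(inline, true)
    bool likely(bool b) { return llvm_expect(b, true); }
    pragma(inline, true)
    bool unlikely(bool b) { return llvm_expect(b, false); }
}
else
{
    // Portable defaults: identity functions the optimizer sees through.
    bool likely(bool b) { return b; }
    bool unlikely(bool b) { return b; }
}
```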

Works?


Andrei