Reducing the cost of autodecoding
October 12, 2016
So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions also keep the instruction-cache impact low in larger applications.

Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:

=====
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
    import std.algorithm, std.array, std.stdio;
    char[] line = "0123456789".dup.repeat(50_000_000).join;
    ulong checksum;
    if (args.length == 1)
    {
        while (line.length) {
            version(autodecode)
            {
                checksum += line.myFront;
                line.myPopFront;
            }
            else
            {
                checksum += line[0];
                line = line[1 .. $];
            }
        }
        version(autodecode)
            writeln("autodecode ", checksum);
        else
            writeln("bytes ", checksum);
    }
    else
        writeln("overhead");
}
=====

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding.

Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.


Andrei
October 12, 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. [...]
>
> Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.

This will only work really efficiently with some state on the stack, if we are to support Unicode.
October 12, 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> [...]
>     char[] line = "0123456789".dup.repeat(50_000_000).join;

For fair test line should be feet into L1 cache. --Ilya
October 12, 2016
On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko wrote:
> On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
>> [...]
>
> For fair test line should be feet into L1 cache. --Ilya

EDIT: For a fair test the line should be in the L1 cache. --Ilya

October 12, 2016
On 10/12/2016 12:03 PM, Stefan Koch wrote:
> This will only work really efficiently with some state on the stack.

Remember the ASCII part is the bothersome one. There are only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
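One way to arrange that is a combined front+popFront with the ASCII test up front; a minimal sketch (not Phobos code — the name `frontThenPop` is hypothetical, and the slow path just falls back to std.utf):

```d
import std.utf : decode;

// Sketch: decode the front code point and advance `s` in place.
// The ASCII case costs one predictable compare plus one slice.
dchar frontThenPop(ref const(char)[] s)
{
    immutable char c = s[0];
    if (c < 0x80)               // ASCII fast path
    {
        s = s[1 .. $];
        return c;
    }
    size_t i = 0;               // multibyte: fall back to std.utf.decode
    immutable dchar d = decode(s, i);
    s = s[i .. $];
    return d;
}
```

In the benchmark loop this would replace the separate myFront/myPopFront calls, so the hot path does the same work as the raw-bytes version plus a single compare.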
October 12, 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> 
> On my machine, with "ldc2 -release -O3 -enable-inlining"

"-O3 -enable-inlining" is synonymous with "-O3"  :-)

With LDC 1.1.0-beta3, you can try with "-enable-cross-module-inlining". It won't cross-module inline everything (notably: nested functions), but it's a start. If you see large performance differences, let me know.

cheers,
  Johan

October 12, 2016
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>
> Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei

My measurements:
ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes    0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body.
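For reference, the hint idiom on LDC looks roughly like this (a sketch — `llvm_expect` comes from `ldc.intrinsics` and is LDC-specific; the actual pastebin code differs in detail):

```d
import ldc.intrinsics : llvm_expect;

// Tell the optimizer the ASCII test is almost always true, so the
// multibyte decode path gets laid out away from the hot loop.
bool likelyAscii(char c)
{
    return llvm_expect(c < 0x80, true);
}

// Hypothetical use in the benchmark loop:
//     if (likelyAscii(line[0])) { checksum += line[0]; line = line[1 .. $]; }
//     else { /* full UTF-8 decode */ }
```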
October 12, 2016
On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
> [...]
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body.

Where did you apply the branch hints?
October 12, 2016
On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:
>
> where did you apply the branch hints ?

Code: http://pastebin.com/CFCpUftW
October 12, 2016
On 10/12/2016 04:02 PM, safety0ff wrote:
> On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>> [...]
>
> My measurements:
> ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
> ldc version 1.0.0
>
> overhead 0.350s
> bytes    0.385s

Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?

> current autodecoding 0.915s (with new LUT popFront)
> copy-pasting std.utf decoding functions into current file 0.840s
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-Ascii code outside of the loop
> and creates a really tight loop body.

I think we should define two aliases "likely" and "unlikely" with default implementations:

bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into their intrinsics.
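For LDC the hook could look like this (a sketch — the `version (LDC)` split and the `pragma(inline, true)` are assumptions about how druntime would wire it up, not existing druntime code):

```d
version (LDC)
{
    import ldc.intrinsics : llvm_expect;

    // Forward the programmer's hint to LLVM's expect intrinsic.
    pragma(inline, true)
    bool likely(bool b) { return llvm_expect(b, true); }
    pragma(inline, true)
    bool unlikely(bool b) { return llvm_expect(b, false); }
}
else
{
    // Portable defaults: identity functions the optimizer sees through.
    bool likely(bool b) { return b; }
    bool unlikely(bool b) { return b; }
}
```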

Works?


Andrei