October 12, 2016 Reducing the cost of autodecoding
So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high.

Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:

=====
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
    import std.algorithm, std.array, std.stdio;
    char[] line = "0123456789".dup.repeat(50_000_000).join;
    ulong checksum;
    if (args.length == 1)
    {
        while (line.length) {
            version(autodecode)
            {
                checksum += line.myFront;
                line.myPopFront;
            }
            else
            {
                checksum += line[0];
                line = line[1 .. $];
            }
        }
        version(autodecode)
            writeln("autodecode ", checksum);
        else
            writeln("bytes ", checksum);
    }
    else
        writeln("overhead");
}
=====

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding.

Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.


Andrei
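One way to attack the gap is to fuse front and popFront into a single call that decodes a code point and advances past it, with the ASCII case handled by a single compare-and-branch. The C sketch below illustrates the idea (C rather than D so it stays self-contained; the name decode_front and its error handling are assumptions for illustration, not Phobos code — in particular it substitutes U+FFFD for invalid sequences without the overlong/surrogate validation std.utf performs):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fused front+popFront: decode one UTF-8 code point from
 * *p (remaining length *len) and advance past it.  The ASCII case is a
 * single compare; multi-byte sequences take the slow path.  Invalid
 * input yields U+FFFD (no overlong/surrogate checks in this sketch). */
static uint32_t decode_front(const unsigned char **p, size_t *len) {
    const unsigned char *s = *p;
    unsigned char c = s[0];
    if (c < 0x80) {                    /* ASCII fast path */
        *p += 1; *len -= 1;
        return c;
    }
    /* Slow path: sequence length and initial bits from the lead byte. */
    size_t n; uint32_t cp;
    if      ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; }
    else { *p += 1; *len -= 1; return 0xFFFD; }   /* bad lead byte */
    if (n > *len) { *p += *len; *len = 0; return 0xFFFD; }
    for (size_t i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80) {   /* bad continuation byte */
            *p += i; *len -= i; return 0xFFFD;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *p += n; *len -= n;
    return cp;
}
```

For ASCII input the per-character work is exactly the two predictable comparisons the benchmark's loop needs: the caller's length check and the c < 0x80 test.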
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high.
>
> Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:
>
> =====
> import std.range;
>
> alias myPopFront = std.range.popFront;
> alias myFront = std.range.front;
>
> void main(string[] args) {
> import std.algorithm, std.array, std.stdio;
> char[] line = "0123456789".dup.repeat(50_000_000).join;
> ulong checksum;
> if (args.length == 1)
> {
> while (line.length) {
> version(autodecode)
> {
> checksum += line.myFront;
> line.myPopFront;
> }
> else
> {
> checksum += line[0];
> line = line[1 .. $];
> }
> }
> version(autodecode)
> writeln("autodecode ", checksum);
> else
> writeln("bytes ", checksum);
> }
> else
> writeln("overhead");
> }
> =====
>
> On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding.
>
> Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.
>
>
> Andrei
This will only work really efficiently with some state on the stack, if we are to support Unicode.
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high.
>
> Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:
>
> =====
> import std.range;
>
> alias myPopFront = std.range.popFront;
> alias myFront = std.range.front;
>
> void main(string[] args) {
> import std.algorithm, std.array, std.stdio;
> char[] line = "0123456789".dup.repeat(50_000_000).join;
For a fair test the line should fit into the L1 cache. --Ilya
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Ilya Yaroshenko | On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko wrote:
> On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
>> So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high.
>>
>> Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:
>>
>> =====
>> import std.range;
>>
>> alias myPopFront = std.range.popFront;
>> alias myFront = std.range.front;
>>
>> void main(string[] args) {
>> import std.algorithm, std.array, std.stdio;
>> char[] line = "0123456789".dup.repeat(50_000_000).join;
>
> For fair test line should be feet into L1 cache. --Ilya
EDIT: For a fair test the line should be in the L1 cache. --Ilya
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefan Koch | On 10/12/2016 12:03 PM, Stefan Koch wrote:
> This will only work really efficiently with some state on the stack.
Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
>
> On my machine, with "ldc2 -release -O3 -enable-inlining"

"-O3 -enable-inlining" is synonymous with "-O3" :-)

With LDC 1.1.0-beta3, you can try with "-enable-cross-module-inlining". It won't cross-module inline everything (notably: nested functions), but it's a start. If you see large performance differences, let me know.

cheers,
  Johan
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>
> Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
My measurements:
ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0
overhead 0.350s
bytes 0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s
With the branch hints LDC moves the non-ASCII code outside of the loop and creates a really tight loop body.
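For readers outside LDC: the hint safety0ff applied via llvm_expect has a direct GCC/Clang analogue, __builtin_expect. A minimal C sketch of the benchmark's checksum loop with a hinted ASCII fast path (the function name and the cold path's byte handling are assumptions for illustration, not the pastebin code):

```c
#include <assert.h>
#include <stddef.h>

/* LIKELY/UNLIKELY wrap __builtin_expect, telling the compiler which
 * branch is hot so it can move the cold non-ASCII handling out of the
 * loop body — the same effect safety0ff observed with llvm_expect. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

static unsigned long checksum_ascii_hinted(const unsigned char *s, size_t len) {
    unsigned long sum = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        if (LIKELY(c < 0x80)) {       /* hot, predictable ASCII path */
            sum += c;
            ++i;
        } else {                      /* cold path */
            sum += c;                 /* count the lead byte */
            ++i;
            while (i < len && (s[i] & 0xC0) == 0x80)
                ++i;                  /* skip continuation bytes */
        }
    }
    return sum;
}
```

The hint does not change what the loop computes; it only biases code layout and block placement so the ASCII iteration stays tight.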
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to safety0ff | On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
> On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>>
>> Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
>
> My measurements:
> ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining
> ldc version 1.0.0
>
> overhead 0.350s
> bytes 0.385s
> current autodecoding 0.915s (with new LUT popFront)
> copy-pasting std.utf decoding functions into current file 0.840s
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-Ascii code outside of the loop and creates a really tight loop body.
Where did you apply the branch hints?
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefan Koch | On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:
>
> where did you apply the branch hints ?

Code: http://pastebin.com/CFCpUftW
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to safety0ff | On 10/12/2016 04:02 PM, safety0ff wrote:
> On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
>>
>> Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
>
> My measurements:
> ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining
> ldc version 1.0.0
>
> overhead 0.350s
> bytes 0.385s

Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?

> current autodecoding 0.915s (with new LUT popFront)
> copy-pasting std.utf decoding functions into current file 0.840s
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-Ascii code outside of the loop
> and creates a really tight loop body.

I think we should define two aliases "likely" and "unlikely" with default implementations:

bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into their intrinsics. Works?


Andrei
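Andrei's proposal — identity-function defaults that implementers can hook into their intrinsics — is a pattern that can be sketched in C as well (hedged sketch only; the actual druntime design would be the two D one-liners he gives). With GCC/Clang the hook is __builtin_expect; the fallback is a no-op:

```c
#include <assert.h>
#include <stdbool.h>

/* Portable likely/unlikely wrappers in the shape Andrei proposes:
 * the default implementation is the identity function, and compilers
 * with a branch-weight intrinsic hook into it transparently. */
#if defined(__GNUC__) || defined(__clang__)
static inline bool likely(bool b)   { return __builtin_expect(b, 1); }
static inline bool unlikely(bool b) { return __builtin_expect(b, 0); }
#else
static inline bool likely(bool b)   { return b; }  /* no-op fallback */
static inline bool unlikely(bool b) { return b; }
#endif
```

Either way the wrappers are semantically transparent — they return their argument unchanged — so call sites can adopt them without behavioral risk, exactly the property that makes the alias-with-default design safe to put in druntime.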
Copyright © 1999-2021 by the D Language Foundation