October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote:
>
> I think we should define two aliases "likely" and "unlikely" with default implementations:
>
> bool likely(bool b) { return b; }
> bool unlikely(bool b) { return b; }
>
> They'd go in druntime. Then implementers can hook them into their intrinsics.
>
> Works?
>
>
> Andrei
I was about to suggest the same.
I can prepare a PR.
October 13, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefan Koch | On Wednesday, 12 October 2016 at 23:59:15 UTC, Stefan Koch wrote:
> On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote:
>>
>> I think we should define two aliases "likely" and "unlikely" with default implementations:
>>
>> bool likely(bool b) { return b; }
>> bool unlikely(bool b) { return b; }
>>
>> They'd go in druntime. Then implementers can hook them into their intrinsics.
>>
>> Works?
>>
>>
>> Andrei
>
> I was about to suggest the same.
> I can prepare a PR.
We should probably introduce a new module for stuff like this.
object.d is already filled with too many unrelated things.
October 13, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote:
>
> Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?
>
It made little difference: LDC compiled it into AVX2 vectorized addition (vpmovzxbq & vpaddq).
October 13, 2016 Re: Reducing the cost of autodecoding
Posted in reply to safety0ff | On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
>
> It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.)
Measurements without -mcpu=native:

- overhead: 0.336s
- bytes: 0.610s
- without branch hints: 0.852s
- code pasted: 0.766s
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefan Koch | On 10/12/2016 08:11 PM, Stefan Koch wrote:
> We should probably introduce a new module for stuff like this.
> object.d is already filled with too much unrelated things.
Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to safety0ff | On 10/12/2016 08:41 PM, safety0ff wrote:
> On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
>>
>> It made little difference: LDC compiled into AVX2 vectorized addition
>> (vpmovzxbq & vpaddq.)
>
> Measurements without -mcpu=native:
> overhead 0.336s
> bytes 0.610s
> without branch hints 0.852s
> code pasted 0.766s
So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
October 13, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu wrote:
> On 10/12/2016 08:11 PM, Stefan Koch wrote:
>> We should probably introduce a new module for stuff like this.
>> object.d is already filled with too much unrelated things.
>
> Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
Maybe core.intrinsics?
Or core.codelayout?
We can control the layout at the object-file level.
We should be able to expose some of that functionality.
October 13, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote:
> On 10/12/2016 08:41 PM, safety0ff wrote:
>> On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
>>>
>>> It made little difference: LDC compiled into AVX2 vectorized addition
>>> (vpmovzxbq & vpaddq.)
>>
>> Measurements without -mcpu=native:
>> overhead 0.336s
>> bytes 0.610s
>> without branch hints 0.852s
>> code pasted 0.766s
>
> So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
AVX for ASCII?
What are you referring to?
Most text processing is terribly incompatible with SIMD.
SSE 4.2 has a few instructions that do help, but as far as I am aware it is not yet widespread.
October 12, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefan Koch | On 10/12/2016 09:35 PM, Stefan Koch wrote:
> On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote:
>> On 10/12/2016 08:41 PM, safety0ff wrote:
>>> On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
>>>>
>>>> It made little difference: LDC compiled into AVX2 vectorized addition
>>>> (vpmovzxbq & vpaddq.)
>>>
>>> Measurements without -mcpu=native:
>>> overhead 0.336s
>>> bytes 0.610s
>>> without branch hints 0.852s
>>> code pasted 0.766s
>>
>> So we should be able to reduce overhead by means of proper code
>> arrangement and interplay of inlining and outlining. The prize,
>> however, would be to get the AVX instructions for ASCII going. Is that
>> possible? -- Andrei
>
> AVX for ascii ?
> What are you referring to ?
> Most text processing is terribly incompatible with simd.
> sse 4.2 has a few instructions that do help, but as far as I am aware it
> is not yet too far spread.
Oh ok, so it's that checksum in particular that got optimized. Bad benchmark! Bad! -- Andrei
October 13, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu | On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu wrote:
> On 10/12/2016 08:11 PM, Stefan Koch wrote:
>> We should probably introduce a new module for stuff like this.
>> object.d is already filled with too much unrelated things.
>
> Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei

There could be some kind of "expect" theme, or a microoptimization theme: functions that have no observable effects and that provide hints for optimization (possibly with compiler-dependent implementations of those functions).

Besides providing the expected value of an expression, other "expect"/microopt functionality is checking explicitly for function pointer values (to inline a likely function), which I wrote about in [1]:

```
/// Calls `fptr(args)`, optimized for the case when fptr points to Likely().
pragma(inline, true)
auto is_likely(alias Likely, Fptr, Args...)(Fptr fptr, Args args)
{
    return (fptr == &Likely) ? Likely(args) : fptr(args);
}

// ...
void function() fptr = get_function_ptr();
fptr.is_likely!likely_function();
```

A similar function can be made for "expecting" a class type for virtual calls [1]. Other microopt thingies that come to mind are:

- cache prefetching
- function attributes for hot/cold functions

cheers,
  Johan

[1] https://johanengelen.github.io/ldc/2016/04/13/PGO-in-LDC-virtual-calls.html
Copyright © 1999-2021 by the D Language Foundation