November 12, 2016
On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
> On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
>>
>> There are three vfmadd231ss in the entire assembly, but none of them are in the inner loop. The presence of any FMA instructions at all does show that the compiler properly accepts the -mcpu switch, but it doesn't seem to recognize the opportunities present in the inner loop.
>
> Does the C++ need `__restrict__` for the parameters to get the assembly you want?

In this case, it doesn't seem to make any difference. It is habitual for me to use __restrict__ whenever possible in HPC code, but very often Clang/GCC are smart enough nowadays to make the inference regardless.

On that note, I was under the impression that D arrays included the no aliasing assumption. If that's not the case, is there a way to achieve the equivalent of __restrict__ in D?
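For reference, a minimal C++ sketch of what `__restrict__` promises the optimizer. The function `scale_add` and its parameters are hypothetical and are not the benchmark discussed in this thread; `__restrict__` itself is a GCC/Clang extension in C++, not standard C++.

```cpp
#include <cstddef>

// Hypothetical kernel for illustration only (not the thread's benchmark).
// __restrict__ tells the compiler that `out` and `in` never overlap, so it
// does not have to assume a store to out[i] could change a later in[j].
void scale_add(float* __restrict__ out, const float* __restrict__ in,
               float factor, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] += factor * in[i];  // multiply-add: a candidate for FMA
}
```

Calling it with overlapping buffers would break the promise and is undefined behaviour, which is why the qualifier has to be an explicit opt-in.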

>
>> The assembly generated by the godbolt service seems largely identical to the one I got on my local machine.
>
> It is easier for the discussion if you paste godbolt.org links btw, so we don't have to manually do it ourselves ;-)
>
> -Johan

Will do. :)

By the way, I posted that issue on GH: https://github.com/ldc-developers/ldc/issues/1874

November 12, 2016
On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
>>
>> Does the C++ need `__restrict__` for the parameters to get the assembly you want?
>
> In this case, it doesn't seem to make any difference.

That's good news, because there is currently no way to add that to LDC code, afaik.

Hope you can try to cut more of these things from the example so it's easier to figure out why things are different.  (e.g. is -Ofast needed, or is -O3 enough?)

Thanks!

cheers,
  Johan



November 12, 2016
On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen wrote:
> On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
>> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
>>>
>>> Does the C++ need `__restrict__` for the parameters to get the assembly you want?
>>
>> In this case, it doesn't seem to make any difference.
>
> That's good news, because there is currently no way to add that to LDC code, afaik.

I hope it's somewhere on the roadmap for the future, as it does still make a measurable difference in some cases.

>
> Hope you can try to cut more of these things from the example so it's easier to figure out why things are different.  (e.g. is -Ofast needed, or is -O3 enough?)
>
> Thanks!
>
> cheers,
>   Johan

-Ofast is also there out of habit, doesn't make a meaningful difference for a benchmark as simple as this. Other switches, like -fno-rtti, -fno-exceptions and even -flto can also be dropped, simply using -O3 -march=native -ffast-math is sufficient to outperform LDC by 2.5x, losing only about 10% from the best C++ performance and producing essentially the same unrolled FMA-enabled assembly with very minor changes.

November 12, 2016
On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
> On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen wrote:
>> On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
>>> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
>>>>
>>>> Does the C++ need `__restrict__` for the parameters to get the assembly you want?
>>>
>>> In this case, it doesn't seem to make any difference.
>>
>> That's good news, because there is currently no way to add that to LDC code, afaik.
>
> I hope it's somewhere on the roadmap for the future, as it does still make a measurable difference in some cases.

Can you file an issue for that too? (ideas in forum posts get lost instantly)
Make sure you add a testcase (as small as possible) that shows a clear difference in C++ codegen with and without `__restrict__`, and the worse codegen for the equivalent D code, which cannot use it.
It may be relatively easy to implement it in LDC, but I don't think many people know the intricacies of C's restrict. Examples of the effect it has on assembly (clang C++) help a lot towards getting it implemented.

> -Ofast is also there out of habit, doesn't make a meaningful difference for a benchmark as simple as this. Other switches, like -fno-rtti, -fno-exceptions and even -flto can also be dropped, simply using -O3 -march=native -ffast-math is sufficient to outperform LDC by 2.5x, losing only about 10% from the best C++ performance and producing essentially the same unrolled FMA-enabled assembly with very minor changes.

OK great.
I think you ran into a compiler limitation somehow, so make sure you submit the simplified example/testcase on GH ! ;)
(the simpler you can make it, the better)

Btw, for benchmarking, you should mark the `compute_neq` function as "weak linkage", such that the compiler is not going to do inter-procedural optimization for the call to `compute_neq` in `main`. (@weak for LDC, clang probably something like __attribute__((weak)))
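A rough sketch of the Clang/GCC side of that suggestion, assuming an ELF or Mach-O target. The function `sum_of_products` is a hypothetical stand-in, since the real signature of `compute_neq` is not shown in this thread.

```cpp
#include <cstddef>

// Hypothetical stand-in for the benchmarked function.
// A weak definition may be replaced by another definition at link time, so
// the optimizer should not inline it into callers or specialize it for the
// arguments used at one particular call site.
__attribute__((weak))
float sum_of_products(const float* a, const float* b, std::size_t n)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}
```

The loop body itself is still optimized as usual; only the interaction with the caller is blocked, which is the point for a benchmark.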

November 12, 2016
On Saturday, 12 November 2016 at 12:11:35 UTC, Johan Engelen wrote:
> On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
>> On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen wrote:
>>> On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
>>>> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
>>>>>
>>>>> Does the C++ need `__restrict__` for the parameters to get the assembly you want?
>>>>
>>>> In this case, it doesn't seem to make any difference.
>>>
>>> That's good news, because there is currently no way to add that to LDC code, afaik.
>>
>> I hope it's somewhere on the roadmap for the future, as it does still make a measurable difference in some cases.
>
> Can you file an issue for that too? (ideas in forum posts get lost instantly)
> Make sure you add a testcase (as small as possible) that shows a clear difference in C++ codegen with and without `__restrict__`, and the worse codegen for the equivalent D code, which cannot use it.
> It may be relatively easy to implement it in LDC, but I don't think many people know the intricacies of C's restrict. Examples of the effect it has on assembly (clang C++) help a lot towards getting it implemented.
>
>> -Ofast is also there out of habit, doesn't make a meaningful difference for a benchmark as simple as this. Other switches, like -fno-rtti, -fno-exceptions and even -flto can also be dropped, simply using -O3 -march=native -ffast-math is sufficient to outperform LDC by 2.5x, losing only about 10% from the best C++ performance and producing essentially the same unrolled FMA-enabled assembly with very minor changes.
>
> OK great.
> I think you ran into a compiler limitation somehow, so make sure you submit the simplified example/testcase on GH ! ;)
> (the simpler you can make it, the better)
>
> Btw, for benchmarking, you should mark the `compute_neq` function as "weak linkage", such that the compiler is not going to do inter-procedural optimization for the call to `compute_neq` in `main`. (@weak for LDC, clang probably something like __attribute__((weak)))

Okay, I'll clean up the code and post an issue on GH later today, hopefully someone can figure out where the discrepancy comes from.

I'll also file a separate issue / feature request for restrict afterwards, once I write up a representative test case that highlights the performance impact.

Thanks for your help! The ability to get quick responses on compiler issues like this is really encouraging me to write more high performance code in D.
November 12, 2016
Okay, so I've done some further experimentation with rather peculiar results. On the bright side, I'm now fairly sure this isn't an outright bug in the compiler. On the flip side, however, I'm quite confused about the results.

For the record, here are the current versions of the benchmark in godbolt:
D:   https://godbolt.org/g/B8gosP
C++: https://godbolt.org/g/DWjQrV

Apparently, LDC can be coaxed to use FMA instructions after all. It seems that with __attribute__((__weak__)) Clang produces code that is essentially identical to the D binary; both run in about 19ms on my machine. When I remove __attribute__((__weak__)) and make the compute_neq function static void rather than simply void, Clang further unrolls the inner loop and uses a number of optimized load/store instructions that increase the performance by a huge margin - down to about 7ms. As for LDC, adding/removing @weak and static also has a major impact on the generated code and therefore the performance.

I have not found any way to make LDC perform the same optimizations as Clang's best case (simply static void, no weak attribute) and have run out of ideas. Furthermore, I have no idea why the aforementioned changes in the function declaration affect both optimizers in this way, or whether finer control over vectorization/loop unrolling is possible in LDC. Any thoughts?
November 12, 2016
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
>
> I have not found any way to make LDC perform the same optimizations as Clang's best case (simply static void, no weak attribute) and have run out of ideas. Furthermore, I have no idea why the aforementioned changes in the function declaration affect both optimizers in this way, or whether finer control over vectorization/loop unrolling is possible in LDC. Any thoughts?

I think that perhaps when inlining the fastmath function, some optimization attributes are lost somehow and the inlined code is not optimized as much (you'd have to specify @fastmath on main too).

It'd be easier to compare with -ffast-math I guess ;-)

A look at the generated LLVM IR may provide some clues.
November 12, 2016
On Saturday, 12 November 2016 at 16:29:20 UTC, Johan Engelen wrote:
> On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
>>
>> I have not found any way to make LDC perform the same optimizations as Clang's best case (simply static void, no weak attribute) and have run out of ideas. Furthermore, I have no idea why the aforementioned changes in the function declaration affect both optimizers in this way, or whether finer control over vectorization/loop unrolling is possible in LDC. Any thoughts?
>
> I think that perhaps when inlining the fastmath function, some optimization attributes are lost somehow and the inlined code is not optimized as much (you'd have to specify @fastmath on main too).
>
> It'd be easier to compare with -ffast-math I guess ;-)
>
> A look at the generated LLVM IR may provide some clues.

I tried putting @fastmath on main as well, it makes no difference whatsoever (identical generated assembly). Apart from the weirdness with weak/static making way more difference than I would intuitively expect, it seems the major factor preventing performance parity with Clang is the conservative loop optimizations. Is there a way, similar to #pragma unroll in Clang, to tell LDC to try to unroll the inner loop?
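To make the question concrete, this is roughly what the Clang-side hint looks like. It is a sketch only: `dot_unrolled` is a hypothetical function, the factor 4 is arbitrary, and compilers other than Clang may ignore the pragma or warn about it.

```cpp
#include <cstddef>

// Hypothetical reduction loop, not the thread's benchmark.
float dot_unrolled(const float* a, const float* b, std::size_t n)
{
    float acc = 0.0f;
    // Clang loop hint: ask the unroller to unroll this loop by a factor of 4.
    // It is a request, not a guarantee, and it does not change the result.
#pragma unroll 4
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}
```

Without -ffast-math the compiler has to keep the additions in source order, so unrolling a floating-point reduction like this one generally pays off only when reassociation is allowed.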
November 12, 2016
On Saturday, 12 November 2016 at 16:40:27 UTC, deXtoRious wrote:
> 
> I tried putting @fastmath on main as well, it makes no difference whatsoever (identical generated assembly).

Yeah I saw it too. It's a bit strange.

> Apart from the weirdness with weak/static making way more difference than I would intuitively expect,

I am also surprised but: adding `static` in C++ makes it a fully private function, which does not need to be emitted as such (and isn't in your case, because it is fully inlined).
I added `pragma(inline, true)` to the D function, hoping to get a similar effect.

> it seems the major factor preventing performance parity with Clang is the conservative loop optimizations. Is there a way, similar to #pragma unroll in Clang, to tell LDC to try to unroll the inner loop?

There isn't at the moment. We need a mechanism to tag statements with such metadata. In LLVM IR, this is what you'd want: http://llvm.org/docs/LangRef.html#llvm-loop
I am not enough of a D expert to come up with a good way to do this. Perhaps David can help come up with a solution?
Good stuff for another Github issue! ;-)

November 14, 2016
On Saturday, 12 November 2016 at 18:55:19 UTC, Johan Engelen wrote:
> I am not enough of a D expert to come up with a good way to do this.

Spec says pragma can be applied to statements: https://dlang.org/spec/pragma.html