August 19, 2015
On Wednesday, 19 August 2015 at 09:55:19 UTC, Dmitry Olshansky wrote:
> On 19-Aug-2015 12:46, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
>> On Wednesday, 19 August 2015 at 09:29:31 UTC, Dmitry Olshansky wrote:
>>> I do not. I underestimate the benefits of tons of subtle passes that
>>> buy 0.1-0.2% in some cases. There are lots and lots of these in
>>> GCC/LLVM. If having the best generated code out there is not the goal,
>>> we can safely omit most of these and focus on the most critical bits.
>>
>> Well, you can start on this now, but by the time it is ready and
>> hardened, LLVM might have received improved AVX2 and AVX-512 code gen
>> from Intel. Which basically will leave DMD in the dust.
>>
>
> On numerics, video codecs and the like. It's not like compilers solely depend on AVX.

Even in video codecs, AVX2 is not that useful and barely brings a 10% improvement over SSE, and that is while being extra careful with the SSE-AVX transition penalty. And to reap even this benefit you have to write intrinsics/assembly.
For AVX-512 I can't even imagine what to use such large registers for. Larger registers => more spilling because of calling conventions, and more fiddling around with complicated shuffle instructions. There are steeply diminishing returns with increasing register size.
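
For reference, the SSE/AVX split shows up directly in D's core.simd vector widths. A minimal sketch (hypothetical function names; assumes a compiler and CPU with AVX, e.g. dmd -mcpu=avx):

import core.simd;

// 128-bit SSE lanes: available on any x86-64 target.
float4 mulAdd128(float4 a, float4 b)
{
    return a * b + b;
}

// 256-bit AVX lanes: only compiled in when the target has AVX. Carelessly
// mixing 128-bit and 256-bit code is what triggers the transition penalty.
version (D_AVX)
float8 mulAdd256(float8 a, float8 b)
{
    return a * b + b;
}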
August 19, 2015
On Wednesday, 19 August 2015 at 09:55:19 UTC, Dmitry Olshansky wrote:
> On 19-Aug-2015 12:46, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
>> Well, you can start on this now, but by the time it is ready and
>> hardened, LLVM might have received improved AVX2 and AVX-512 code gen
>> from Intel. Which basically will leave DMD in the dust.
>>
>
> On numerics, video codecs and the like. It's not like compilers solely depend on AVX.

Compilers themselves are mostly scalar code, but they are also just one of the benchmarks compilers are evaluated by.

DMD could use multiple backends, use its own performance estimator (run on the generated code) and pick the best output from each backend.

D could leverage increased register sizes for parameter transfer between non-C-callable functions. Just that alone could be beneficial. Clearly having 256/512-bit-wide registers matters. And you need to coordinate how the packing is done so you don't have to shuffle.

Lots of options in there, but you need to be different from LLVM. You can't just take an old SSA and improve on it.

Another option is to take the C++ to D converter used for building DDMD and see if it can be extended to work on LLVM.


August 19, 2015
On Wednesday, 19 August 2015 at 10:08:48 UTC, ponce wrote:
> Even in video codecs, AVX2 is not that useful and barely brings a 10% improvement over SSE, and that is while being extra careful with the SSE-AVX transition penalty. And to reap even this benefit you have to write intrinsics/assembly.

Masked-off lanes of AVX masked instructions are effectively turned into NOPs, so you can remove conditionals from inner loops. And the performance of new instructions tends to improve generation by generation.
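
A toy illustration of the point, in scalar D (hypothetical function names; the branchless form is what a per-lane mask/blend instruction applies across a whole vector at once):

// Branchy version: each iteration has a data-dependent conditional,
// which blocks straightforward vectorization.
void addPositivesBranchy(const(float)[] a, float[] b)
{
    foreach (i; 0 .. a.length)
        if (a[i] > 0)
            b[i] += a[i];
}

// Branchless version: the conditional becomes a select, which maps
// directly onto a masked/blended SIMD operation per lane.
void addPositivesBranchless(const(float)[] a, float[] b)
{
    foreach (i; 0 .. a.length)
        b[i] += (a[i] > 0) ? a[i] : 0.0f;
}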

> For AVX-512 I can't even imagine what to use such large registers for. Larger registers => more spilling because of calling conventions, and more fiddling around with complicated shuffle instructions. There are steeply diminishing returns with increasing register size.

You have to plan your data layout, which is why libraries should target it, so end users don't have to think too much about it. If your computations are trivial, then you are essentially limited by memory I/O. SoA processing isn't really limited by shuffling: stuff like mapping a pure function over a collection of arrays.
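
A minimal sketch of the SoA idea with core.simd (hypothetical names; assumes 16-byte-aligned data whose length is a multiple of 4, to keep it short). An AVX/AVX-512 version would be the same loop with wider lanes; the hard part is getting the data into this layout in the first place, which is exactly the library's job:

import core.simd;

// Map x -> a*x + b over a float array, four lanes at a time.
// No shuffling needed: the data is already laid out in lanes.
void axpb(float[] xs, float a, float b)
{
    assert(xs.length % 4 == 0);  // simplifying assumption
    float4 va = a;               // broadcast the scalars to all lanes
    float4 vb = b;
    foreach (i; 0 .. xs.length / 4)
    {
        auto p = cast(float4*) &xs[i * 4];  // assumes 16-byte alignment
        *p = *p * va + vb;
    }
}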

August 19, 2015
On Wednesday, 19 August 2015 at 10:16:18 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 19 August 2015 at 10:08:48 UTC, ponce wrote:
>> Even in video codecs, AVX2 is not that useful and barely brings a 10% improvement over SSE, and that is while being extra careful with the SSE-AVX transition penalty. And to reap even this benefit you have to write intrinsics/assembly.
>
> Masked-off lanes of AVX masked instructions are effectively turned into NOPs, so you can remove conditionals from inner loops. And the performance of new instructions tends to improve generation by generation.

Loops in video coding already have no conditionals. And for the ones that do, the conditionals were already removable with existing instructions.

>> For AVX-512 I can't even imagine what to use such large registers for. Larger registers => more spilling because of calling conventions, and more fiddling around with complicated shuffle instructions. There are steeply diminishing returns with increasing register size.
>
> You have to plan your data layout, which is why libraries should target it, so end users don't have to think too much about it. If your computations are trivial, then you are essentially limited by memory I/O. SoA processing isn't really limited by shuffling: stuff like mapping a pure function over a collection of arrays.

I stand by what I know and have measured: precious few things are sped up by AVX-xxx. It is almost always better to invest that time optimizing somewhere else.
August 19, 2015
On 19-Aug-2015 13:09, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Wednesday, 19 August 2015 at 09:55:19 UTC, Dmitry Olshansky wrote:
>> On 19-Aug-2015 12:46, "Ola Fosheim Grøstad"
>> <ola.fosheim.grostad+dlang@gmail.com> wrote:
>>> Well, you can start on this now, but by the time it is ready and
>>> hardened, LLVM might have received improved AVX2 and AVX-512 code gen
>>> from Intel. Which basically will leave DMD in the dust.
>>>
>>
>> On numerics, video codecs and the like. It's not like compilers solely
>> depend on AVX.
>
> Compilers themselves are mostly scalar code, but they are also just one
> of the benchmarks compilers are evaluated by.
>
> DMD could use multiple backends, use its own performance estimator (run
> on the generated code) and pick the best output from each backend.
>

What goal does this meet? As I said, it's apparent that folks like DMD for fast compile times, not for inhumanly good codegen.

> D could leverage increased register sizes for parameter transfer between
> non-C-callable functions. Just that alone could be beneficial. Clearly
> having 256/512-bit-wide registers matters.

Loading/unloading via shuffling, or a round trip through the stack, is going to murder that, though.

> And you need to coordinate
> how the packing is done so you don't have to shuffle.
>

Given how flexible the current data types are, I can hardly see it implemented in a sane way, not to mention the benefits could be rather slim. Lastly - why haven't the "omnipotent" (per this thread) LLVM/GCC folks implemented it yet?

> Lots of options in there, but you need to be different from LLVM. You
> can't just take an old SSA and improve on it.

For a slight gain? Again, the goal of maximizing the gains of vector ops is hardly interesting IMO.



-- 
Dmitry Olshansky
August 19, 2015
On Wednesday, 19 August 2015 at 10:25:14 UTC, ponce wrote:
> Loops in video coding already have no conditionals. And for the ones that do, the conditionals were already removable with existing instructions.

I think you are side-stepping the issue. Most people don't write video codecs. Most people also don't want to hand-optimize their inner loops. The typical and most likely scenario is to run some easy-to-read-but-suboptimal function over a dataset. You need both library and compiler support for that to work out.

But even then: a 10% difference in CPU benchmarks is a disaster.
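
For a concrete (made-up) example of such easy-to-read code: the per-element function below is pure and trivially vectorizable in principle, but whether it actually becomes SIMD code is entirely up to the compiler and library:

import std.algorithm : map;
import std.array : array;
import std.range : zip;
import std.stdio : writeln;

// The readable formulation: a pure scalar function mapped over a dataset.
float luma(float r, float g, float b) pure
{
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

void main()
{
    auto rs = [0.1f, 0.5f, 0.9f];
    auto gs = [0.2f, 0.4f, 0.8f];
    auto bs = [0.3f, 0.6f, 0.7f];
    writeln(zip(rs, gs, bs).map!(t => luma(t[0], t[1], t[2])).array);
}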

> I stand by what I know and have measured: precious few things are sped up by AVX-xxx. It is almost always better to invest that time optimizing somewhere else.

AVX-512 is too far into the future, but if you are going to write a backend you have to think about increasing register sizes. A register size increase does not by itself mean that throughput increases in the generation where it was introduced (a wide instruction can be translated into several micro-ops).

But if you start redesigning your backend now, then maybe you'll have something good in 5 years, so you need to plan ahead: not for the current generation, but 1-3 generations out.

Keep in mind that clock speeds are unlikely to increase, but stacking memory on top of the CPU and improved memory bus speeds are quite likely scenarios.

A good use for the DMD backend would be to improve and redesign it for compile time evaluation. Then use LLVM for codegen.

August 19, 2015
On Wednesday, 19 August 2015 at 10:33:40 UTC, Dmitry Olshansky wrote:
> Given how flexible the current data types are, I can hardly see it implemented in a sane way, not to mention the benefits could be rather slim. Lastly - why haven't the "omnipotent" (per this thread) LLVM/GCC folks implemented it yet?

They are stuck on C semantics, and so are their optimizers. But LLVM has other calling conventions for Haskell and other languages.

I believe Pony is going to use register passing internally and the C ABI externally, via LLVM.

> For a slight gain? Again, the goal of maximizing the gains of vector ops is hardly interesting IMO.

Well… I can't argue with what you find interesting. Memory throughput and pipeline bubbles are the key bottlenecks these days.

But I agree that the key point should be compilation speed / debugging. In terms of PR, it would be better to say that DMD is for making fast debug builds than to say it has a subpar optimizer.

August 19, 2015
On Tuesday, 18 August 2015 at 12:58:45 UTC, Dicebot wrote:
> On Tuesday, 18 August 2015 at 12:37:37 UTC, Vladimir Panteleev wrote:
>> I think stability of the DMD backend is a goal of much higher value than the performance of the code it emits. DMD is never going to match the code generation quality of LLVM and GCC, which have had many, many man-years invested in them. Working on DMD optimizations essentially duplicates this work, and IMHO it's not only a waste of time, but harmful to D because of the risk of regressions.
>
> +1

+1
August 19, 2015
On Wednesday, 19 August 2015 at 10:50:24 UTC, Ola Fosheim Grøstad wrote:
> Well… I can't argue with what you find interesting. Memory throughput and pipeline bubbles are the key bottlenecks these days.

And just to stress this point: if your code spends 50% of its time waiting for memory and is 25% slower overall than the competitor, then its compute portion might actually be 50% slower than the competitor's, assuming both are memory optimal.

So it's not like you just have to make your code a little bit faster, you have to make it twice as fast.
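
To put numbers on it (an invented but representative case): say the competitor finishes in 100 time units, 50 of them memory stalls. If you are 25% slower overall you take 125 units, and with the same 50 units of stalls your compute portion is 75 units against their 50, i.e. 50% slower in exactly the part the optimizer can influence.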

The only way to go past that is to have a very intelligent optimizer that can remove memory bottlenecks. Then you need the much more advanced cache/SIMD-oriented optimizer, and probably also changed language semantics so that memory layout can be reordered.

August 19, 2015
On 18-Aug-2015 15:37, Vladimir Panteleev wrote:
> I think stability of the DMD backend is a goal of much higher value than
> the performance of the code it emits. DMD is never going to match the
> code generation quality of LLVM and GCC, which have had many, many
> man-years invested in them. Working on DMD optimizations essentially
> duplicates this work, and IMHO it's not only a waste of time, but
> harmful to D because of the risk of regressions.

How about stress-testing with some simple fuzzer (a sketch follows below):
1. Generate a sequence of plausible expressions/functions.
2. Spit out the results via printf.
3. Permute -O and -inline and compare the outputs.
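
Something along these lines, perhaps (an untested sketch: the expression generator and file names are made up, and a real fuzzer would also want to check compiler exit codes, vary the expression shapes, and clean up after itself):

import std.file : write;
import std.format : format;
import std.process : execute;
import std.random : Random, uniform, unpredictableSeed;
import std.stdio : writeln;

// Generate a small random integer expression over a runtime variable x,
// so the frontend cannot constant-fold the whole thing away.
string randomExpr(ref Random rng, int depth = 0)
{
    if (depth > 3 || uniform(0, 3, rng) == 0)
        return uniform(0, 2, rng) ? "x" : format("%d", uniform(1, 100, rng));
    static immutable ops = ["+", "-", "*", "&", "|", "^"];
    return format("(%s %s %s)",
        randomExpr(rng, depth + 1),
        ops[uniform(0, ops.length, rng)],
        randomExpr(rng, depth + 1));
}

void main()
{
    auto rng = Random(unpredictableSeed);
    foreach (i; 0 .. 100)
    {
        write("case.d", format(
            "import core.stdc.stdio;\n"
            ~ "int f(int x) { return %s; }\n"
            ~ "void main(string[] a) { printf(\"%%d\\n\", f(cast(int) a.length)); }\n",
            randomExpr(rng)));

        string run(string[] extra)
        {
            execute(["dmd", "-ofcase", "case.d"] ~ extra);
            return execute(["./case"]).output;
        }

        if (run([]) != run(["-O", "-inline"]))
            writeln("output mismatch on iteration ", i);
    }
}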

-- 
Dmitry Olshansky