[4Walter&Andrei] D is 40 times slower. We need a new language feature!
May 20
Hello,

When users write math code, they expect [2, 3, 4] that code like

------
import mir.ndslice; //[1]

...

foreach (i; 0..m)
{
   foreach (j; 0..n)
   {
       // use matrix1[i, j], matrix2[i, j], matrix3[i, j]
   }
}
------

will be vectorized, as it would be in Fortran and other math languages.

There are different kinds of engineers, and a good math engineer may not have an engineering education. They may be economists, physicists, geologists, etc. They may not be programmers, or they may not be D programmers. They do not want to learn special loop-free programming idioms. They just want their code to be fast with ordinary loops.

In D the example above cannot be vectorized.

The reason is that `matrixX[i, j]` is an opIndex call, and opIndex is a function. It can be inlined, but optimizers cannot split its body and move half of the opIndex computations out of the inner loop, which is required for vectorization.

Optimizers would need to produce:

------------
foreach (i; 0..m)
{
   auto __tempV1 = matrix1[i];
   auto __tempV2 = matrix2[i];
   auto __tempV3 = matrix3[i];
   foreach (j; 0..n)
   {
       // use __tempV1[j], __tempV2[j], __tempV3[j]
   }
}
------------

As was said, optimizers cannot split the opIndex body because it is a function (inlined or not inlined does not matter).
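To make the problem concrete, here is a minimal sketch of such a 2D opIndex (a hypothetical `Matrix` struct, not mir.ndslice's actual API): the `stride * i` part is invariant in the inner loop, but it is fused with the `+ j` part inside a single function body.

```d
/// Hypothetical minimal 2D matrix; mir.ndslice's real API is more complex.
struct Matrix
{
    double* ptr;
    size_t stride; // number of elements per row

    /// The whole address computation is one function body:
    /// `stride * i` is invariant in the inner loop, but the
    /// optimizer would have to split the body to hoist it out.
    ref double opIndex(size_t i, size_t j)
    {
        return ptr[stride * i + j];
    }
}
```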

Walter, Andrei, and the D community: please help to make D simple for the math world!

I do not know what language changes we should add. I only know how it should look to the compiler:

------
import mir.ndslice;
...

foreach (i; 0..m)
{
   foreach (j; 0..n)
   {
       // matrixX[i, j] should be transformed to

       // matrix.ptr[matrix.stride * i + j]

      // this is a simplified version; ndslice has a more complex API
   }
}
------

It looks like Rust's macro system can do something similar.

What can I do to make it happen?

Ilya

[1] https://github.com/libmir/mir-algorithm
[2] https://gist.github.com/dextorious/d987865a7da147645ae34cc17a87729d
[3] https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18
[4] https://gitter.im/libmir/public

May 20
On 20/05/2017 4:24 AM, 9il wrote:
snip
> Looks like Rust macro system can do something similar.

Just to confirm, in Rust is it calling a function to assign the value?

E.g.
```D
void opIndexAssign(T v, size_t j, size_t i);
```

Because if the compiler is seeing a function call, it probably won't attempt this optimization, but if it's seeing a direct-to-memory write, that's a different story.

Also the compiler doesn't know if ``foo[i][j]`` is equal to ``foo[i, j]``. So your proposed solution isn't valid.
May 20
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> Hello,
>
> When users write math code, they expect [2, 3, 4] that code like
>
> [...]

I assume you compiled with LDC and used pragma(inline, true). Have you had a chance to look at what `-fsave-optimization-record` gives you? (This was added to LDC master 3 days ago.) That may give some insight as to why it was not vectorised, e.g. bounds checks not removed.

I don't think we need a new language feature for this.
May 20
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> What can I do to make it happen?

Sounds like you're asking for opIndex currying?

https://en.wikipedia.org/wiki/Currying

Have you tried implementing opIndex as a function which takes a single argument, and returns an object which then also implements opIndex with a single argument? You would probably need to write matrix[2][4] instead of matrix[2, 4], but that doesn't look hard to fix as well.
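Such a curried opIndex might look like this (hypothetical types sketching the idea, not mir's implementation):

```d
/// Hypothetical row view: opIndex(j) does only the inner-loop work.
struct Row
{
    double* ptr; // already offset to the start of row i
    ref double opIndex(size_t j) { return ptr[j]; }
}

/// Hypothetical matrix: opIndex(i) does the per-row work.
struct Matrix
{
    double* ptr;
    size_t stride;
    // The stride multiplication happens once per row, outside the
    // inner loop, so nothing loop-invariant is left in Row.opIndex.
    Row opIndex(size_t i) { return Row(ptr + stride * i); }
}
```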

> As was said, optimizers cannot split the opIndex body because it is a function (inlined or not inlined does not matter).

Have you tried splitting the opIndex implementation into two functions, one with just the code that should always be inlined, and one with the rest of the code that doesn't necessarily have to be inlined?

How about pragma(inline), does that help?

May 20
On Saturday, 20 May 2017 at 03:53:19 UTC, Nicholas Wilson wrote:
> On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
>> Hello,
>>
>> When users write math code, they expect [2, 3, 4] that code like
>>
>> [...]
>
> I assume you compiled with LDC and used pragma(inline, true). Have you had a chance to look at what `-fsave-optimization-record` gives you? (This was added to LDC master 3 days ago.) That may give some insight as to why it was not vectorised, e.g. bounds checks not removed.
>
> I don't think we need a new language feature for this.

No. It is exactly because it is a single function for both the column and row indexes.
May 20
On Saturday, 20 May 2017 at 03:50:31 UTC, rikki cattermole wrote:
> On 20/05/2017 4:24 AM, 9il wrote:
> snip
>> Looks like Rust macro system can do something similar.
>
> Just to confirm, in Rust is it calling a function to assign the value?

I mean that Rust has a macro system. I do not know whether it can be used for indexing.

> E.g.
> ```D
> void opIndexAssign(T v, size_t j, size_t i);
> ```
>
> Because if the compiler is seeing a function call, it probably won't attempt this optimization, but if its seeing a direct to memory write, that's a different story.
>
> Also the compiler doesn't know if ``foo[i][j]`` is equal to ``foo[i, j]``. So your proposed solution isn't valid.

The proposed solution is similar to mixins. The compiler does not need to know anything. The compiler just needs to replace `foo[i, j]` with its body, as if it were a mixin. All loop optimization would be done by the optimizer. See Clang's loop optimizations for more details.
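A sketch of what that mixin-style expansion could look like today, with a hypothetical helper string (not a real mir API):

```d
/// Hypothetical matrix type for illustration.
struct Matrix
{
    double* ptr;
    size_t stride;
}

/// The index expression is spliced into the call site as text, so the
/// optimizer sees plain pointer arithmetic instead of an opIndex call.
enum indexExpr = q{ matrix.ptr[matrix.stride * i + j] };

void scale(Matrix matrix, size_t m, size_t n, double factor)
{
    foreach (i; 0 .. m)
        foreach (j; 0 .. n)
            mixin(indexExpr) *= factor; // expands to matrix.ptr[...]
}
```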

May 20
On Saturday, 20 May 2017 at 03:53:42 UTC, Vladimir Panteleev wrote:
> On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
>> What can I do to make it happen?
>
> Sounds like you're asking for opIndex currying?
>
> https://en.wikipedia.org/wiki/Currying
>
> Have you tried implementing opIndex as a function which takes a single argument, and returns an object which then also implements opIndex with a single argument? You would probably need to write matrix[2][4] instead of matrix[2, 4], but that doesn't look hard to fix as well.

Yes, matrix[i][j] allows vectorization. This is already implemented.
At the same time, users prefer the [i, j] syntax, so it would have to be deprecated :-/

>> As was said, optimizers cannot split the opIndex body because it is a function (inlined or not inlined does not matter).
>
> Have you tried splitting the opIndex implementation into two functions, one with just the code that should always be inlined, and one with the rest of the code that doesn't necessarily have to be inlined?

ditto

> How about pragma(inline), does that help?

No
May 20
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> The reason is that `matrixX[i, j]` is opIndex call, opIndex is a function. It can be inlined. But optimizers can not split its body and move half of opIndex computations out of the inner loop, which it required for vectorization.

Hmm, it looks like the new LLVM solves this issue. Need to do more benchmarks...
May 20
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> Hello,
>
> When users write math code, they expect [2, 3, 4] that code like
>
> [...]

What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes.

Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?
May 20
On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
> On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
>> Hello,
>>
>> When users write math code, they expect [2, 3, 4] that code like
>>
>> [...]
>
> What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes.
>
> Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?

I just found that the new LLVM solves this issue (and was very surprised).
The reason ndslice <= v0.6.1 was so slow is LDC issue 2121 [1].
I have added a workaround in [2]; it is released as v0.6.2.

[1] https://github.com/ldc-developers/ldc/issues/2121
[2] https://github.com/libmir/mir-algorithm/pull/41