June 14, 2015
On Sunday, 14 June 2015 at 14:46:36 UTC, Ola Fosheim Grøstad wrote:
> Yes, C++ templates are a hard nut to crack. If D added excellent pattern matching to its metaprogramming repertoire, then I think that would be enough to put D in a different league.
>

https://github.com/solodon4/Mach7
June 14, 2015
> A naive basic matrix library is simple to write; I don't need standard library support for that, and I get it to work the way I want by using SIMD registers directly... so I probably would not use it if I could implement it myself in less than 10 hours.

A naive std.algorithm and std.range are easy to write too.
June 14, 2015
On Sunday, 14 June 2015 at 15:15:38 UTC, Ilya Yaroshenko wrote:
>
>> A naive basic matrix library is simple to write; I don't need standard library support for that, and I get it to work the way I want by using SIMD registers directly... so I probably would not use it if I could implement it myself in less than 10 hours.
>
> A naive std.algorithm and std.range are easy to write too.

I wouldn't know. People have different needs. Built-in foreach loops, threads, and SIMD support are more important to me than iterators (ranges).

But the problem with linear algebra is that you might want to do SIMD-optimized versions where you calculate 4 equations at a time, do reshuffling, etc. So a library solution has to provide substantial benefits.
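
For illustration, a minimal sketch of that kind of 4-wide arithmetic using core.simd (assuming a compiler and target that support float4, e.g. DMD/LDC on x86_64; the function name and alignment requirement below are just mine):

import core.simd;

// Compute y[i] = a[i]*x[i] + b[i], four single-precision lanes per step.
// Assumes the slices are 16-byte aligned and their length is a multiple of 4.
void axpb4(const(float)[] a, const(float)[] x, const(float)[] b, float[] y)
{
    foreach (i; 0 .. a.length / 4)
    {
        float4 va = *cast(const(float4)*)(a.ptr + 4 * i);
        float4 vx = *cast(const(float4)*)(x.ptr + 4 * i);
        float4 vb = *cast(const(float4)*)(b.ptr + 4 * i);
        *cast(float4*)(y.ptr + 4 * i) = va * vx + vb; // one mul and one add per lane
    }
}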


June 14, 2015
On Sunday, 14 June 2015 at 18:05:33 UTC, Ola Fosheim Grøstad wrote:
> On Sunday, 14 June 2015 at 15:15:38 UTC, Ilya Yaroshenko wrote:
>>
>>> A naive basic matrix library is simple to write; I don't need standard library support for that, and I get it to work the way I want by using SIMD registers directly... so I probably would not use it if I could implement it myself in less than 10 hours.
>>
>> A naive std.algorithm and std.range are easy to write too.
>
> I wouldn't know. People have different needs. Built-in foreach loops, threads, and SIMD support are more important to me than iterators (ranges).
>
> But the problem with linear algebra is that you might want to do SIMD-optimized versions where you calculate 4 equations at a time, do reshuffling, etc. So a library solution has to provide substantial benefits.

Yes, but it would be hard to create a SIMD-optimised version.

What do you think about this chain of steps?

1. Create generalised BLAS algorithms (templated only on the element type and maybe some flags), probably slow, with a CBLAS-like API.
2. Allow users to use existing CBLAS libraries inside the generalised BLAS.
3. Start improving the generalised BLAS with SIMD instructions.
4. And then continue the discussion about the types of matrices we want...
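
To make step 1 concrete, here is a minimal sketch of what a generalised, CBLAS-like gemm could look like (the names, the Order flag and the row-major-only restriction are just mine, not a design proposal):

enum Order { rowMajor, columnMajor } // placeholder layout flag

// Naive generalised gemm: C = alpha*A*B + beta*C.
// T is anything with +, * and assignment from an integer literal
// (float, double, real, BigInt, ...).
void gemm(T)(Order order, size_t m, size_t n, size_t k,
             T alpha, const(T)* a, size_t lda,
                      const(T)* b, size_t ldb,
             T beta,  T* c, size_t ldc)
{
    assert(order == Order.rowMajor, "column-major left out of the sketch");
    foreach (i; 0 .. m)
        foreach (j; 0 .. n)
        {
            T acc = 0;
            foreach (p; 0 .. k)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
}

Step 2 would then amount to dispatching to cblas_sgemm/cblas_dgemm when T is float or double and a CBLAS implementation is linked in.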

June 14, 2015
On Sunday, 14 June 2015 at 18:49:21 UTC, Ilya Yaroshenko wrote:
> Yes, but it would be hard to create a SIMD-optimised version.

Then again clang is getting better at this stuff.

> What do you think about this chain of steps?
>
> 1. Create generalised BLAS algorithms (templated only on the element type and maybe some flags), probably slow, with a CBLAS-like API.
> 2. Allow users to use existing CBLAS libraries inside the generalised BLAS.
> 3. Start improving the generalised BLAS with SIMD instructions.
> 4. And then continue the discussion about the types of matrices we want...

Hmm… I don't know. In general I think the best thing to do is to develop libraries alongside a concrete project and then turn them into something more abstract.

If I had more time, I think I would assume that we could make LDC produce whatever the next version of clang can do with pragmas/GCC extensions, and use that assumption for building some prototypes. So I would:

1. Prototype typical constructs in C, compile them with the next version of llvm/clang (with e.g. 4x loop unrolling, trying different optimization/vectorizing options), then look at the output in LLVM IR and assembly mnemonics.

2. Then write similar code with a hardware-optimized BLAS and benchmark where the overhead of the BLAS calls balances out, i.e. where pure C/LLVM and BLAS break even.

Then you have a rough idea of what the limitations of the current infrastructure look like, and can start modelling the template types in D?

I'm not sure that you should use SIMD directly, but you should align the memory for it. For instance, on iOS you end up using LLVM subsets because of the new bitcode requirements. Ditto for PNaCl.
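
To sketch what I mean (plain D, no intrinsics; whether LDC actually vectorizes it depends on the backend version and flags such as -O3 -mcpu=native, and matmul here is just an illustrative name):

// Contiguous storage, unit-stride inner loop - the shape an auto-vectorizer
// likes.  The slices are assumed to come from SIMD-aligned allocations
// (e.g. 16- or 32-byte aligned) and to hold n*n elements each.
void matmul(const(float)[] a, const(float)[] b, float[] c, size_t n)
{
    c[] = 0;
    foreach (i; 0 .. n)
        foreach (k; 0 .. n)
        {
            const aik = a[i * n + k];
            foreach (j; 0 .. n)          // unit stride over b and c
                c[i * n + j] += aik * b[k * n + j];
        }
}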

Just a thought, but that's what I would do.

June 14, 2015
Another thing worth noting is that I believe Intel has put some effort into the next gen (?) LLVM/Clang for auto-vectorizing to AVX2. It might be worth looking into, as it uses a mask that allows the CPU to skip computations that would lead to no change, but I think it is only available on the latest generation of Intel CPUs.

Also worth keeping in mind is that future versions of LLVM will have to deal with GCC extensions and perhaps also Clang pragmas. So maybe take a look at:

http://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors

and

http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

?

June 14, 2015
>> 1. Create generalised BLAS algorithms (templated only on the element type and maybe some flags), probably slow, with a CBLAS-like API.
See [1] (the Matmul benchmark). Julia Native is probably backed by Intel MKL or OpenBLAS. The D version was optimized by Martin Nowak [2] and is still _much_ slower.

>> 2. Allow users to use existing CBLAS libraries inside the generalised BLAS.
I think a good interface is more important than the speed of the default implementation (at least for, e.g., large matrix multiplication). Just use existing code for speed...
Goto's papers about his BLAS: [3][4]
Having something competitive in D would be great, but it's probably a lot of work. Without a good D interface, dstep + the OpenBLAS/ATLAS header will not look that bad. Note I am not talking about small matrices/graphics.
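
For reference, the kind of thin, hand-written binding meant here (trimmed to one function and assuming the common 32-bit-integer CBLAS ABI; dstep would generate the full header for you):

// Minimal binding to an existing CBLAS (OpenBLAS, ATLAS, MKL, ...).
// Enum values follow the standard cblas.h; link with e.g. -L-lopenblas.
enum CblasOrder : int { rowMajor = 101, colMajor = 102 }
enum CblasTranspose : int { noTrans = 111, trans = 112, conjTrans = 113 }

extern (C) void cblas_dgemm(
    CblasOrder order, CblasTranspose transA, CblasTranspose transB,
    int m, int n, int k,
    double alpha, const(double)* a, int lda,
                  const(double)* b, int ldb,
    double beta, double* c, int ldc);

// Thin D-friendly wrapper: C = A*B for densely packed row-major matrices,
// where A is m x k, B is k x n and C is m x n.
void mulByBlas(const(double)[] a, const(double)[] b, double[] c,
               int m, int n, int k)
{
    cblas_dgemm(CblasOrder.rowMajor, CblasTranspose.noTrans,
                CblasTranspose.noTrans,
                m, n, k, 1.0, a.ptr, k, b.ptr, n, 0.0, c.ptr, n);
}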

>> 3. Start improving the generalised BLAS with SIMD instructions.
Nice, but not really important. A good interface to an existing high-quality BLAS seems more important to me than a fast D linear algebra implementation plus a CBLAS-like interface.

>> 4. And then continue the discussion about the types of matrices we want...
>
+1

> 2. Then write similar code with a hardware-optimized BLAS and benchmark where the overhead of the BLAS calls balances out, i.e. where pure C/LLVM and BLAS break even.
Maybe there are more important/beneficial things to work on - assuming the total time of contributors is fixed and would otherwise go to other D stuff :)

[1] https://github.com/kostya/benchmarks
[2] https://github.com/kostya/benchmarks/pull/6
[3] http://www.cs.utexas.edu/users/flame/pubs/GotoTOMS2.pdf
[4] http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
June 14, 2015
On Sunday, 14 June 2015 at 21:31:53 UTC, anonymous wrote:
>> 2. Then write similar code with a hardware-optimized BLAS and benchmark where the overhead of the BLAS calls balances out, i.e. where pure C/LLVM and BLAS break even.
> Maybe there are more important/beneficial things to work on - assuming the total time of contributors is fixed and would otherwise go to other D stuff :)

Sure, but that is what I'd do if I had the time. Get a baseline for what kind of NxN sizes D can reasonably be expected to deal with in a "naive brute force" manner.

Then consider pushing anything beyond that over to something more specialized.

*shrugs*
June 15, 2015
On Sunday, 14 June 2015 at 21:50:02 UTC, Ola Fosheim Grøstad wrote:
> Sure, but that is what I'd do if I had the time. Get a baseline for what kind of NxN sizes D can reasonably be expected to deal with in a "naive brute force" manner.

In case it isn't obvious: a potential advantage of a simple algorithm that does "naive brute force" is that the backend might stand a better chance of optimizing it, at least when you have a matrix that is known at compile time.
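
For example, with everything fixed at compile time (a sketch; how much the backend actually unrolls, vectorizes or folds away is up to it):

// 4x4 single-precision multiply over static arrays.  All sizes and loop
// bounds are compile-time constants, so the optimizer is free to fully
// unroll and vectorize this, and with compile-time-known operands it can
// fold parts of the computation away entirely.
float[4][4] mul4(const float[4][4] a, const float[4][4] b)
{
    float[4][4] c;
    foreach (ref row; c) row[] = 0;      // floats default to NaN, so zero first
    foreach (i; 0 .. 4)
        foreach (k; 0 .. 4)
            foreach (j; 0 .. 4)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}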

June 15, 2015
On Sunday, 14 June 2015 at 21:50:02 UTC, Ola Fosheim Grøstad wrote:
> On Sunday, 14 June 2015 at 21:31:53 UTC, anonymous wrote:
>>> 2. Then write similar code with a hardware-optimized BLAS and benchmark where the overhead of the BLAS calls balances out, i.e. where pure C/LLVM and BLAS break even.
>> Maybe there are more important/beneficial things to work on - assuming the total time of contributors is fixed and would otherwise go to other D stuff :)
>
> Sure, but that is what I'd do if I had the time. Get a baseline for what kind of NxN sizes D can reasonably be expected to deal with in a "naive brute force" manner.
>
> Then consider pushing anything beyond that over to something more specialized.
>
> *shrugs*

Sorry, I should read more carefully. I understood it as 'optimize the default implementation to the speed of a high-quality BLAS for _any_/large matrix size'. Great if it gets done, but IMO there is no real pressure to do it, and it probably needs a lot of expert time.

Benchmarking when an existing BLAS is actually faster than 'naive brute force' sounds very good and reasonable.
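
Something along these lines would already answer it (a sketch: naiveMul is just the plain triple loop, mulByBlas stands for a cblas_dgemm wrapper like the one sketched earlier in the thread, and timing uses std.datetime's StopWatch):

import std.datetime : StopWatch;
import std.stdio : writefln;

// Plain triple-loop reference implementation (row-major, n x n).
void naiveMul(const(double)[] a, const(double)[] b, double[] c, size_t n)
{
    c[] = 0;
    foreach (i; 0 .. n)
        foreach (k; 0 .. n)
            foreach (j; 0 .. n)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Time naive vs. BLAS for growing n to find where BLAS starts to win.
void crossover()
{
    foreach (n; [16, 32, 64, 128, 256, 512])
    {
        auto a = new double[](n * n);
        auto b = new double[](n * n);
        auto c = new double[](n * n);
        a[] = 1.0;
        b[] = 2.0;

        StopWatch sw;
        sw.start();
        naiveMul(a, b, c, n);
        sw.stop();
        immutable naive = sw.peek().usecs;

        sw.reset();
        sw.start();
        mulByBlas(a, b, c, n, n, n); // the extern(C) cblas_dgemm wrapper
        sw.stop();
        immutable blas = sw.peek().usecs;

        writefln("n=%4d  naive: %8d us  blas: %8d us", n, naive, blas);
    }
}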