2016Q1: std.blas - D Programming Language Discussion Forum

December 26, 2015

2016Q1: std.blas

Posted by Ilya Yaroshenko

Hi,

I will write GEMM and GEMV families of BLAS for Phobos.

Goals:
 - code without assembler
 - code based on SIMD instructions
 - DMD/LDC/GDC support
 - kernel-based architecture like OpenBLAS
 - 85-100% of OpenBLAS FLOPS (OpenBLAS = 100%)
 - tiny generic code compared with OpenBLAS
 - ability to define user kernels
 - allocator support. GEMM requires small internal allocations.
 - @nogc nothrow pure template functions (depends on allocator)
 - optional multithreading
 - ability to work with `Slice` multidimensional arrays when the stride between elements in a vector is greater than 1. In common BLAS, one of the matrix strides (between rows or between columns) always equals 1.

Implementation details:
LDC     all   : very generic D/LLVM IR kernels. AVX/AVX2/AVX-512/NEON support works out of the box.
DMD/GDC x86   : kernels for  8 XMM registers based on core.simd
DMD/GDC x86_64: kernels for 16 XMM registers based on core.simd
DMD/GDC other : generic kernels without SIMD instructions. AVX/AVX2/AVX-512 support can be added in the future.

References:
[1] Anatomy of High-Performance Matrix Multiplication: http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
[2] OpenBLAS  https://github.com/xianyi/OpenBLAS

Happy New Year!

Ilya

December 26, 2015

Re: 2016Q1: std.blas

Posted by Ilya Yaroshenko
in reply to Ilya Yaroshenko

Related questions about LDC
http://forum.dlang.org/thread/lcrquwrehuezpxxvquhs@forum.dlang.org

December 27, 2015

Re: 2016Q1: std.blas

Posted by Andrei Amatuni
in reply to Ilya Yaroshenko

On Saturday, 26 December 2015 at 19:57:19 UTC, Ilya Yaroshenko wrote:
> Hi,
>
> I will write GEMM and GEMV families of BLAS for Phobos.
>
> [...]

Just want to thank you in advance. Can't wait!

December 27, 2015

Re: 2016Q1: std.blas

Posted by Basile B.
in reply to Ilya Yaroshenko

On Saturday, 26 December 2015 at 19:57:19 UTC, Ilya Yaroshenko wrote:
>  - allocators support. GEMM requires small internal allocations.
>  - @nogc nothrow pure template functions (depends on allocator)

Do you mean using std.experimental.allocator and something like (IAllocator alloc) as a template parameter?

If so, this will mostly not work. Only Mallocator is really @nogc (on Phobos master), and maybe only from the next DMD point release, so GDC and LDC will not have it for months.

December 27, 2015

Re: 2016Q1: std.blas

Posted by Charles McAnany
in reply to Ilya Yaroshenko

On Saturday, 26 December 2015 at 19:57:19 UTC, Ilya Yaroshenko wrote:
> Hi,
>
> I will write GEMM and GEMV families of BLAS for Phobos.
>
> [...]

I am absolutely thrilled! I've been using scid (https://github.com/kyllingstad/scid) and cblas (https://github.com/DlangScience/cblas) in a project, and I can't wait to see a smooth integration in the standard library.

Couple questions:

Why will the functions be nothrow? It seems that if you try to take the determinant of a 3x5 matrix, you should get an exception.

By 'tiny generic code', you mean that DGEMM, SSYMM, CTRMM, etc. all become one function, basically?

You mention that you'll have GEMM and GEMV in your features, do you think we'll get a more complete slice of BLAS/LAPACK in the future, like GESVD and GEES?

If it's not in the plan, I'd be happy to work on re-tooling scid and cblas to feel like std.blas. (That is, mimic how you choose to represent a matrix, throw the same type of exceptions, etc. But still use external libraries.)

Thanks again for this!

December 27, 2015

Re: 2016Q1: std.blas

Posted by Ilya Yaroshenko
in reply to Basile B.

On Sunday, 27 December 2015 at 05:23:27 UTC, Basile B. wrote:
> On Saturday, 26 December 2015 at 19:57:19 UTC, Ilya Yaroshenko wrote:
>>  - allocators support. GEMM requires small internal allocations.
>>  - @nogc nothrow pure template functions (depends on allocator)
>
> Do you mean using std.experimental.allocator and something like (IAllocator alloc) as a template parameter?
>
> If so, this will mostly not work. Only Mallocator is really @nogc (on Phobos master), and maybe only from the next DMD point release, so GDC and LDC will not have it for months.

Mallocator is only a base for building various user-defined allocators from building blocks such as a free list. I hope to create the std.blas module without dependencies on Phobos, core.memory, or core.thread, so it can be used like a C library. Use of std.allocator is optional.
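
To make the "depends on allocator" point concrete, here is a minimal sketch, not a proposed std.blas API. The name `packedSum` is hypothetical. It assumes an allocator with `allocate`/`deallocate` in the style of std.experimental.allocator, and uses `Mallocator` only as an example. Because the function is a template, its `@nogc nothrow` attributes are inferred from whatever allocator is passed in.

```d
import std.experimental.allocator.mallocator : Mallocator;

// Hypothetical helper (not a std.blas API): copies `n` elements spaced
// `stride` apart into a contiguous scratch buffer taken from `alloc`,
// sums them, and releases the buffer. GEMM implementations pack blocks
// into contiguous buffers in a similar way.
double packedSum(Allocator)(auto ref Allocator alloc, const(double)* x,
                            size_t n, ptrdiff_t stride)
{
    void[] raw = alloc.allocate(n * double.sizeof);
    if (raw.length < n * double.sizeof)
        return double.nan;              // allocation failed
    scope (exit) alloc.deallocate(raw); // always release the scratch buffer

    auto buf = (cast(double*) raw.ptr)[0 .. n];
    foreach (i; 0 .. n)
        buf[i] = x[cast(ptrdiff_t) i * stride]; // pack strided input

    double s = 0;
    foreach (v; buf)
        s += v;
    return s;
}

// Compiles as @nogc nothrow only because Mallocator's methods are.
@nogc nothrow unittest
{
    double[6] data = [1, 2, 3, 4, 5, 6];
    // Every second element: 1 + 3 + 5.
    assert(packedSum(Mallocator.instance, data.ptr, 3, 2) == 9);
}
```

With an allocator whose methods are not `@nogc`, the same template would still compile, but callers could no longer be marked `@nogc`.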

Ilya

December 27, 2015

Re: 2016Q1: std.blas

Posted by Ilya Yaroshenko
in reply to Charles McAnany

On Sunday, 27 December 2015 at 05:43:47 UTC, Charles McAnany wrote:
> On Saturday, 26 December 2015 at 19:57:19 UTC, Ilya Yaroshenko wrote:
>> Hi,
>>
>> I will write GEMM and GEMV families of BLAS for Phobos.
>>
>> [...]
>>
>> References:
>> [1] Anatomy of High-Performance Matrix Multiplication: http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
>> [2] OpenBLAS  https://github.com/xianyi/OpenBLAS
>>
>> Happy New Year!
>>
>> Ilya
>
> I am absolutely thrilled! I've been using scid (https://github.com/kyllingstad/scid) and cblas (https://github.com/DlangScience/cblas) in a project, and I can't wait to see a smooth integration in the standard library.
>
> Couple questions:
>
> Why will the functions be nothrow? It seems that if you try to take the determinant of a 3x5 matrix, you should get an exception.

The determinant is part of the LAPACK API, not the BLAS API. BTW, D scientific code should avoid throwing exceptions where possible, because it may be integrated into C projects.

> By 'tiny generic code', you mean that DGEMM, SSYMM, CTRMM, etc. all become one function, basically?

No, it is about portability and optimisation. OpenBLAS has a huge code base written in assembler for various platforms. I want to generalise the optimisation logic.
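
As a rough illustration of what a generic kernel could look like, here is a minimal sketch. It is not taken from OpenBLAS or from any std.blas draft; the name `microKernel`, the tile parameters `MR`/`NR`, and the packed panel layout are all assumptions. One template covers every element type and tile size, and vectorisation is left to the compiler.

```d
// Hypothetical micro-kernel: accumulates an MR x NR tile, C += A_panel * B_panel.
// The panels are assumed to be packed so that step p holds MR values of A
// (a[p*MR + i] = A[i][p]) and NR values of B (b[p*NR + j] = B[p][j]).
void microKernel(T, size_t MR, size_t NR)(size_t k, const(T)* a, const(T)* b,
                                          T* c, ptrdiff_t rowStride,
                                          ptrdiff_t colStride)
    @nogc nothrow pure
{
    T[NR][MR] acc;                  // tile accumulators, ideally kept in registers
    foreach (ref row; acc)
        row[] = 0;                  // floating point defaults to NaN in D

    foreach (p; 0 .. k)
        foreach (i; 0 .. MR)        // MR and NR are compile-time constants,
            foreach (j; 0 .. NR)    // so the optimiser can unroll these loops
                acc[i][j] += a[p * MR + i] * b[p * NR + j];

    // Write the tile back through arbitrary row and column strides.
    foreach (i; 0 .. MR)
        foreach (j; 0 .. NR)
            c[cast(ptrdiff_t) i * rowStride + cast(ptrdiff_t) j * colStride] += acc[i][j];
}

unittest
{
    // A = [1 2; 3 4], B = [5 6; 7 8], so A*B = [19 22; 43 50].
    double[4] a = [1, 3, 2, 4];     // packed: step p holds A[0][p], A[1][p]
    double[4] b = [5, 6, 7, 8];     // packed: step p holds B[p][0], B[p][1]
    double[4] c = 0;                // row-major 2x2 result
    microKernel!(double, 2, 2)(2, a.ptr, b.ptr, c.ptr, 2, 1);
    assert(c == [19.0, 22, 43, 50]);
}
```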

> You mention that you'll have GEMM and GEMV in your features, do you think we'll get a more complete slice of BLAS/LAPACK in the future, like GESVD and GEES?

LAPACK can be implemented as a standalone package. I hope I will have time to work on it. Another way is to define a new part of Phobos with a sci.* prefix.

> If it's not in the plan, I'd be happy to work on re-tooling scid and cblas to feel like std.blas. (That is, mimic how you choose to represent a matrix, throw the same type of exceptions, etc. But still use external libraries.)

It would be cool to see scid with its Matrix type replaced by Slice!(2, double*) / Slice!(2, float*). I would argue against using any kind of Matrix type, in favour of the upcoming Slice: https://github.com/D-Programming-Language/phobos/pull/3397. Slice!(2, double*) is a generalisation of a matrix type with two strides, one for rows and one for columns. std.blas can be implemented to support this feature out of the box. Slice!(2, double*) does not need a transposed flag (the transpose operator only swaps strides and lengths), and the fortran_vs_C flag (column-based vs row-based) becomes an obsolete relic.
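
The two-stride idea can be sketched in a few lines. This is an illustration only: `Mat2` is a hypothetical type and not the ndslice `Slice` from the pull request above, whose actual API may differ.

```d
// Hypothetical 2-D strided view, for illustration only.
struct Mat2(T)
{
    T* ptr;                         // pointer to element (0, 0)
    size_t rows, cols;              // lengths
    ptrdiff_t rowStride, colStride; // element steps between rows / columns

    // Element (i, j) lives at offset i*rowStride + j*colStride.
    ref T opIndex(size_t i, size_t j) @nogc nothrow pure
    {
        assert(i < rows && j < cols);
        return ptr[cast(ptrdiff_t) i * rowStride + cast(ptrdiff_t) j * colStride];
    }

    // Transposition copies no data: it only swaps lengths and strides.
    Mat2 transposed() @nogc nothrow pure
    {
        return Mat2(ptr, cols, rows, colStride, rowStride);
    }
}

unittest
{
    // 2x3 row-major matrix: row stride 3, column stride 1.
    double[6] data = [1, 2, 3,
                      4, 5, 6];
    auto a = Mat2!double(data.ptr, 2, 3, 3, 1);
    auto t = a.transposed();        // 3x2 view over the same memory
    assert(t.rows == 3 && t.cols == 2);
    assert(a[0, 2] == 3 && t[2, 0] == 3);
    t[1, 0] = 50;                   // writes through to a[0, 1]
    assert(data[1] == 50);
}
```

In this sketch a row-major matrix has strides (cols, 1) and a column-major one has (1, rows), so the same view type covers both C and Fortran layouts without a flag.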

Ilya

December 27, 2015

Re: 2016Q1: std.blas

Posted by Russel Winder
in reply to Ilya Yaroshenko

On Sat, 2015-12-26 at 19:57 +0000, Ilya Yaroshenko via Digitalmars-d-announce wrote:
> Hi,
> 
> I will write GEMM and GEMV families of BLAS for Phobos.
> 
> Goals:
>   - code without assembler
>   - code based on SIMD instructions
>   - DMD/LDC/GDC support
>   - kernel based architecture like OpenBLAS
>   - 85-100% FLOPS comparing with OpenBLAS (100%)
>   - tiny generic code comparing with OpenBLAS
>   - ability to define user kernels
>   - allocators support. GEMM requires small internal allocations.
>   - @nogc nothrow pure template functions (depends on allocator)
>   - optional multithreaded
>   - ability to work with `Slice` multidimensional arrays when
> stride between elements in vector is greater than 1. In common
> BLAS matrix strides between rows or columns always equals 1.

Shouldn't the goal of a project like this be to be something that OpenBLAS isn't? Given D's ability to call C and C++ code, it is not clear to me that simply rewriting OpenBLAS in D has any goal for the D or BLAS communities per se. Doesn't stop it being a fun activity for the programmer, obviously, but unless there is something that isn't in OpenBLAS, I cannot see this ever being competition and so building a community around the project.

Now if the threads/OpenCL/CUDA was front and centre so that a goal was to be Nx faster than OpenBLAS, that could be a goal worth standing behind.

Not to mention full N-dimension vectors so that D could seriously compete against Numpy in the Python world.


December 27, 2015

Re: 2016Q1: std.blas

Posted by Ilya Yaroshenko
in reply to Russel Winder

On Sunday, 27 December 2015 at 10:28:53 UTC, Russel Winder wrote:
> On Sat, 2015-12-26 at 19:57 +0000, Ilya Yaroshenko via Digitalmars-d-announce wrote:
>> Hi,
>> 
>> I will write GEMM and GEMV families of BLAS for Phobos.
>> 
>> [...]
>
> Shouldn't the goal of a project like this be to be something that OpenBLAS isn't? Given D's ability to call C and C++ code, it is not clear to me that simply rewriting OpenBLAS in D has any goal for the D or BLAS communities per se. Doesn't stop it being a fun activity for the programmer, obviously, but unless there is something that isn't in OpenBLAS, I cannot see this ever being competition and so building a community around the project.

It depends on what you mean by "something like this". OpenBLAS is a _huge_ amount of assembler code. For _each_ platform, for _each_ CPU generation, for _each_ floating point / complex type, it has one or several kernels. It is 30 MB of assembler code.

Not only can D code call C/C++; C/C++ (and therefore any other language) can also call D code. So std.blas may be used in C/C++ projects like Julia.
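
For example, a D function can be exported with C linkage. The sketch below is only an illustration: the symbol name `dblas_ddot` is invented here and is not part of any planned std.blas interface.

```d
// Hypothetical exported symbol with C linkage. A C caller would declare it as:
//   double dblas_ddot(size_t n, const double *x, ptrdiff_t incx,
//                     const double *y, ptrdiff_t incy);
// It is @nogc nothrow, so it needs no GC and throws no exceptions across
// the language boundary.
extern (C) double dblas_ddot(size_t n, const(double)* x, ptrdiff_t incx,
                             const(double)* y, ptrdiff_t incy) @nogc nothrow
{
    double s = 0;
    foreach (i; 0 .. n)
        s += x[cast(ptrdiff_t) i * incx] * y[cast(ptrdiff_t) i * incy];
    return s;
}

unittest
{
    double[3] x = [1, 2, 3];
    double[3] y = [4, 5, 6];
    assert(dblas_ddot(3, x.ptr, 1, y.ptr, 1) == 32); // 4 + 10 + 18
}
```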

> Now if the threads/OpenCL/CUDA was front and centre so that a goal was to be Nx faster than OpenBLAS, that could be a goal worth standing behind.

It can be a goal for a standalone project. But a standard library should be portable to any platform without significant problems (especially without problems caused by matrix multiplication). So my goal is a tiny and portable project like ATLAS, but as fast as OpenBLAS. BTW, threads in std.blas would be optional, as in OpenBLAS. Furthermore, std.blas will allow users to write their own kernels.

> Not to mention full N-dimension vectors so that D could seriously compete against Numpy in the Python world.

I am not sure how D can compete against Numpy in the Python world, but it can compete with Python in the world of programming languages. BTW, N-dimensional ranges/arrays/vectors are already implemented for Phobos:

PR:
https://github.com/D-Programming-Language/phobos/pull/3397

Updated Docs:
http://dtest.thecybershadow.net/artifact/website-76234ca0eab431527327d5ce1ec0ad74c6421533-fedfc857090c1c873b17e7a1e4cf853c/web/phobos-prerelease/std_experimental_ndslice.html

Please participate in the voting (the time limit has been extended) :-) http://forum.dlang.org/thread/nexiojzouxtawdwnlfvt@forum.dlang.org

Ilya

December 27, 2015

Re: 2016Q1: std.blas

Posted by Andrei Alexandrescu
in reply to Basile B.

On 12/27/15 12:23 AM, Basile B. wrote:
> On Saturday, 26 December 2015 at 19:57:19 UTC, Ilya Yaroshenko wrote:
>>  - allocators support. GEMM requires small internal allocations.
>>  - @nogc nothrow pure template functions (depends on allocator)
>
> Do you mean using std.experimental.allocator and something like
> (IAllocator alloc) as a template parameter?
>
> If so, this will mostly not work. Only Mallocator is really @nogc (on
> Phobos master), and maybe only from the next DMD point release, so GDC
> and LDC will not have it for months.

There are also Mmap- and Sbrk-based allocators. -- Andrei
