BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

I have been trying to come up with a convincing use case for the new mixins (and for metaprogramming in general). My best effort to date can be found at:
http://www.dsource.org/projects/mathextra/browser/trunk/mathextra/Blade.d

It generates near-optimal x87 asm code for BLAS1-style basic vector operations. 32, 64 and 80 bit vectors are all supported.

Compile with -version=BladeDebug to see the asm code which is generated.

Typical usage:

void main()
{
    auto p = Vec([1.0L, 2, 18]);    // a vector of 80-bit reals.
    auto q = Vec([3.5L, 1.1, 3.8]);  // ditto
    auto r = Vec([17.0f, 28.25, 1]); // a vector of 32-bit floats
    auto z = Vec([17.0i, 28.1i, 1i]); // a vector of 64-bit idoubles
    real d = dot(r, p+r+r);
    ireal e = dot(r, z);
    q -= ((r+p)*18.0L*314.1L - (p-r))* 35;
    d = dot(r, p+r+r);
}

Notice that mixed-length operations (real[] + float[] - double[]) are supported.

Like the C++ Blitz++ library, expression templates are used to convert vector expressions into efficient element-wise operations. Unlike that library, however, there is no reliance on the compiler's optimiser. Instead, the expression template is manipulated as text, converted into postfix, and then passed to a simple CTFE compile-time assembler, which creates highly efficient asm code which is used as a mixin.
To understand the later parts of the code, you need some knowledge of x87 assembler. In fact, you probably need to have read Agner Fog's superb Pentium optimisation manual (www.agner.org).

Some observations:
* I was amazed at how simple the expression template code is (it is somewhat cluttered by the code to check for real/imaginary type mismatch errors).
* I've often read that the x87 floating-point stack is notoriously difficult for compilers to write code for, but it works quite well in this case.
* The major workarounds are:
- inability to use a tuple element directly from asm code (bug #1028);
- inability to define operators for built-in arrays (hence the use of 'Vec' wrappers).
- inability to index through a tuple in a CTFE function (solved by converting types into a string).
* There have been mutterings about how unhygenic/dangerous the new mixins are. In this case, the mixin forms the _entire_ body of the function. This is an interesting situation which I think a language purist will find more palatable.

Enjoy.

April 03, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Walter Bright
in reply to Don Clugston

Permalink

Walter Bright

Posted in reply to Don Clugston

Permalink

Don Clugston wrote:
> I have been trying to come up with a convincing use case for the new mixins (and for metaprogramming in general). My best effort to date can be found at:
> http://www.dsource.org/projects/mathextra/browser/trunk/mathextra/Blade.d

This is pretty incredible.

Can I beg you to write an article about it?

April 03, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Walter Bright
in reply to Don Clugston

Permalink

Walter Bright

Posted in reply to Don Clugston

Permalink

Don Clugston wrote:
> * I've often read that the x87 floating-point stack is notoriously difficult for compilers to write code for, but it works quite well in this case.

The hard part is doing register allocation. Generating code for the stack machine is trivial. There are some complications for dealing with 87 stack overflows.

April 03, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Pragma
in reply to Don Clugston

Permalink

Pragma

Posted in reply to Don Clugston

Permalink

Don Clugston wrote:
> I have been trying to come up with a convincing use case for the new mixins (and for metaprogramming in general). My best effort to date can be found at:
> http://www.dsource.org/projects/mathextra/browser/trunk/mathextra/Blade.d
> 
> It generates near-optimal x87 asm code for BLAS1-style basic vector operations. 32, 64 and 80 bit vectors are all supported.
> 
> Compile with -version=BladeDebug to see the asm code which is generated.
> 
> Typical usage:
> 
> void main()
> {
>     auto p = Vec([1.0L, 2, 18]);    // a vector of 80-bit reals.
>     auto q = Vec([3.5L, 1.1, 3.8]);  // ditto
>     auto r = Vec([17.0f, 28.25, 1]); // a vector of 32-bit floats
>     auto z = Vec([17.0i, 28.1i, 1i]); // a vector of 64-bit idoubles
>     real d = dot(r, p+r+r);
>     ireal e = dot(r, z);
>     q -= ((r+p)*18.0L*314.1L - (p-r))* 35;
>     d = dot(r, p+r+r);
> }
> 
> Notice that mixed-length operations (real[] + float[] - double[]) are supported.
> 
> Like the C++ Blitz++ library, expression templates are used to convert vector expressions into efficient element-wise operations. Unlike that library, however, there is no reliance on the compiler's optimiser. Instead, the expression template is manipulated as text, converted into postfix, and then passed to a simple CTFE compile-time assembler, which creates highly efficient asm code which is used as a mixin.
> To understand the later parts of the code, you need some knowledge of x87 assembler. In fact, you probably need to have read Agner Fog's superb Pentium optimisation manual (www.agner.org).
> 
> Some observations:
> * I was amazed at how simple the expression template code is (it is somewhat cluttered by the code to check for real/imaginary type mismatch errors).
> * I've often read that the x87 floating-point stack is notoriously difficult for compilers to write code for, but it works quite well in this case.
> * The major workarounds are:
> - inability to use a tuple element directly from asm code (bug #1028);
> - inability to define operators for built-in arrays (hence the use of 'Vec' wrappers).
> - inability to index through a tuple in a CTFE function (solved by converting types into a string).
> * There have been mutterings about how unhygenic/dangerous the new mixins are. In this case, the mixin forms the _entire_ body of the function. This is an interesting situation which I think a language purist will find more palatable.
> 
> Enjoy.

This is a work of art Don - it's practically a compiler extension.  Nice job. :)

For others that are interested in how this actually gets the job done, I found this in your documentation:

* THEORY:
* Expression templates are used to create an expression string of the form "(a+b*c)+d"
* and a tuple, the entries of which correspond to a, b, c, d, ...
* This string is converted to postfix. The postfix string is converted to
* a string containing x87 asm, which is then mixed into a function which accepts the tuple.

-- 
- EricAnderton at yahoo

April 03, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Witold Baryluk
in reply to Don Clugston

Permalink

Witold Baryluk

Posted in reply to Don Clugston

Permalink

Hello,

Very good work. I'm impressed. Actually i was developing very
similar program, but yours is practicly all I was needing.
I'm developing linear algebra package in D, and now I'm optimising
it for vector machines, so your BLAS1-like pacakge will be
helpful. :)

Thx.

-- 
Witold Baryluk

April 04, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Davidl
in reply to Don Clugston

Permalink

Davidl

Posted in reply to Don Clugston

Permalink

i think compilers have feeled a great challenge from compile-time back-end like power

> I have been trying to come up with a convincing use case for the new mixins (and for metaprogramming in general). My best effort to date can be found at:
> http://www.dsource.org/projects/mathextra/browser/trunk/mathextra/Blade.d
>
> It generates near-optimal x87 asm code for BLAS1-style basic vector operations. 32, 64 and 80 bit vectors are all supported.
>
> Compile with -version=BladeDebug to see the asm code which is generated.
>
> Typical usage:
>
> void main()
> {
>      auto p = Vec([1.0L, 2, 18]);    // a vector of 80-bit reals.
>      auto q = Vec([3.5L, 1.1, 3.8]);  // ditto
>      auto r = Vec([17.0f, 28.25, 1]); // a vector of 32-bit floats
>      auto z = Vec([17.0i, 28.1i, 1i]); // a vector of 64-bit idoubles
>      real d = dot(r, p+r+r);
>      ireal e = dot(r, z);
>      q -= ((r+p)*18.0L*314.1L - (p-r))* 35;
>      d = dot(r, p+r+r);
> }
>
> Notice that mixed-length operations (real[] + float[] - double[]) are supported.
>
> Like the C++ Blitz++ library, expression templates are used to convert vector expressions into efficient element-wise operations. Unlike that library, however, there is no reliance on the compiler's optimiser. Instead, the expression template is manipulated as text, converted into postfix, and then passed to a simple CTFE compile-time assembler, which creates highly efficient asm code which is used as a mixin.
> To understand the later parts of the code, you need some knowledge of x87 assembler. In fact, you probably need to have read Agner Fog's superb Pentium optimisation manual (www.agner.org).
>
> Some observations:
> * I was amazed at how simple the expression template code is (it is somewhat cluttered by the code to check for real/imaginary type mismatch errors).
> * I've often read that the x87 floating-point stack is notoriously difficult for compilers to write code for, but it works quite well in this case.
> * The major workarounds are:
> - inability to use a tuple element directly from asm code (bug #1028);
> - inability to define operators for built-in arrays (hence the use of 'Vec' wrappers).
> - inability to index through a tuple in a CTFE function (solved by converting types into a string).
> * There have been mutterings about how unhygenic/dangerous the new mixins are. In this case, the mixin forms the _entire_ body of the function. This is an interesting situation which I think a language purist will find more palatable.
>
> Enjoy.

April 04, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Don Clugston
in reply to Witold Baryluk

Permalink

Don Clugston

Posted in reply to Witold Baryluk

Permalink

Witold Baryluk wrote:
> Hello,
> 
> Very good work. I'm impressed. Actually i was developing very
> similar program, but yours is practicly all I was needing.
> I'm developing linear algebra package in D, and now I'm optimising
> it for vector machines, so your BLAS1-like pacakge will be
> helpful. :)

Do you intend to implement BLAS2 and BLAS3-like functionality? I feel that I don't know enough about cache-efficient matrix blocking techniques to be confident in presenting code. I don't know how sophisticated the techniques are in libraries like ATLAS, but the ones used in Blitz++ should be easy to compete with.

Blade is still mostly proof-of-concept rather than being industrial-strength -- there are so many possibilities for improvement, it's not well tested, and I haven't even put any effort into making the code look nice. But as Davidl said, it shows that hard-core back-end optimisation can now be done in a library at compile time.

The code is quite modularised, so it would be straightforward to add a check for 'is it SSE-able' (ie, are all the vectors floats) and 'is it SSE2-able' (are all the vectors doubles), and if so, send them off into specialised assemblers, otherwise send them to the x87 one. The postfix-ing step will involve searching for *+ combinations where fma can be used. The ability to know the vector length in many cases can be a huge advantage for SSE code, where loop unrolling by a factor of 2 or 4 is always necessary. I haven't got that far yet.

I'm pretty sure that this type of technique can blow almost everything else out of the water. <g>

April 04, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Don Clugston
in reply to Walter Bright

Permalink

Don Clugston

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> Don Clugston wrote:
>> I have been trying to come up with a convincing use case for the new mixins (and for metaprogramming in general). My best effort to date can be found at:
>> http://www.dsource.org/projects/mathextra/browser/trunk/mathextra/Blade.d
> 
> This is pretty incredible.

I thought you'd like it. The sheer power of D these days...
"Flies like a rocket, steers like a bicycle."

> Can I beg you to write an article about it?

OK. Are you thinking somewhere like DDJ?

It's nicely focused (central idea being something like "D's metaprogramming is so powerful, that compiler back-end optimisation can easily be performed in compile-time libraries") -- but it would require loads of explanation of D's coolest features. Could be a challenge to make it easily readable.

But with the proposed changes to 'const', macros, and previous discussion about whether tuples should be automatically flattened, I have doubts over how stable that part of the language is. I wouldn't want to write something that didn't compile any more, or looked hopelessly obsolete by the time it was published.
I've been bitten before -- CTFE blasted almost all of my previous metaprogramming code into the stone age <g>. What's the shelf life of this stuff?.

April 04, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Georg Wrede
in reply to Don Clugston

Permalink

Georg Wrede

Posted in reply to Don Clugston

Permalink

Don Clugston wrote:
> Could be a challenge to make it easily readable.

I've some experience proofreading and commenting similar articles. Only too happy to oblige!

Also, DDJ readers may not be the CUJ crowd, but still, they can handle advanced stuff. Besides, some similar magazines help you by rewriting the article anyway, because then it has the same overall "feel" that their other articles. Articles in SciAm for example feel the same, no matter who or from what continent is presented as the author.

April 04, 2007

Re: BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Posted by Walter Bright
in reply to Don Clugston

Permalink

Walter Bright

Posted in reply to Don Clugston

Permalink

Don Clugston wrote:
> Walter Bright wrote:
>> Can I beg you to write an article about it?
> 
> OK. Are you thinking somewhere like DDJ?

Anywhere!

> It's nicely focused (central idea being something like "D's metaprogramming is so powerful, that compiler back-end optimisation can easily be performed in compile-time libraries") -- but it would require loads of explanation of D's coolest features. Could be a challenge to make it easily readable.
> 
> But with the proposed changes to 'const', macros, and previous discussion about whether tuples should be automatically flattened, I have doubts over how stable that part of the language is. I wouldn't want to write something that didn't compile any more, or looked hopelessly obsolete by the time it was published.
> I've been bitten before -- CTFE blasted almost all of my previous metaprogramming code into the stone age <g>. What's the shelf life of this stuff?.

Good question. I've already fixed #1028. But even if it looks stone age 6 months from now, it's still a decade ahead of any other language.

Top | Forum index | About this forum

Forums