November 20, 2007
No, I don't have the perfect Matrix. I just want to explore the question of whether it exists.
There are a plethora of C++ matrix libraries, all of which are far from perfect; they all involve trade-offs between various language limitations.
In particular, the usage of expression templates creates all kinds of evil.

Requirements:
1. Must be compatible with future multi-dimensional array syntax such as
double [7,5] m;
2. Must be theoretically capable of optimal performance.
3. Needs to support slices. This means we also need to support strided
vectors. Therefore we also need matrix slices, and matrix references.
Probably distinguishing between the T[] and T[new] concepts.
4. For storage we need packed, triangular, symmetric, banded, and sparse
matrix support. (Possibly even 'calculated' matrix support, for
user-defined matrix types which generate entries on-demand).
5. Can be generalized to tensors.
6. Nice error messages.

I now have enough experience with my BLADE library to be confident that all of these features can be achieved in existing D. In particular, the crucial requirement 2 can be completely decoupled from the others.

But, there are many requirements (dozens?) which are not on my brief list, and perhaps they cannot all be met simultaneously, even in theory. Yet I'm not convinced that it's impossible. What are the other requirements?
November 20, 2007
Don Clugston wrote:
> No, I don't have the perfect Matrix. I just want to explore the question of whether it exists.

A very interesting topic indeed. I will just give some quick thoughts to start with. First of all, a matrix is (generally) 2-dimensional, but there are many uses for higher-dimensional structures as well. Also, the 1 dimensional vector is a useful structure that could be handled in the same generic framework. Is the target just 2-dimensional matrices or also higher dimensional arrays?

The matrix as a mathematical structure is quite different in semantics and use from generic multi-dimensional arrays and should perhaps be treated separately.

To avoid confusion, I will use "size" to refer to the number of elements of a matrix/array and "dimensionality" to refer to the number of dimensions of an array (a vector is 1-dimensional, a matrix is 2-dimensional).

First, there is the separation of compile time and run time. One can identify (at least) 3 different needs, with different requirements for run-time/compile-time dynamics:

* Small static matrices/arrays (where the size and dimensionality are known at compile time)
* Dynamic matrices/arrays (where the sizes are determined at run-time, but dimensionality is fixed at compile time)
* Fully dynamic multidimensional arrays (where the dimensionality is determined at run time)

The third case could probably in most cases be left out, but the first two make an interesting separation. Knowing the size at compile time is mostly an advantage for smaller matrices (eg 4x4, as commonly used for geometric translations). Larger matrices (eg 1000x1000) probably gain much less from compile time known sizes. There is also the matter of template bloat that means that compile time quantities are best left to the smaller numbers.

> Requirements:
> 1. Must be compatible with future multi-dimensional array syntax such as
> double [7,5] m;
> 2. Must be theoretically capable of optimal performance.

Seems reasonable.

> 3. Needs to support slices. This means we also need to support strided
> vectors. Therefore we also need matrix slices, and matrix references.

Yes, both strided views and non-strided references are definitely needed.

> Probably distinguishing between the T[] and T[new] concepts.

Such a distinction is definitely of an advantage. For large matrices, ownership of memory becomes important, as such structures may be too large to be able to rely on the GC.

> 4. For storage we need packed, triangular, symmetric, banded, and sparse
> matrix support. (Possibly even 'calculated' matrix support, for
> user-defined matrix types which generate entries on-demand).

For generality, at least sparse (and possibly calculated) are useful for vectors and higher dimensional arrays too.

> 5. Can be generalized to tensors.
> 6. Nice error messages.
> 
> I now have enough experience with my BLADE library to be confident that all of these features can be achieved in existing D. In particular, the crucial requirement 2 can be completely decoupled from the others.

Sounds excellent.

> But, there are many requirements (dozens?) which are not on my brief list, and perhaps they cannot all be met simultaneously, even in theory. Yet I'm not convinced that it's impossible. What are the other requirements?

Naturally, different element types (as well UDT). Compatibilty with external libraries (Fortran/C) with the possibility of switching between Fortran and C layout.

-- 
Oskar
November 20, 2007
> Requirements:
> 1. Must be compatible with future multi-dimensional array syntax such as
> double [7,5] m;
> 2. Must be theoretically capable of optimal performance.
> 3. Needs to support slices. This means we also need to support strided
> vectors. Therefore we also need matrix slices, and matrix references.
> Probably distinguishing between the T[] and T[new] concepts.
> 4. For storage we need packed, triangular, symmetric, banded, and sparse
> matrix support. (Possibly even 'calculated' matrix support, for
> user-defined matrix types which generate entries on-demand).
> 5. Can be generalized to tensors.
> 6. Nice error messages.

Another requirement would be elegant syntax that "fits" with D syntax.  Is that what you were trying to communicate with requirement 1?  If I remember correctly doesn't BLAS syntax look a little funny because it relies on mixins?

If you want requirement 2 then you must be able to leverage any available GPU's or GPGPU's.  This is going to be especially useful when Nvidia and AMD release their GPU-like processors that will be able to do double precision floating point operations.  Nvidia's (yet unnamed) processor is due out in December.  AMD's Firestorm is due in the first quarter of '08.  My company has plans to develop a matrix library for physics simulations that will support these new processors.  However, if one was already available for us then we wouldn't need to develop it.

-Craig 

November 20, 2007
Don Clugston wrote:
> [about TMtEAMC]

Very interesting indeed! I do a lot of scientific
computation with Matlab, and I'd love to have a
powerful matrix lib for D. It's just I'm currently
swamped with other work, so I couldn't be much
help... :-(
The first thing that springs in my mind is which
operations to choose for operator overloading?
Given that a distinction is made between column
and row vectors, we can use the operand order
to choose between inner and outer product.
But what about element-wise product?
Same goes for division.
Maybe "fake infix operators" could be used?

regards, frank
November 20, 2007
> Another requirement would be elegant syntax that "fits" with D syntax.  Is
> that what you were trying to communicate with requirement 1?  If I remember correctly doesn't BLAS syntax look a little funny because it relies on mixins?

Er, I meant BLADE not BLAS. 

November 20, 2007
For reference, here's the state of BLADE:

Syntax is:

mixin(vectorize("expression"));

eg,

mixin(vectorize("some_array -= 3.2 * (5.8 * another_array) * some_constant"));

It would be so nice to get rid of the mixin() and the quotes. But they don't impede development in any way. Ideally it would be:

vectorize(some_array -= 3.2 * (5.8 * another_array) * some_constant);

Most significant changes since I my last post about it.
(a) it finally works in a usable form :-). Thanks Walter for some great bugfixes in the past few releases!  I'm using dmd 1.023.
(b) generates X87, SSE, SSE2, or inline D code, depending on the complexity of the expression and the types involved.
(c) an expression rewriting step has been added, which performs scalar/const folding, and indexing/slicing folding
   (eg, (A+7*B)[0..5] --->   A[0..5] + 7*(B[0..5]) ).
(d) it gives really, really nice error messages and debug output. No template garbage from the library internals -- just a straightforward one-line error message, giving the line in YOUR code which contains the error.

eg,   with an array a[],
    mixin(vectorize("any+= old*garbage"));
    mixin(vectorize("a+= 2"));

you get this output (and NOTHING ELSE):
demo.d(38): static assert  "BLADE: Undefined symbols: any old garbage"
demo.d(39): static assert  "BLADE: Rank mismatch (addition or subtraction)"

It's still not terribly useful, since it only generates code for packed vectors. But the infrastructure is very solid.

Here's a particularly nasty example which the constant folding can cope with, to generate SSE2 code:

    double [] a = new double[4];
    double [] d = [0.5, 2.8, 17.0, 28.25, 1, 56.2, 3.4];
    a[0..$] = [3.4, 565, 31.3, 41.8];
    double [4][] another = [[33.1, 4543, 43, 878.7], [5.14, 455, 554, 2.43]];

    mixin(vectorize(
` a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]`));

-------
Generates this front-end code (compile with -debug=BladeFrontEnd to see it). Note that there are many asserts to give nice debug info at runtime, but the only runtime code is a single function call, which passes 3 pointers and a double into an asm function (there's no inlining work for the compiler to do):

------
// bladedemo.d(34)  a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]
assert(a.length==another[][1][(3u-3)..$].length, `Vector length mismatch`);
assert(d[2..($-1)][(3u-3)..$].length==another[][1][(3u-3)..$].length, `Vector length mismatch`);
assert( (cast(size_t)(a.ptr)& 0x0F) == 0, `SSE Vector misalignment: a`);
assert( (cast(size_t)(d[2..($-1)][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector misalignment: d[2..($-1)][(3u-3)..$]`);
assert( (cast(size_t)(another[][1][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector misalignment: another[][1][(3u-3)..$]`);

SSEVECGEN!(2,"A+=((B*C)-D)",double*,double,double*,double*)(another[][1][(3u-3)..$].length,&a[0],((a[2])*2.01),&d[2..($-1)][(3u-3)..$][0],&another[][1][(3u-3)..$][0]);

-----
The function consists of this ASM code (compile with -debug=BladeBackEnd to see it; BTW all comments are auto-generated). Note there are only 8 asm instructions in the inner loop:

-------
// Operation : ACB*D-+A=

asm {
 push EBX;
  mov EAX, veclength;
  lea ECX, [8*EAX];     add ECX, values[0];  //A
  movsd XMM0, values[1];   shufpd XMM0, XMM0,0; //B
  lea EDX, [8*EAX];     add EDX, values[2];  //C
  lea EBX, [8*EAX];     add EBX, values[3];  //D
  xor EAX, EAX;
  sub EAX, veclength; // counter=-length
  jz short L2; // test for length==0

  align 16;
L1:
  movapd XMM1, [ECX + 8*EAX];  // A
  movapd XMM2, [EDX + 8*EAX];  // C
  mulpd XMM2, XMM0; // B*
  subpd  XMM2, [EBX + 8*EAX];  // D-
  addpd XMM1, XMM2;  //+
  movapd [ECX + 8*EAX], XMM1;  // A=
  add EAX,2;
  js L1;
L2:
  sub EAX, 2;
  jns L4;
  movsd XMM1, [ECX + 8*EAX+16];  // A
  movsd XMM2, [EDX + 8*EAX+16];  // C
  mulsd XMM2, XMM0; // B*
  subsd  XMM2, [EBX + 8*EAX+16];  // D-
  addsd XMM1, XMM2;  //+
  movsd [ECX + 8*EAX+16], XMM1;  // A=
L4:
;  pop EBX;
}
-------
November 20, 2007
"Don Clugston" <dac@nospam.com.au> wrote in message news:fhv00l$13eq$1@digitalmars.com...
> For reference, here's the state of BLADE:
>
> Syntax is:
>
> mixin(vectorize("expression"));
>
> eg,
>
> mixin(vectorize("some_array -= 3.2 * (5.8 * another_array) * some_constant"));
>
> It would be so nice to get rid of the mixin() and the quotes. But they don't impede development in any way. Ideally it would be:
>
> vectorize(some_array -= 3.2 * (5.8 * another_array) * some_constant);
>
> Most significant changes since I my last post about it.
> (a) it finally works in a usable form :-). Thanks Walter for some great
> bugfixes in the past few releases!  I'm using dmd 1.023.
> (b) generates X87, SSE, SSE2, or inline D code, depending on the
> complexity of the expression and the types involved.
> (c) an expression rewriting step has been added, which performs
> scalar/const folding, and indexing/slicing folding
>    (eg, (A+7*B)[0..5] --->   A[0..5] + 7*(B[0..5]) ).
> (d) it gives really, really nice error messages and debug output. No
> template garbage from the library internals -- just a straightforward
> one-line error message, giving the line in YOUR code which contains the
> error.
>
> eg,   with an array a[],
>     mixin(vectorize("any+= old*garbage"));
>     mixin(vectorize("a+= 2"));
>
> you get this output (and NOTHING ELSE):
> demo.d(38): static assert  "BLADE: Undefined symbols: any old garbage"
> demo.d(39): static assert  "BLADE: Rank mismatch (addition or
> subtraction)"
>
> It's still not terribly useful, since it only generates code for packed vectors. But the infrastructure is very solid.
>
> Here's a particularly nasty example which the constant folding can cope with, to generate SSE2 code:
>
>     double [] a = new double[4];
>     double [] d = [0.5, 2.8, 17.0, 28.25, 1, 56.2, 3.4];
>     a[0..$] = [3.4, 565, 31.3, 41.8];
>     double [4][] another = [[33.1, 4543, 43, 878.7], [5.14, 455, 554,
> 2.43]];
>
>     mixin(vectorize(
> ` a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]`));
>
> -------
> Generates this front-end code (compile with -debug=BladeFrontEnd to see it). Note that there are many asserts to give nice debug info at runtime, but the only runtime code is a single function call, which passes 3 pointers and a double into an asm function (there's no inlining work for the compiler to do):
>
> ------
> // bladedemo.d(34)  a +=
> (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]
> assert(a.length==another[][1][(3u-3)..$].length, `Vector length
> mismatch`);
> assert(d[2..($-1)][(3u-3)..$].length==another[][1][(3u-3)..$].length,
> `Vector length mismatch`);
> assert( (cast(size_t)(a.ptr)& 0x0F) == 0, `SSE Vector misalignment: a`);
> assert( (cast(size_t)(d[2..($-1)][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector
> misalignment: d[2..($-1)][(3u-3)..$]`);
> assert( (cast(size_t)(another[][1][(3u-3)..$].ptr)& 0x0F) == 0, `SSE
> Vector misalignment: another[][1][(3u-3)..$]`);
>
> SSEVECGEN!(2,"A+=((B*C)-D)",double*,double,double*,double*)(another[][1][(3u-3)..$].length,&a[0],((a[2])*2.01),&d[2..($-1)][(3u-3)..$][0],&another[][1][(3u-3)..$][0]);
>
> -----
> The function consists of this ASM code (compile with -debug=BladeBackEnd to see it; BTW all comments are auto-generated). Note there are only 8 asm instructions in the inner loop:
>
> -------
> // Operation : ACB*D-+A=
>
> asm {
>  push EBX;
>   mov EAX, veclength;
>   lea ECX, [8*EAX];     add ECX, values[0];  //A
>   movsd XMM0, values[1];   shufpd XMM0, XMM0,0; //B
>   lea EDX, [8*EAX];     add EDX, values[2];  //C
>   lea EBX, [8*EAX];     add EBX, values[3];  //D
>   xor EAX, EAX;
>   sub EAX, veclength; // counter=-length
>   jz short L2; // test for length==0
>
>   align 16;
> L1:
>   movapd XMM1, [ECX + 8*EAX];  // A
>   movapd XMM2, [EDX + 8*EAX];  // C
>   mulpd XMM2, XMM0; // B*
>   subpd  XMM2, [EBX + 8*EAX];  // D-
>   addpd XMM1, XMM2;  //+
>   movapd [ECX + 8*EAX], XMM1;  // A=
>   add EAX,2;
>   js L1;
> L2:
>   sub EAX, 2;
>   jns L4;
>   movsd XMM1, [ECX + 8*EAX+16];  // A
>   movsd XMM2, [EDX + 8*EAX+16];  // C
>   mulsd XMM2, XMM0; // B*
>   subsd  XMM2, [EBX + 8*EAX+16];  // D-
>   addsd XMM1, XMM2;  //+
>   movsd [ECX + 8*EAX+16], XMM1;  // A=
> L4:
> ;  pop EBX;
> }
> -------

Very very cool Don!  This obviously has huge potential.  Do you have any plans to support GPU's or multicores?  Have you looked into SSE3 or SSE4 to see if there are any new instructions that would be useful?

Double precision support is the most important to me, but I am also interested in single precision and extended precision.  To what degree are they supported?  I assume that SSE instructions can't be used for extended precision operations.

So there is only support for one dimensional arrays?  When will two dimensional arrays be supported?  Once two dimensional arrays are implemented, a simple test for the code generator would be to generate optimized code for single precision multiplication of 3x3 or 4x4 matrices. Optimal code for single precision 3x3 and 4x4 matrix multiplication is widely available on the internet.

-Craig


November 20, 2007
"Don Clugston" <dac@nospam.com.au> wrote in message news:fhv00l$13eq$1@digitalmars.com...
> For reference, here's the state of BLADE:
>
> Syntax is:
>
> mixin(vectorize("expression"));
>
> eg,
>
> mixin(vectorize("some_array -= 3.2 * (5.8 * another_array) * some_constant"));
>
> It would be so nice to get rid of the mixin() and the quotes. But they don't impede development in any way. Ideally it would be:
>
> vectorize(some_array -= 3.2 * (5.8 * another_array) * some_constant);
>
> Most significant changes since I my last post about it.
> (a) it finally works in a usable form :-). Thanks Walter for some great
> bugfixes in the past few releases!  I'm using dmd 1.023.
> (b) generates X87, SSE, SSE2, or inline D code, depending on the
> complexity of the expression and the types involved.
> (c) an expression rewriting step has been added, which performs
> scalar/const folding, and indexing/slicing folding
>    (eg, (A+7*B)[0..5] --->   A[0..5] + 7*(B[0..5]) ).
> (d) it gives really, really nice error messages and debug output. No
> template garbage from the library internals -- just a straightforward
> one-line error message, giving the line in YOUR code which contains the
> error.
>
> eg,   with an array a[],
>     mixin(vectorize("any+= old*garbage"));
>     mixin(vectorize("a+= 2"));
>
> you get this output (and NOTHING ELSE):
> demo.d(38): static assert  "BLADE: Undefined symbols: any old garbage"
> demo.d(39): static assert  "BLADE: Rank mismatch (addition or
> subtraction)"
>
> It's still not terribly useful, since it only generates code for packed vectors. But the infrastructure is very solid.
>
> Here's a particularly nasty example which the constant folding can cope with, to generate SSE2 code:
>
>     double [] a = new double[4];
>     double [] d = [0.5, 2.8, 17.0, 28.25, 1, 56.2, 3.4];
>     a[0..$] = [3.4, 565, 31.3, 41.8];
>     double [4][] another = [[33.1, 4543, 43, 878.7], [5.14, 455, 554,
> 2.43]];
>
>     mixin(vectorize(
> ` a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]`));
>
> -------
> Generates this front-end code (compile with -debug=BladeFrontEnd to see it). Note that there are many asserts to give nice debug info at runtime, but the only runtime code is a single function call, which passes 3 pointers and a double into an asm function (there's no inlining work for the compiler to do):
>
> ------
> // bladedemo.d(34)  a +=
> (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]
> assert(a.length==another[][1][(3u-3)..$].length, `Vector length
> mismatch`);
> assert(d[2..($-1)][(3u-3)..$].length==another[][1][(3u-3)..$].length,
> `Vector length mismatch`);
> assert( (cast(size_t)(a.ptr)& 0x0F) == 0, `SSE Vector misalignment: a`);
> assert( (cast(size_t)(d[2..($-1)][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector
> misalignment: d[2..($-1)][(3u-3)..$]`);
> assert( (cast(size_t)(another[][1][(3u-3)..$].ptr)& 0x0F) == 0, `SSE
> Vector misalignment: another[][1][(3u-3)..$]`);
>
> SSEVECGEN!(2,"A+=((B*C)-D)",double*,double,double*,double*)(another[][1][(3u-3)..$].length,&a[0],((a[2])*2.01),&d[2..($-1)][(3u-3)..$][0],&another[][1][(3u-3)..$][0]);
>
> -----
> The function consists of this ASM code (compile with -debug=BladeBackEnd to see it; BTW all comments are auto-generated). Note there are only 8 asm instructions in the inner loop:
>
> -------
> // Operation : ACB*D-+A=
>
> asm {
>  push EBX;
>   mov EAX, veclength;
>   lea ECX, [8*EAX];     add ECX, values[0];  //A
>   movsd XMM0, values[1];   shufpd XMM0, XMM0,0; //B
>   lea EDX, [8*EAX];     add EDX, values[2];  //C
>   lea EBX, [8*EAX];     add EBX, values[3];  //D
>   xor EAX, EAX;
>   sub EAX, veclength; // counter=-length
>   jz short L2; // test for length==0
>
>   align 16;
> L1:
>   movapd XMM1, [ECX + 8*EAX];  // A
>   movapd XMM2, [EDX + 8*EAX];  // C
>   mulpd XMM2, XMM0; // B*
>   subpd  XMM2, [EBX + 8*EAX];  // D-
>   addpd XMM1, XMM2;  //+
>   movapd [ECX + 8*EAX], XMM1;  // A=
>   add EAX,2;
>   js L1;
> L2:
>   sub EAX, 2;
>   jns L4;
>   movsd XMM1, [ECX + 8*EAX+16];  // A
>   movsd XMM2, [EDX + 8*EAX+16];  // C
>   mulsd XMM2, XMM0; // B*
>   subsd  XMM2, [EBX + 8*EAX+16];  // D-
>   addsd XMM1, XMM2;  //+
>   movsd [ECX + 8*EAX+16], XMM1;  // A=
> L4:
> ;  pop EBX;
> }
> -------

Another question.  How practical would this mixin technique be to generate optimal floating point code for regular floating point operations (not vectors or matrices)?  The DMD backend currently doesn't optimize floating point very well.  Perhaps this approach could be used for floating point operations in general.


November 20, 2007
Craig Black wrote:
> "Don Clugston" <dac@nospam.com.au> wrote in message news:fhv00l$13eq$1@digitalmars.com...
>> For reference, here's the state of BLADE:
>> -------
> 
> Very very cool Don!  This obviously has huge potential.  Do you have any plans to support GPU's or multicores?  Have you looked into SSE3 or SSE4 to see if there are any new instructions that would be useful?

Yes. Although it seems to me that the GPU world is about to change radically -- using OpenGL looks like a horrible hack without long-term future. I'll wait until we get a decent API, I think.
I have been looking at SSE3, 4, and 5, but I don't even have a Core2 machine...

> Double precision support is the most important to me, but I am also interested in single precision and extended precision.  To what degree are they supported?  

Yes.

(b) generates X87, SSE, SSE2, or inline D code, depending on the complexity of the expression and the types involved.

I assume that SSE instructions can't be used for extended
> precision operations.

Yes. I don't think they're even worth doing when strided arrays are involved, except in special cases.

> 
> So there is only support for one dimensional arrays?  When will two dimensional arrays be supported? 

Strided arrays are next, then 2D.

 Once two dimensional arrays are
> implemented, a simple test for the code generator would be to generate optimized code for single precision multiplication of 3x3 or 4x4 matrices. Optimal code for single precision 3x3 and 4x4 matrix multiplication is widely available on the internet.
> 
> -Craig 
> 
> 
November 20, 2007
"Don Clugston" <dac@nospam.com.au> wrote in message news:fhv35u$18nc$1@digitalmars.com...
> Craig Black wrote:
>> "Don Clugston" <dac@nospam.com.au> wrote in message news:fhv00l$13eq$1@digitalmars.com...
>>> For reference, here's the state of BLADE:
>>> -------
>>
>> Very very cool Don!  This obviously has huge potential.  Do you have any plans to support GPU's or multicores?  Have you looked into SSE3 or SSE4 to see if there are any new instructions that would be useful?
>
> Yes. Although it seems to me that the GPU world is about to change
> radically --
> using OpenGL looks like a horrible hack without long-term future. I'll
> wait until we get a decent API, I think.
> I have been looking at SSE3, 4, and 5, but I don't even have a Core2
> machine...
>
>> Double precision support is the most important to me, but I am also interested in single precision and extended precision.  To what degree are they supported?
>
> Yes.
>
> (b) generates X87, SSE, SSE2, or inline D code, depending on the complexity of the expression and the types involved.
>
> I assume that SSE instructions can't be used for extended
>> precision operations.
>
> Yes. I don't think they're even worth doing when strided arrays are involved, except in special cases.
>
>>
>> So there is only support for one dimensional arrays?  When will two dimensional arrays be supported?
>
> Strided arrays are next, then 2D.

What are strided arrays?  Are they similar to sparse arrays?


« First   ‹ Prev
1 2 3
Top | Discussion index | About this forum | D home