November 21, 2012
Array Operations: a[] + b[] etc.
First things first: I'm not actually sure what the current spec 
for this is. http://dlang.org/arrays.html is not the clearest on 
the subject, and it seems to rule out a lot of things that I 
reckon should work.

For this post I'm going to use the latest dmd from github. 
Behaviour sometimes differs considerably between versions of 
dmd, let alone between gdc and ldc.

e.g.

int[] a = [1,2,3,4];
int[] b = [6,7,8,9];
int[] c;
int[] d = [10];
int[] e = [0,0,0,0];

a[] += b[];       // result [7, 9, 11, 13], as expected

c = a[] + b[];    // Error: Array operation a[] + b[] not 
implemented

c[] = a[] + b[];  // result [], a run-time assert on some 
compilers/versions
d[] = a[] + b[];  // result [7], also a run-time assert on some 
compilers/versions


My vision of how things could work:
c = a[] opBinary b[];
should be legal. It should create a new array that is then 
reference-assigned to c.

d[] = a[] opBinary b[];
should be d[i] = a[i] opBinary b[i] for all i in 0..length.
What should the length be? Do we silently truncate to the 
shortest array, or do we run-time assert (as ldc2 does, and as 
dmd did for a while between 2.060 and now)? Currently dmd (and 
gdc) does neither of these reliably, e.g.
d[] = a[] + b[] results in [7],
a[] = d[] + b[] reads out of bounds and results in garbage such 
as [16, 32747, 38805832, 67108873]
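To make the design question concrete, here is a small Python sketch of the two candidate policies. The helper names add_truncate and add_checked are purely illustrative, not real D semantics:

```python
# Two candidate policies for d[] = a[] + b[] with mismatched lengths.
# add_truncate and add_checked are hypothetical names for illustration.

def add_truncate(dst, a, b):
    # Policy 1: silently truncate to the shortest of the three arrays.
    n = min(len(dst), len(a), len(b))
    for i in range(n):
        dst[i] = a[i] + b[i]
    return dst

def add_checked(dst, a, b):
    # Policy 2: run-time assert on any length mismatch, as ldc2 does.
    if not (len(dst) == len(a) == len(b)):
        raise ValueError("array operation length mismatch")
    for i in range(len(dst)):
        dst[i] = a[i] + b[i]
    return dst

print(add_truncate([10], [1, 2, 3, 4], [6, 7, 8, 9]))  # [7]
```

Under the truncating policy the d[] = a[] + b[] case above gives [7] by design rather than by accident.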

Another nice thing I miss from working in IDL (though I'm not 
sure how it would be possible in D): given a multidimensional 
array, I should be able to slice and index along any axis.
for example:
int[4][3] m = [[0,1,2,3],
               [4,5,6,7],
               [8,9,10,11]];
I can index vertically, i.e. m[1] == [4,5,6,7], but there's no 
syntactic sugar for indexing horizontally. Obviously m[][2] just 
gives me the 3rd row, so what could be a nice concise statement 
suddenly requires a manually written loop that the compiler has 
to work its way through, extracting the meaning (see Walter on 
this, here: http://www.drdobbs.com/loop-optimizations/229300270)

A possible approach, heavily tried and tested in numpy and IDL: 
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
http://www.atmos.umd.edu/~gcm/usefuldocs/IDL.html#operations

Use multiple indices within the brackets.
    m[1,2] would be identical to m[1][2], returning 6
    m[0..2,3] would return [3,7]
    m[,2] would give me [2,6,10]
    Alternative syntax could be m[*,2], m[:,2], or we could even 
require m[0..$,2]; I don't know how much of a technical challenge 
each of these would pose for lexing and parsing.
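For comparison, numpy's multi-axis indexing (described in the link above) already behaves exactly like the proposal, with : playing the role of the proposed empty/*/0..$ index:

```python
import numpy as np

m = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11]])

print(m[1, 2])             # 6        -- the proposed m[1,2]
print(m[0:2, 3].tolist())  # [3, 7]   -- the proposed m[0..2,3]
print(m[:, 2].tolist())    # [2, 6, 10] -- the proposed m[,2]
```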

//An example: let's imagine a greyscale image, stored as an array 
of pixel rows:

double[][] img = read_bmp(fn,"grey");

//we want to crop it to some user defined co-ords (x1,y1),(x2,y2):

//Version A, current syntax

auto img_cropped = img[y1..y2].dup;
foreach(ref row; img_cropped) {
    row = row[x1..x2];
}
//3 lines of code for a very simple idea.

//Version B, new syntax

auto img_cropped = img[y1..y2, x1..x2];

//Very simple, easy-to-read code that is clear in its purpose.

I propose that Version B would be equivalent to A: An independent 
window on the data. Any reassignment of a row (i.e. pointing it 
to somewhere else, not copying new data in) will have no effect 
on the data. This scales naturally to higher dimensions and is in 
agreement with the normal slicing rules: the slice itself is 
independent of the original, but the data inside is shared.
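For what it's worth, this is exactly how a two-axis slice behaves in numpy: a one-line view whose data is shared with the original. A small sketch using a synthetic image (not the hypothetical read_bmp):

```python
import numpy as np

img = np.arange(16.0).reshape(4, 4)  # stand-in for a greyscale image
y1, y2, x1, x2 = 1, 3, 0, 2

crop = img[y1:y2, x1:x2]  # one line, no manual loop

crop[0, 0] = -1.0         # writing through the view...
print(img[1, 0])          # -1.0 -- ...changes the original's data
```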

I believe this would be a significant improvement to D, 
particularly for image processing and scientific applications.

P.S.
As you can probably tell, I have no experience in compiler 
design! I may be missing something that makes all of this 
impossible/impractical. I also don't think this would have to 
cause any code breakage at all, but again, I could be wrong.

P.P.S.
I think there may be something quite wrong with how the frontend 
understands current array expression syntax... see here: 
http://dpaste.dzfl.pl/f4a931db
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On 11/21/2012 10:02 AM, John Colvin wrote:
> My vision of how things could work:
> c = a[] opBinary b[];
> should be legal. It should create a new array that is then reference assigned to c.

This is not done because it puts excessive pressure on the garbage collector. 
Array ops do not allocate memory by design.
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On Wednesday, 21 November 2012 at 18:15:51 UTC, Walter Bright 
wrote:
> On 11/21/2012 10:02 AM, John Colvin wrote:
>> My vision of how things could work:
>> c = a[] opBinary b[];
>> should be legal. It should create a new array that is then 
>> reference assigned to c.
>
> This is not done because it puts excessive pressure on the 
> garbage collector. Array ops do not allocate memory by design.

We really need better error messages for this, though – Andrej? 
;)

David
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On Wednesday, 21 November 2012 at 18:15:51 UTC, Walter Bright 
wrote:
> On 11/21/2012 10:02 AM, John Colvin wrote:
>> My vision of how things could work:
>> c = a[] opBinary b[];
>> should be legal. It should create a new array that is then 
>> reference assigned to c.
>
> This is not done because it puts excessive pressure on the 
> garbage collector. Array ops do not allocate memory by design.

Fair enough. When you say excessive pressure, is that performance 
pressure or design pressure?
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On 11/21/12 1:20 PM, John Colvin wrote:
> On Wednesday, 21 November 2012 at 18:15:51 UTC, Walter Bright wrote:
>> On 11/21/2012 10:02 AM, John Colvin wrote:
>>> My vision of how things could work:
>>> c = a[] opBinary b[];
>>> should be legal. It should create a new array that is then reference
>>> assigned to c.
>>
>> This is not done because it puts excessive pressure on the garbage
>> collector. Array ops do not allocate memory by design.
>
> Fair enough. When you say excessive pressure, is that performance
> pressure or design pressure?

Performance pressure - the design here is rather easy if efficiency is 
not a concern.

Andrei
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On Wednesday, 21 November 2012 at 18:38:59 UTC, Andrei 
Alexandrescu wrote:
> On 11/21/12 1:20 PM, John Colvin wrote:
>> On Wednesday, 21 November 2012 at 18:15:51 UTC, Walter Bright 
>> wrote:
>>> On 11/21/2012 10:02 AM, John Colvin wrote:
>>>> My vision of how things could work:
>>>> c = a[] opBinary b[];
>>>> should be legal. It should create a new array that is then 
>>>> reference
>>>> assigned to c.
>>>
>>> This is not done because it puts excessive pressure on the 
>>> garbage
>>> collector. Array ops do not allocate memory by design.
>>
>> Fair enough. When you say excessive pressure, is that 
>> performance
>> pressure or design pressure?
>
> Performance pressure - the design here is rather easy if 
> efficiency is not a concern.
>
> Andrei

In what way does it become a performance problem? Apologies for 
the naive questions, I have nothing more than a passing 
understanding of how garbage collection works.
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On 11/21/2012 07:02 PM, John Colvin wrote:
> //An example: let's imagine a greyscale image, stored as an array of
> pixel rows:
>
> double[][] img = read_bmp(fn,"grey");
>
> //we want to crop it to some user defined co-ords (x1,y1),(x2,y2):
>
> //Version A, current syntax
>
> auto img_cropped = img[y1..y2].dup;
> foreach(ref row; img_cropped) {
>      row = row[x1..x2];
> }
> //3 lines of code for a very simple idea.
>
> //Version B, new syntax
>
> auto img_cropped = img[y1..y2, x1..x2];
>
> //Very simple, easy-to-read code that is clear in its purpose.
>
> I propose that Version B would be equivalent to A: An independent window
> on the data. Any reassignment of a row (i.e. pointing it to somewhere
> else, not copying new data in) will have no effect on the data. This
> scales naturally to higher dimensions and is in agreement with the
> normal slicing rules: the slice itself is independent of the original,
> but the data inside is shared.
>
> I believe this would be a significant improvement to D, particularly for
> image processing and scientific applications.

If you want to use this syntax with images, DMagick's ImageView might be 
interesting:
http://dmagick.mikewey.eu/docs/ImageView.html

-- 
Mike Wey
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On 11/21/2012 10:41 AM, John Colvin wrote:
> In what way does it become a performance problem?

Allocating memory is always much, much slower than not allocating memory.

A design that forces allocating new memory and discarding the old as opposed to 
reusing existing already allocated memory is going to be far slower. In fact, 
the allocation/deallocation is going to dominate the performance timings, not 
the array operation itself.

Generally, people who use array operations want them to be really fast.
November 21, 2012
Re: Array Operations: a[] + b[] etc.
On 11/21/12, David Nadlinger <see@klickverbot.at> wrote:
> We really need better error messages for this, though – Andrej?
> ;)

Considering how slowly pulls are being merged... I don't think it's worth
my time hacking on dmd. Anyway, I have other things I'm working on now.
November 22, 2012
Re: Array Operations: a[] + b[] etc.
On Wednesday, 21 November 2012 at 20:16:59 UTC, Walter Bright 
wrote:
> On 11/21/2012 10:41 AM, John Colvin wrote:
>> In what way does it become a performance problem?
>
> Allocating memory is always much, much slower than not 
> allocating memory.
>
> A design that forces allocating new memory and discarding the 
> old as opposed to reusing existing already allocated memory is 
> going to be far slower. In fact, the allocation/deallocation is 
> going to dominate the performance timings, not the array 
> operation itself.
>
> Generally, people who use array operations want them to be 
> really fast.

Well yes, of course, I thought you meant something more esoteric. 
I'm not suggesting that we replace the current design, simply 
extend it.

We'd have:

c[] = a[] + b[];
a fast, in-place array operation; the cost of allocation is paid 
earlier in the code.

but also
c = a[] + b[];
a much slower, memory-allocating array operation, pretty much 
just shorthand for
c = new T[a.length];
c[] = a[] + b[];

You could argue that hiding an allocation is bad, but I would 
think it's quite obvious to any programmer that if you add 2 
arrays together, you're going to have to allocate somewhere to 
put the result... Having the shorthand prevents any possible 
mistakes with the length of the new array and saves a line of 
code.
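numpy makes the same split, which may be a useful precedent here: the binary operator allocates a fresh array, while the out= form reuses existing storage:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([6, 7, 8, 9])

# Allocating form: analogous to the proposed c = a[] + b[].
c = a + b
print(c.tolist())  # [7, 9, 11, 13]

# In-place form: analogous to c[] = a[] + b[] into preallocated storage.
d = np.empty_like(a)
np.add(a, b, out=d)
print(d.tolist())  # [7, 9, 11, 13]
```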

Anyway, this is a pretty trivial matter; I'd be more interested 
in seeing a definitive answer on the correct behaviour of the 
statement a[] = b[] + c[] when the arrays have different lengths.