Array types

Aug 26, 2010

bearophile

Aug 27, 2010

Norbert Nemec

Aug 27, 2010

Aug 27, 2010

Aug 29, 2010

Aug 29, 2010

August 26, 2010

Posted by bearophile

Permalink

bearophile

Permalink

Arrays are one of the most useful and most efficient data structures for nonfunctional languages. They look simple, but in a low-level language you sometimes need various kinds of them. So they are not so simple.

In D there are two kinds of built-in arrays:
A) 1D Dynamic arrays on the heap
B) Almost-nD rectangular fixed-sized arrays allocated on the stack by functions (or data segment), they may also go on the heap if they are part of an object (and with placement new they may go everywhere).

So far in D code I have had need of (and I have seen some people need) some other kinds of arrays, not (well) supported by D:
C) nD rectangular dynamic arrays;
D) Dynamic arrays allocated on the stack;
E) Fixed-sized arrays allocated on the heap;
F) Variable-sized structs allocated on the heap.

----------------------

The (C) arrays are not the same thing as dynamic arrays of dynamic arrays because:
- Some algorithms are not designed for a triangle where rows may differ in length. Testing that rows are all the same length at runtime wastes time, and if you don't test it then it may cause bugs;
- Sometimes you want to reshape a 2D matrix, so it's useful to store the matrix in contiguous memory;
- Sometimes you need complex slicing, for example removal of some columns and rows from a 2D matrix.

Recently in D.learn we have discussed this a bit.

In C# there are both arrays of arrays as in D, and rectangular dynamic arrays, that are manages with a syntax similar to Pascal arrays: arr[x,y,z]. I think there is no so need to put them into D. But I think it's important to have a official type of them in Phobos, for example implemented as a struct (only the item type and number of dimensions are known at compile time):

RectangularArray!(T, int ndimensions)

----------------------

The (D) arrays are sometimes useful in some high-performance routines. You can't implement them in Phobos. C99 has added them as VLA arrays. In bug 4357 I have proposed to use "scope", but that was not appreciated.

If you want (D) arrays you currently need to write something like (manually inlined code):

T[] arr = (cast(T*)alloca(n * T.sizeof))[0 .. n];
arr[] = T.init; // optional, but useful if T is a pointer or struct that contains pointers

This is not too much bad, and using a template mixin may help the situation.

D is designed to make simple the safe operations (like to create (A) arrays), and to make more elaborate the less safe operations. But in my opinion it's wrong to make something less safe than possible. Even if stack-allocated arrays are unsafe (if you escape their pointer/reference/fat pointer you will have bugs), there is no need to make _their creation too_ more unsafe/bug-prone than necessary.

A possibility is to use the C99 syntax, that is just allowing runtime variables when you define a "fixed-sized" array:

int n = 10;
int[n] arr;

That syntax is clean, readable, standard (it comes from C), easy to understand. A disadvantage is that visually it's not easy to tell apart a VLA from a normal fixed-sized array.

This is not allowed, because n is a runtime value:

int[n] foo(int n) {
  int[n] arr;
  return arr;
}

But this is allowed, and the arr needs to be copied on the heap:

int[] foo(int n) {
  int[n] arr;
  return arr;
}

I don't like this much, it adds some complexity and more cases.

----------------------

I have used (E) arrays while creating a deque, composed of fixed sized arrays on the heap. In this case their lengths are fixed and known at compile-time. In D I usually implement them wrapping them into a struct:

struct Foo(T, int SIZE) {
    T[SIZE] a;
}
void main() {
    auto f = new Foo!(int, 1024)();
}

Later I use the array as f.a[x]. This is not too much bad. Implementing (E) arrays in D gives problems syntax-wise, because in C you are free to use pointers too as arrays. So I think the current situation (using the wrapper struct) is good enough for the few situations were (E) arrays are useful.

----------------------

D supports zero-length fixed-sized arrays, useful to implement (F) "arrays" (but it seems LLVM does't support them, so in LDC they need 1 bit, I am not sure where such bit actually goes).

Variable-sized structs are not commonly used in high-level code, they are uncommon in general. So there is no need for D to support them as built-ins, they may be implemented manually by the programmer as need arise, also because they probably need to be quite customized for each situation.

On the other hand it's not too much bad to put into Phobos an intrusive Variable-sized struct type, something like:

struct VariableLengthStruct(T, Cargo) {
   Cargo cargo;
   size_t length;
   T[0] data;
   // operator methods
}

Where Cargo is the type of the intrusive payload, and VariableLengthStruct implements the opIndex method too that maps on the data attribute.

A helper function too may be added to help the allocation of a VariableLengthStruct.

In some situations VariableLengthStruct can't be used, because you need more custom structure, but I think that VariableLengthStruct is sometimes enough.

----------------------

The summary of this post is:
- It's positive to add efficient and _standard_ nD rectangular dynamic arrays to Phobos, as RectangularArray.
- Maybe a standard variable-sized struct is good in Phobos, like VariableLengthStruct.
- I'd still like syntax support for variable-sized stack-allocated arrays. But I don't like the extra complexity it seem to need. More thinking is needed about this.

Bye,
bearophile

Yippie-yeah! My favorite topic is being mentioned! I have to give a comment... :-) On 26/08/10 22:38, bearophile wrote: > [...] > So far in D code I have had need of (and I have seen some people need) some other kinds of arrays, not (well) supported by D: > C) nD rectangular dynamic arrays; > [...] > The (C) arrays are not the same thing as dynamic arrays of dynamic arrays because: > - Some algorithms are not designed for a triangle where rows may differ in length. Testing that rows are all the same length at runtime wastes time, and if you don't test it then it may cause bugs; > - Sometimes you want to reshape a 2D matrix, so it's useful to store the matrix in contiguous memory; > - Sometimes you need complex slicing, for example removal of some columns and rows from a 2D matrix. > > Recently in D.learn we have discussed this a bit. > > In C# there are both arrays of arrays as in D, and rectangular dynamic arrays, that are manages with a syntax similar to Pascal arrays: arr[x,y,z]. I think there is no so need to put them into D. But I think it's important to have a official type of them in Phobos, for example implemented as a struct (only the item type and number of dimensions are known at compile time): > > RectangularArray!(T, int ndimensions) I still have my doubt that rectangular arrays can be implemented to full power in the library. Currently, there is a number of essential languages missing to allow a sane syntax: * ranges 'A..B' as standalone expression * following from this: replacement of opSlice by opIndex overloaded for builtin type of range expression * extension of range expression by stride (e.g. 'A..B:C' ) * handling of opDollar '$' for multiple dimensions Beyond this, there are a few issues where I don't see that a library-based solution will ever have the full power of a builtin language feature: * array expressions with clean and easy to use syntax * syntax of type declarations: double[A,B] for static arrays and something like double[[3]] or double[:,:] for dynamic arrays are just far more straightforward and readable than any template-based syntax I could imagine. Ultimately, I think that rectangular arrays are as fundamental of numerical computing as hash-maps and strings for other areas. Moreover, there is a huge number of physicists working with Fortran95 simply because it is about as simple to use as Basic, without classes, let alone templates. The mere thought of having to touch classes or templates would deter many of them to consider D an option.

Norbert Nemec: >I have to give a comment... :-) Thank you. I see you talk about only one of the several types of arrays. > I still have my doubt that rectangular arrays can be implemented to full power in the library. I think all the features you list are *additive* changes for D language. So I don't consider them a problem. One of them, regarding opDollar, is even already scheduled for addition (I think). So I am not so much worried. What worries me are the (little) breaking changes, I have listed some possible ones in a very recent post. > Ultimately, I think that rectangular arrays are as fundamental of numerical computing as hash-maps and strings for other areas. But is D very interested in numerical computing? It has some interest in this field (see the care about Floating point matters. FP matters in D are not portable on very different CPUs as in Ada, but they are deep). You can't add all features in D2, so it needed to prioritize. And seeing how all the changes you suggest are additive ones, I think D2 designers have prioritized the right things :-) For numerical computing D has to learn from Chapel language, I think. It shows solutions that are general and look natural, they aren't little local single-trick hacks (as the .. range syntax of foreach). > Moreover, there is a huge number of physicists working with Fortran95 simply because it is about as simple to use as Basic, without classes, let alone templates. The mere thought of having to touch classes or templates would deter many of them to consider D an option. If some numeric-oriented features will be added to D3, those researchers will use already build abstractions (in the language, std lib or external libs), so they will have only limited need to use templates. Bye, bearophile

> ---------------------- > > The (C) arrays are not the same thing as dynamic arrays of dynamic arrays because: > - Some algorithms are not designed for a triangle where rows may differ in length. Testing that rows are all the same length at runtime wastes time, and if you don't test it then it may cause bugs; > - Sometimes you want to reshape a 2D matrix, so it's useful to store the matrix in contiguous memory; > - Sometimes you need complex slicing, for example removal of some columns and rows from a 2D matrix. > > Recently in D.learn we have discussed this a bit. > > In C# there are both arrays of arrays as in D, and rectangular dynamic arrays, that are manages with a syntax similar to Pascal arrays: arr[x,y,z]. I think there is no so need to put them into D. But I think it's important to have a official type of them in Phobos, for example implemented as a struct (only the item type and number of dimensions are known at compile time): > > RectangularArray!(T, int ndimensions) > Well this might be a good time to mention that I've started to get back to working on that std.matrix proposal for Phobos from a few months back. Unfortunately, I wasn't able to work on it for a few months (personal reasons). However, this may only solve the problem for numeric types. On the other hand, with strategic use of __traits and static if....

On 27/08/2010 9:38 a.m., bearophile wrote: > Arrays are one of the most useful and most efficient data structures for nonfunctional languages. They look simple, but in a low-level language you sometimes need various kinds of them. So they are not so simple. > > In D there are two kinds of built-in arrays: > A) 1D Dynamic arrays on the heap > B) Almost-nD rectangular fixed-sized arrays allocated on the stack by functions (or data segment), they may also go on the heap if they are part of an object (and with placement new they may go everywhere). > > So far in D code I have had need of (and I have seen some people need) some other kinds of arrays, not (well) supported by D: > C) nD rectangular dynamic arrays; > D) Dynamic arrays allocated on the stack; > E) Fixed-sized arrays allocated on the heap; > F) Variable-sized structs allocated on the heap. > > ---------------------- > > The (C) arrays are not the same thing as dynamic arrays of dynamic arrays because: > - Some algorithms are not designed for a triangle where rows may differ in length. Testing that rows are all the same length at runtime wastes time, and if you don't test it then it may cause bugs; > - Sometimes you want to reshape a 2D matrix, so it's useful to store the matrix in contiguous memory; > - Sometimes you need complex slicing, for example removal of some columns and rows from a 2D matrix. > Bearophile Have you got any numbers to back your claim for increased performance for nD rectangular dynamic arrays ? I too, am interested in them, but would like to use them with Tango, and therefore if there was a increase in performance, and therefore a real benefit, then would ask that they be added to the D language, so that all in the D community could benefit. Nick B

Nick B: > Have you got any numbers to back your claim for increased performance for nD rectangular dynamic arrays ? They often don't give a significant performance increase. They increase a bit cache coherence if you scan them by rows, and they waste a bit less memory, but usually this is not the purpose of their existence. (If their row length is always a power of two, you can find the item with a shift and sum, this sometimes is a bit faster than using an array of pointers or a product and sum. But this is not a common case, and unfortunately cache associativity decreases the efficiency of this case). Bye, bearophile

Forums