September 09, 2008
(This is an older message that somehow didn't make it to the group. Resending now.)

Lionello Lunesu wrote:
> 
> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:ga46ok$2s77$1@digitalmars.com...
>> I put together a short document for the range design. I definitely missed about a million things and have been imprecise about another million, so feedback would be highly appreciated. See:
>>
>> http://ssli.ee.washington.edu/~aalexand/d/tmp/std_range.html
> 
> This is just awesome. Thank you for tackling this issue.
> 
> I agree with others that some names are not so obvious. Left/right? How do Arabic speakers feel about this : ) Begin/end seems more intuitive.

I don't know of that particular cultural sensitivity. Begin and end are
bad choices because they'd confuse the heck out of STL refugees. c.left
and c.right are actually STL's c.front() and c.back() or *c.begin() and
c.end()[-1], if you allow the notational abuse. But I sure hope some
good names will come along.

> Can you explain this please. From Input range:
> 
> e=r.getNext Reads an element and moves to the next one. Returns the read element, which is of type ElementType!(R). The call is defined only right after r.isEmpty returned false.
> 
> That last part: The call is defined only right after r.isEmpty returned false.
> 
> First of all, is* functions always sound "const" to me, but the way you describe isEmpty it sounds like it actually changes something, advancing a pointer or something like that. What happens if isEmpty is called twice? Will it skip 1 element?

Excellent question! Gosh, if I had staged this it couldn't have gone any better.

Consider an input range getting ints separated by whitespace out of a
FILE* - something that we'd expect our design to allow easily and
neatly. So then how do I implement isEmpty()? Well, feof(f) is not
really informative at all. Maybe the file has five more spaces and then
it ends. So in order to TELL you that the range is not empty, I actually
have to GO all the way and actually read the integer. Then I can tell
you: yeah, there's stuff available, or no, I'm done, or even throw an
exception if some error happened.

That makes isEmpty non-const. You check for r.isEmpty, it makes sure an
int is buffered inside the range's state. You call r.isEmpty again, it
doesn't do anything because an int is already buffered. You call
r.getNext, the int gets moved off the range's state into your program,
and the internal flag is set telling the range that the buffer's empty.

You call r.getNext without having called r.isEmpty, and the range makes
a heroic effort to fetch another int. If that fails, the range throws an
exception. So in essence the behavior is that you can use isEmpty to
make sure that getNext won't blow up in your face (speaking of Pulp
Fiction...). I think this is all very sensible, and even more sensible
once I give more detail below.
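A minimal sketch of such a range, assuming ints separated by whitespace
in a FILE* (the names IntReader and the C-stdio module path are mine,
not the proposal's; the buffering is exactly the read-ahead described
above):

import core.stdc.stdio : FILE, fscanf;

struct IntReader
{
    FILE* f;
    private int buffer;     // the int read ahead, if any
    private bool available; // is `buffer` filled?

    // Non-const: may have to read ahead to answer truthfully.
    bool isEmpty()
    {
        if (!available)
            available = fscanf(f, " %d", &buffer) == 1;
        return !available;
    }

    // Defined only right after isEmpty returned false; calling it
    // cold makes the "heroic effort" and throws if nothing is there.
    int getNext()
    {
        if (isEmpty())
            throw new Exception("getNext called past end of input");
        available = false; // hand the value to the caller
        return buffer;
    }
}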

> The input range behaves like C#'s IEnumerator, but at least the C# names are more obvious: while (i.MoveNext()) e = i.Current; But isEmpty is common to all ranges, so I understand why it's the way it is. I just hope it could stay "const", not modifying the internal state. Perhaps add "next" to input ranges as well?

This is even better than the previous question. Why not this for an
input range?

for (; !r.isEmpty; r.next) use(r.getNext);

or even

for (; !r.isEmpty; r.next) use(r.left);

thus making input ranges quite similar to forward ranges.

In that design:

a) The constructor fetches the first int

b) isEmpty is const and just returns the "available" internal flag

c) next reads the next int off the file

First I'll eliminate the design with isEmpty/next/getNext as flawed for
a subtle reason: cost of copying. Replace mentally the int above with
something that needs dynamic allocation, such as BigInt. So the range
reads one BigInt off the file, stores it in the internal buffer, and
then the user calls:

for (; !r.isEmpty; r.next)
{
    BigInt my = r.getNext();
    // ...
}

Since in one iteration there's only one BigInt, not two, I'd need to do
a destructive transfer in getNext() that "moves" the state of BigInt
from the range to my, leaving r's state empty. (This feature will be
available in D2 soon.) But then what if somebody calls r.getNext()
again? Well I don't have the data anymore, so I need to issue a next().
So I discover next was not needed in the first place.

Hope everything is clear so far. Now let's discuss the isEmpty/next/left
design.

That design is also flawed, just in a different way. The range holds the
BigInt inside, but r.left gives the client a *reference* to it. This
looks cool! There is no more extra copying and everything works smoothly.

In fact, though, this design perpetrates a lie. It lies about giving a
reference to an element (the same way a "real" container does), but that
element will be overwritten every time r.next is called, UNBEKNOWNST TO
THE CLIENT.
So consider the algorithm:

void bump(R)(R r)
{
    for (; !r.isEmpty(); r.next) ++r.left;
}

You pass an input range to bump and it compiles and executes to
something entirely nonsensical. This is an obvious misuse, but as I'm
sure you know it gets real confusing real fast.

A possible argument is to have left return a ref const(BigInt), but then
we lose the ability to transfer its value into our state.

What we want is a design that tells the truth. And a design that tells
the truth is this:

r.isEmpty does whatever the hell it takes to make sure whether there's
data available or not. It is not const and it could throw an exception.

v = r.getNext returns BY VALUE by means of DESTRUCTIVE COPY data that
came through the wire, data that the client now owns as soon as getNext
returned. There is no extra copy, no extra allocation, and the real
thing has happened: data has been read from the outside and user code
was made the only owner of it.
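The destructive-transfer part can be sketched as follows. Reader!(T)
and fetchOrThrow are hypothetical names of mine; assume a move
primitive that transfers an object's state and leaves its source in
the empty .init state, as promised above for D2:

import std.algorithm : move; // assumed available

struct Reader(T)
{
    private T buffer;        // the element read ahead by isEmpty
    private bool available;  // is `buffer` filled?

    // Hypothetical: read the next element into buffer, or throw.
    private void fetchOrThrow() { /* ... */ }

    T getNext()
    {
        if (!available)
            fetchOrThrow();
        available = false;
        return move(buffer); // buffer is left empty; the caller now owns the data
    }
}

No second copy of T is ever created: the one object that came in from
the outside ends up, and ends up only, in the caller's hands.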


Andrei
September 09, 2008
On Tue, 09 Sep 2008 04:37:41 +0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> Denis Koroskin wrote:
>> 1) There is a typo:
>>  // Copies a range to another
>> void copy(R1, R2)(R1 src, R2 tgt)
>> {
>>     while (!src.isEmpty)
>>     {
>>         tgt.putNext(r.getNext);  // should be tgt.putNext(src.getNext);
>>     }
>> }
>
> Thanks! Fixed.
>
>> 2) R.next and R.pop could have better names. I mean, they represent similar operations yet names are so different.
>
> I agree. Next was a natural choice. I stole pop from Perl. Any symmetric and short operation names would be welcome.
>

1) R.left += n / R.right -= n
2) R.left.advance(n) / R.right.advance(n) (or move)
3) R.advanceLeft(n)/R.advanceRight(n) (or moveLeft/moveRight)

>> 3) Walter mentioned that built-in array could be re-implemented using a pair of pointers instead of ptr+length. Will it ever get a green light? It fits range concept much better.
>
> Walter told me to first implement my design, and if it works, he'll do the change. Yes, it does fit ranges much better because the often-used next and, um, pop will only touch one word instead of two.
>
>> 4) We need some way of supporting dollar notation in user containers. The hack of using __dollar is bad (although it works).
>
> It doesn't work for multiple dimensions. There should be an opDollar(uint dim) that tells the library in which argument position the dollar occurred. Consider:
>
> auto x = matrix[$-1, $-1];
>
> Here the dollar's occurrences have different meanings. A good start would be to expand the above into:
>
> auto x = matrix[matrix.opDollar(0)-1, matrix.opDollar(1)-1];
>
>> 5) I don't quite like names left and right! :) I think they should represent limits (pointers to begin and end, in case of array) rather that values. In this case, built-in arrays could be implemented as follows:
>>  struct Array(T)
>> {
>>     T* left;
>>     T* right;
>>     size_t length() { return right-left; }
>>     ref T opIndex(size_t index) { return left[index]; }
>>     // etc
>> }
>>  The rationale behind having access to range limits is to allow operations on them. For example,
>> R.left-=n;
>
> I disagree. Defining operations on range limits opens a box that would make Pandora jealous:
>
> 1. What is the type of left in general? Um, let's define Iterator!(R) for each range R.
>
Yes, it should be an iterator.

> 2. What are the primitives of an iterator? Well, -= sounds good. How do you check it for correctness? In fact, how do you check any operation of a naked iterator for correctness?
>

First of all, left is a forward iterator and right is a backward iterator.
That's why left might support ++, += and = while right supports --, -= and =.
Note that we could rename ++ to advance() to get uniform naming.

In many cases it is desirable to store an iterator, and to set an iterator to one of the range bounds.

> 3. I want to play with some data. What should I use here, ranges or iterators? ...
>

I don't think that ranges replace iterators (yet?). To myself, I define a range as a pair of iterators.

For example, what is the equivalent D code of the following:

for (iterator it = ..., end = list.end(); it != end; ) {
    if (predicate(*it)) {
        it = list.erase(it);
    } else {
        ++it;
    }
}

Range range = vector.all;
while (!range.isEmpty) {
    if (predicate(range.left)) {
        ???
    } else {
        range.next;
    }
}

An iterator solution would be as follows:
range.left = list.erase(range.left);

I took a list for the sake of simplicity, because erasing an element doesn't invalidate the end. In general, however, both left and right would need to be updated. A good solution would be to return a new range, like this:

Range range = ...;
range = container.erase(???); // but what should be here?

maybe this:
container.erase(range); // erase the whole range; no need to return anything because the result would be empty anyway
range = container.eraseFirstN(range, n); // erase the first n elements of the range and return the rest
range = container.eraseLastN(range, n);  // the same, but from the other end

I don't say that iterators magically solve everything; I'm merely trying to find the problematic places.
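For what it's worth, an erase that hands back the remaining range makes
the earlier loop expressible without naked iterators. A hypothetical
sketch (eraseFirst is an assumed container primitive, not part of the
proposal):

Range range = container.all;
while (!range.isEmpty)
{
    if (predicate(range.left))
        range = container.eraseFirst(range); // remove range.left, get back the rest
    else
        range.next;
}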

> Much of the smarts of the range design is that it gets away WITHOUT having to answer embarrassing questions such as the above. Ranges are rock-solid, and part of them being rock-solid is that they expose enough primitives to be complete, but at the same time do not expose dangerous internals.
>
>> could be used instead of
>> foreach(i; 0..n) {
>>     R.pop();
>> }
>>  which is more efficient in many cases.
>
> Stop right there. That's not a primitive. It is an algorithm that gets implemented in terms of a primitive. I disagree that such an algorithm is an operator and does not have a name such as popN.
>
Okay, but you didn't mention popN() in the paper :)
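A popN along the lines Andrei describes, i.e. an algorithm implemented
on top of the primitives with room for specialization, might look like
this (a sketch of mine, not the paper's code):

// Generic version: n calls to the pop primitive.
void popN(R)(ref R r, size_t n)
{
    foreach (i; 0 .. n)
        r.pop();
}

// Hypothetical O(1) overload for built-in slices
// (pop trims from the right end, as in the proposal):
void popN(T)(ref T[] r, size_t n)
{
    r = r[0 .. $ - n];
}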

>> Other that that - great, I like it.
>
> Thanks for your comments.
>
>
> Andrei

Thank YOU!
September 09, 2008
Andrei Alexandrescu Wrote:

> What we want is a design that tells the truth. And a design that tells the truth is this:
> 
> r.isEmpty does whatever the hell it takes to make sure whether there's data available or not. It is not const and it could throw an exception.
> 
> v = r.getNext returns BY VALUE by means of DESTRUCTIVE COPY data that came through the wire, data that the client now owns as soon as getNext returned. There is no extra copy, no extra allocation, and the real thing has happened: data has been read from the outside and user code was made the only owner of it.

this is really kool n the gang. there's a sore point tho. if i wanna read strings from a file no prob.

for (auto r = stringSucker(stdin); !r.isEmpty(); )
{
    string s = r.getNext();
    // play with s
}

but a new string is allocated every line. that's safe but slow. so i want some more efficient stuff. i should use char[] because string don't deallocate.

for (auto r = charArraySucker(stdin); !r.isEmpty(); )
{
    char[] s = r.getNext();
    // play with s
}

no improvement. same thing a new char[] is alloc each time. maybe i could do

for (auto r = charArraySucker(stdin); !r.isEmpty(); )
{
    char[] s = r.getNext();
    // play with s
    delete s;
}

would this make stuff faster. maybe. maybe not. and it's not general. what i'd like is some way of telling the range, i'm done with this you can recycle and reuse it. it's a green green world.

for (auto r = charArraySucker(stdin); !r.isEmpty(); )
{
    char[] s = r.getNext();
    // play with s
    r.recycle(s);
}

sig is recycle(ref ElementType!(R)).
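The recycle idea could be sketched like this; charArraySucker and
recycle are the poster's hypothetical names, and the file-reading
details are elided:

struct CharArraySucker
{
    // ... file handle and read-ahead state elided ...
    private char[] spare; // buffer donated back by the client, if any

    char[] getNext()
    {
        char[] line = spare.length ? spare : new char[128];
        spare = null;
        // ... read the next line into `line`, growing it as needed ...
        return line;
    }

    // Client says: I'm done with this; reuse the storage.
    void recycle(ref char[] s)
    {
        spare = s;
        s = null; // the range owns the buffer again
    }
}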
September 09, 2008
Andrei Alexandrescu wrote on September 8 at 19:24:
> 1.5) Allow object ownership but also make the behavior of incorrect code well-defined so it can be reasoned about, reproduced, and debugged.
> 
> That's why I think an Array going out of scope should invoke destructors for its data, and then obliterate contents with ElementType.init. That way, an Array!(File) will properly close all files AND put them in a "closed" state. At the same time, the memory associated with the array will NOT be deallocated, so a range surviving the array will never crash unrelated code, but instead will see closed files all over.

Why is it so bad that the program crashes if you do something wrong? For how long will the memory be kept "alive"; will it use "regular" GC semantics (i.e., freed once nobody points at it anymore)? In that case, wouldn't letting the programmer leave dangling pointers to data that should be "dead", without crashing, make it easier to introduce memory leaks?

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Did you know that originally a Danish guy invented the burglar alarm? Unfortunately, it got stolen
September 09, 2008
Leandro Lucarella Wrote:

> Andrei Alexandrescu wrote on September 8 at 19:24:
> > 1.5) Allow object ownership but also make the behavior of incorrect code well-defined so it can be reasoned about, reproduced, and debugged.
> > 
> > That's why I think an Array going out of scope should invoke destructors for its data, and then obliterate contents with ElementType.init. That way, an Array!(File) will properly close all files AND put them in a "closed" state. At the same time, the memory associated with the array will NOT be deallocated, so a range surviving the array will never crash unrelated code, but instead will see closed files all over.
> 
> Why is it so bad that the program crashes if you do something wrong?

it's not bad. it's good if it crashes. problem is when it don't crash and continues running on oil instead of gas if you see what i mean.

> For how long will the memory be kept "alive"; will it use "regular" GC
> semantics (i.e., freed once nobody points at it anymore)? In that case,
> wouldn't letting the programmer leave dangling pointers to data that
> should be "dead", without crashing, make it easier to introduce memory
> leaks?

such is the peril of gc. clearly meshing scoping with gc ain't gonna be perfect. but i like the next best thing. scarce resources are deallocated quick. memory stays around for longer. no dangling pointers.
September 09, 2008
On Tue, 09 Sep 2008 07:06:55 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> Robert Jacques wrote:
>> On Mon, 08 Sep 2008 23:53:17 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> Robert Jacques wrote:
>>>> On Mon, 08 Sep 2008 20:37:41 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>>>> Denis Koroskin wrote:
>>>>>> 3) Walter mentioned that built-in array could be re-implemented using a pair of pointers instead of ptr+length. Will it ever get a green light? It fits range concept much better.
>>>>>
>>>>> Walter told me to first implement my design, and if it works, he'll do the change. Yes, it does fit ranges much better because the often-used next and, um, pop will only touch one word instead of two.
>>>>  I'd warn that changing away from ptr+length would create logical inconsistencies between 1D arrays and 2D/3D/ND arrays.
>>>
>>> How so?
>>  An ND array is typically defined as a fat pointer like so:
>> struct array(T,size_t N) {
>>     T* ptr;
>>     size_t[N]    lengths;      // of each dimension
>>     ptrdiff_t[N] byte_strides; // of each dimension
>> }
>> So a 1D array is
>> {
>>     T* ptr;
>>     size_t lengths;
>>     ptrdiff_t byte_strides = T.sizeof; //Currently a compile time constant in the built-in array
>>     size_t length() { return lengths; }
>> }
>> which is logically consistent with a general dense matrix and aside from some name change and the stride being a compile time constant, is identical to the current D arrays.
>> However, { T* first; T* last } may not be logically extended to ND arrays, particularly sliced ND arrays, as T* last no longer has any meaning.
>
> Hmmm, I see. That could become a problem if we wanted lower-dimensional matrices to be prefixes of higher-dimensional matrices. This is a worthy goal, but one that my matrices don't pursue.
>
>>>>>> 4) We need some way of supporting dollar notation in user containers. The hack of using __dollar is bad (although it works).
>>>>>
>>>>> It doesn't work for multiple dimensions. There should be an opDollar(uint dim) that tells the library in which argument position the dollar occurred. Consider:
>>>>>
>>>>> auto x = matrix[$-1, $-1];
>>>>>
>>>>> Here the dollar's occurrences have different meanings. A good start would be to expand the above into:
>>>>>
>>>>> auto x = matrix[matrix.opDollar(0)-1, matrix.opDollar(1)-1];
>>>>  I'd also add that multiple dimension slicing should be supported. i.e.
>>>> auto x = matrix[2..5,0..$,3]
>>>> would become
>>>> auto x = matrix.opSlice(Slice!(size_t)(2,5),Slice!(size_t)(0,matrix.opDollar(0)),3) with
>>>> struct Slice (T) { T start; T end; }
>>>> Strided slices would also be nice. i.e. matrix[0..$:10] // decimate the array
>>>
>>> Multidimensional slicing can be implemented with staggered indexing:
>>>
>>> matrix[2..5][0..$][3]
>>  Yes, but doing so utilizes expression templates and is relatively slow:
>> matrix_row_slice temp1 = matrix.opSlice(2,5);
>> matrix_col_slice temp2 = temp1.opSlice(0,$);
>> matrix                 = temp2.opIndex(3);
>>  And causes code bloat. Worse, matrix[2..5] by itself would be an unstable type. Either foo(matrix[2..5]) would not compile, or it would generate code bloat and hard-to-find logic bugs. (Due to the fact that you've embedded the dimension of the slice operation into the type.)
>
> What is an unstable type?

What I meant is that the type is fundamentally not designed to exist by itself, and therefore, if not paired with the other slices, you've put your program into a dangerous state.

> There is no use of expression templates,

So what would you call embedding an operation into a very temporary type that's not expected to last beyond the line of its return?

> but indeed multiple slices are created. This isn't as bad as it seems because the desire was to access several elements, so the slice is supposed to be around for long enough to justify its construction cost.

True, but it's still less efficient.

> I agree it would be onerous to access a single element with e.g. matrix[1][1][2].

And what about the code bloat and compile time or runtime logic bugs that this design is prone to produce?

>>> means: first, take a slice 2..5 that returns a matrix range one dimension smaller. Then, for that type take a slice from 0 to $. And so on.
>>>
>>> This works great for row-wise storage. I'm not sure how efficient it would be for other storage schemes.
>>  No it doesn't. It works great for standard C arrays of arrays, but these are not matrices and have a large number of well-documented performance issues when used as such. In general, multi-dimensional data structures are relatively common and should be cleanly supported.
>
> [Citation needed]

Imperfect C++: Practical Solutions for Real-Life Programming by Matthew Wilson has an entire chapter dedicated to the performance of matrices in C++, comparing Boost, the STL, plain arrays and a few structures of his own. Also, arrays of arrays don't support O(1) slicing, resizing, or creation, and they use more memory, etc.

As for multi-dimensional data structures, they are used heavily in games, graphics, scientific computing, and databases, and I've probably forgotten some others.

September 09, 2008
Andrei Alexandrescu wrote on September 9 at 07:47:
> Lionello Lunesu wrote:
> >"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:ga46ok$2s77$1@digitalmars.com...
> >>I put together a short document for the range design. I definitely missed about a million things and have been imprecise about another million, so feedback would be highly appreciated. See:
> >>
> >>http://ssli.ee.washington.edu/~aalexand/d/tmp/std_range.html
> >This is just awesome. Thank you for tackling this issue.
> >I agree with others that some names are not so obvious. Left/right? How do
> >Arabic speakers feel about this : ) Begin/end seems more intuitive.
> 
> I don't know of that particular cultural sensitivity. Begin and end are
> bad choices because they'd confuse the heck out of STL refugees. c.left
> and c.right are actually STL's c.front() and c.back() or *c.begin() and
> c.end()[-1], if you allow the notational abuse. But I sure hope some
> good names will come along.

I think STL refugees can deal with it. There is no point in continuing to compromise D's readability for the sake of C/C++ (you just mentioned enum, another really bad choice made to keep C/C++ refugees happy).

I find left and right a little obscure too; it all depends on your mental image of a range. Using front/back or begin/end is much clearer.

September 09, 2008
Andrei Alexandrescu wrote on September 8 at 22:43:
> I like the alternate names quite some. One thing, however, is that head and rear are not near-antonyms (which ideally they should be). Maybe front and rear would be an improvement. (STL uses front and back). Also,

What about head/tail? You certainly won't confuse STL refugees and functional guys will be at home ;)

September 09, 2008
Leandro Lucarella wrote:
> Andrei Alexandrescu wrote on September 8 at 22:43:
>> I like the alternate names quite some. One thing, however, is that head
>> and rear are not near-antonyms (which ideally they should be). Maybe
>> front and rear would be an improvement. (STL uses front and back). Also,
> 
> What about head/tail? You certainly won't confuse STL refugees and
> functional guys will be at home ;)

You'll sure confuse the latter. To them, tail is everything except the head, e.g. a[1 .. $].

Andrei
September 09, 2008
Andrei Alexandrescu wrote:
> What's the deal with destroyed but not deleted? Consider:
> 
> int[] a;
> if (condition) {
>    Array!(int) b;
>    a = b.all;
> }
> writeln(a);
> 
> This is a misuse of the array in that a range crawling on its back has survived the array itself. What should happen now? Looking at other languages:
> 
> 1) All Java objects are unowned, meaning the issue does not appear in the first place, which is an advantage. The disadvantage is that scarce resources must be carefully managed by hand in client code.
> 
> 2) C++ makes the behavior undefined because it destroys data AND recycles memory as soon as the array goes out of scope. Mayhem ensues.
> 
> We'd like:
> 
> 1.5) Allow object ownership but also make the behavior of incorrect code well-defined so it can be reasoned about, reproduced, and debugged.
> 
> That's why I think an Array going out of scope should invoke destructors for its data, and then obliterate contents with ElementType.init. That way, an Array!(File) will properly close all files AND put them in a "closed" state. At the same time, the memory associated with the array will NOT be deallocated, so a range surviving the array will never crash unrelated code, but instead will see closed files all over.
> 

I don't think this is a good thing, for reasons similar to the Error/Exception flaw - specifically, code that works in debug mode might end up failing in release mode.

To explain what I mean by Error/Exception flaw, consider the case of an array out of bounds error, wrapped carelessly in a try .. catch (Exception) block.

This would work fine in debug mode, and presumably retry the operation until it succeeded.

In release mode, however, the above would crash.

This is clearly undesirable, and arises directly from the fact that Error is derived from Exception, not the other way around or completely separate (as it clearly should be).

After all, an Error ![is] an Exception, since Exceptions are clearly defined as recoverable errors, and the set of unrecoverable errors is obviously not a subset of the recoverable ones.

This leads to my actual point: I suggest an extension of .init: the .fail state, indicating data that should not be accessed.

Any standard library function that encounters data that is intentionally in the .fail state should throw an Error.

For instance, the .fail state for strings could be a deliberately invalid UTF8 sequence.

When this state could reasonably come up in normal operations, it is recommended to use values that will readily be visible in a debugger, such as the classical 0xDEADBEEF.

This is imnsho superior to using .init to fill this memory, which doesn't tell the debugging programmer much about what exactly happened, and furthermore, might cause the program to treat "invalid" memory the same as "fresh" memory, if only by accident.
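A rough sketch of what the proposed .fail state could look like for
strings. This is entirely hypothetical: D has no .fail property, and
failByte, markFailed and isFailState are names I made up:

enum ubyte failByte = 0xFF; // 0xFF can never occur in valid UTF-8

// Obliterate a buffer into the hypothetical .fail state.
void markFailed(char[] s)
{
    (cast(ubyte[]) s)[] = failByte;
}

// Standard library functions could cheaply probe for the state
// and throw an Error on access.
bool isFailState(in char[] s)
{
    if (s.length == 0) return false;
    foreach (b; cast(const(ubyte)[]) s)
        if (b != failByte) return false;
    return true;
}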

 --downs