November 03, 2006
Chris Nicholson-Sauls wrote:
> Bill Baxter wrote:
> 
>> Sean Kelly wrote:
>>
>>> Walter Bright wrote:
>>>
>>>> Sean Kelly wrote:
>>>>
>>>>> I think I agree, but for the sake of argument, how would D handle 'find'  operations on sequences that don't support opIndex?  At some point, a generalizable bookmark is almost necessary for a faithful implementation of some C++ algorithms.  Also, many of these algorithms also require iteration across two sequences simultaneously, which doesn't map very cleanly into foreach.  Consider a substring-style find operation (std::search), for example, where both the pattern and target types do not support opIndex or opSlice.  I might argue that it's a bit silly or unrealistic to desire such an operation, but the C++ algorithm library does support it.
>>>>
>>>>
>>>>
>>>>
>>>> I think the target types will have to support opIndex or opSlice.
>>>
>>>
>>>
>>>
>>> Forgive me for resurrecting a horribly battered equine, but I was thinking about this a bit tonight and I've decided that foreach, foreach_reverse, and array operations are not a sufficient general substitute for C++ iterators.
>>>
>>> For example, let's say I want to compute the set union of two sorted ranges, one of size M, the other of size N.  By iterating across the two ranges simultaneously it is easy to see that the complexity for the operation will be max(M,N).  In C++ this is easily accomplished using forward iterators, and the ranges can be anything from arrays to SQL result sets.
>>>
>>> In D however, there is no way to iterate across multiple ranges simultaneously using foreach, so opIndex must be used instead.  With containers that have a constant-time lookup cost this isn't a big deal, but what if these containers are binary trees?  The complexity for such an operation would be M(log(M)) + N(log(N)).
>>>
>>>  From this it seems clear that foreach is not sufficiently adaptable to meet the needs of all sequential algorithms and opIndex is not a reasonable substitute for all cases where foreach may not be used.  What alternatives do we have?  So far, iterator variants are all I've been able to come up with.
>>
>>
>>
>> It would be sufficient if D had coroutines and opApply were implemented using them.
>>
>> The problem is that once you call an opApply function right now, it has to run to completion.  The only way I can see to have multiple opApply style iterators going simultaneously is if you could truly suspend execution of one opApply temporarily and run another one for an iteration.
>>
>> In other words, if you're going to use the stack and function state to store iterator state, then the only way to have multiple iterators in different states simultaneously is to be able to have multiple stacks active simultaneously, aka coroutines.
>>
>> --bb
> 
> 
> Just a thought I had.  Would it be possible to emulate this behavior with delegate-as-aggregate?  What I'm thinking of (and bear in mind I just rolled out of bed, so I'm at quarter-speed right now ;)) is pass foreach a "driver" delegate that then calls step iterator delegates you've provided.  The driver and step iterators would be responsible for maintaining state, and would probably need a reset case (at least the iterators).
> 
> -- Chris Nicholson-Sauls

I was tinkering around with that.  But the problem is that any time you call an opApply you end up inside a function that basically does something like:

     for (i = 0; i < length; i++) {
        ret = loop_body_dg(my_ith_item[i]);
        if (ret) return ret;   // propagate break/goto out of the loop
     }

There's no way I can see to stop that juggernaut before it's done, short of having it return, and if it returns then it'll lose its state. Conceptually what you need is a 'yield' that keeps the current stack frame alive, but returns to the caller.  Then you could implement something like a zip() whose opApply calls the opApply of each of its arguments, which in turn each run one iteration and yield one value back to the zip().  Zip then packages up the results from one iteration of each of its args into a tuple and yields that back to the foreach block.
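As a sketch of the idea (pure pseudocode: 'yield' and 'zip' are hypothetical, and nothing here compiles in today's D):

```
// PSEUDOCODE: 'yield' suspends this opApply's frame and hands a value
// to whoever resumed it, without losing local state.
int opApply(int delegate(inout int v) dg) {
    for (int i = 0; i < length; i++)
        yield items[i];    // one item per resume
    return 0;
}

// zip()'s opApply would resume each argument's opApply once per step,
// package the two yielded values into a tuple, and yield that upward:
foreach (x, y; zip(a, b)) {
    // x is the next item of a, y the next item of b
}
```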

--bb
November 03, 2006
Sean Kelly wrote:

>  From this it seems clear that foreach is not sufficiently adaptable to meet the needs of all sequential algorithms and opIndex is not a reasonable substitute for all cases where foreach may not be used.  What alternatives do we have?  So far, iterator variants are all I've been able to come up with.

I agree, and I have been saying this countless times. :)

Why don't we come up with a standard iterator concept for D?
For a forward iterator, C++ has:

*i
i++ / ++i
i != end

In the post "Iterators (Was: Re: Lazy eval -- an example issue)" in digitalmars.D ( news://news.digitalmars.com:119/ecmvec$ksl$1@digitaldaemon.com ) I suggested a simple java-style variant:

*i.next()
i.hasNext()

A generic opApply could then be implemented as:

template opApplyMixin() {
  int opApply(int delegate(inout ValueType v) dg) {
    ValueType *t;
    while(hasNext()) {
      t = next();
      if (auto status = dg(*t))
        return status;
    }
    return 0;
  }
}

And a simple array iterator wrapper (Could of course be more efficiently implemented):

struct ArrayForwardIterator(T) {
  T[] arr;
  alias T ValueType;

  T* next() { arr = arr[1..$]; return arr.ptr-1; }
  bool hasNext() { return arr.length > 0; }
  mixin opApplyMixin;
}
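For illustration, here is how the wrapper might be used (a sketch of mine, not from the post; I've transcribed D1's `inout` as `ref` and reproduced the definitions above so the snippet stands alone):

```d
// Oskar's mixin and wrapper, transcribed to current D syntax:
template opApplyMixin() {
    int opApply(int delegate(ref ValueType v) dg) {
        while (hasNext()) {
            ValueType* t = next();
            if (auto status = dg(*t))
                return status;   // propagate break out of the loop
        }
        return 0;
    }
}

struct ArrayForwardIterator(T) {
    T[] arr;
    alias ValueType = T;

    T* next() { arr = arr[1 .. $]; return arr.ptr - 1; }
    bool hasNext() { return arr.length > 0; }
    mixin opApplyMixin;
}

// Hypothetical use (sumAll is my name, not the post's):
int sumAll(int[] data) {
    auto it = ArrayForwardIterator!(int)(data);
    int sum = 0;
    foreach (ref int v; it)   // dispatches to the mixed-in opApply
        sum += v;
    return sum;
}
```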

If pointers should be avoided, and since D doesn't have an opDeref, but has lhs/rhs overloads for indexing, another iterator variant could be to use

i[]
i[]=

for rhs and lhs iterator dereferencing, respectively.

i++ / ++i could be used instead of .next() if it is any clearer, and so on...

It doesn't really matter what convention we pick. For better or worse, D doesn't allow us to do what C++ does and make iterators interchangeable with pointers. We may as well pick anything we like as long as the compiler is able to generate reasonably efficient code out of it.

As I said in the above referenced post, I have been toying with making a template iterator/array algorithm library. I've got a quite substantial proof of concept working, but have not had much time lately. I do love discussions about this though. :)

/Oskar
November 04, 2006
Walter Bright wrote:
> Sean Kelly wrote:
> 
>> Forgive me for resurrecting a horribly battered equine, but I was thinking about this a bit tonight and I've decided that foreach, foreach_reverse, and array operations are not a sufficient general substitute for C++ iterators.
>>
>> For example, let's say I want to compute the set union of two sorted ranges, one of size M, the other of size N.  By iterating across the two ranges simultaneously it is easy to see that the complexity for the operation will be max(M,N).  In C++ this is easily accomplished using forward iterators, and the ranges can be anything from arrays to SQL result sets.
>>
>> In D however, there is no way to iterate across multiple ranges simultaneously using foreach, so opIndex must be used instead.  With containers that have a constant-time lookup cost this isn't a big deal, but what if these containers are binary trees?  The complexity for such an operation would be M(log(M)) + N(log(N)).
>>
>>  From this it seems clear that foreach is not sufficiently adaptable to meet the needs of all sequential algorithms and opIndex is not a reasonable substitute for all cases where foreach may not be used.  What alternatives do we have?  So far, iterator variants are all I've been able to come up with.
> 
> You are correct in regards to looping across multiple aggregates simultaneously.
> 
> On the other hand, C++ iterators aren't a general substitute for opApply. For example, it is not at all easy to turn a recursive data structure (i.e. binary tree) into a linearized iterator.

If I'm not missing the essence here, there are several kinds of iterators, and not every data structure should try to provide all of these. For example, an iterator into a tree structure might be breadth-first or depth-first, but one wouldn't expect a linearized (if I understand the term in this context at all) iterator to exist. Conversely, an array is hardly thinkable without a linear iterator.

What Stepanov explained to me at length (back when I was studying CS and there was not a single book published on the STL) was that the different containers all have their specific sets of "natural" iterators, and that other iterators to them exist only where they are convenient or implementable at all. And that it's not like you have a container and then need a certain iterator to it; rather, you choose the container based on what kind of iterators you need in the first place.

Then it's about the iterator just being a way to reach every item exactly once. And if you don't seem to get them in the order you want, then you sort the items, say into a list. (Or rethink the problem.)

Back to foreach: it's syntactically a very convenient way to go through a trivial collection (i.e. an array or a list). But for non-trivial containers (or data structures), foreach simply is insufficient. You need an iterator. Iterators were invented for the very purpose of walking through a black-box container, if you will. I'd almost say that container-specific iterators are the only way to guarantee the lowest complexity walk-through of an arbitrary black-box container.

That said, it is of course customary to provide "wrong" iterators to containers (where at all possible, as conveniences), for example a random access iterator to a linked list. But then the user should stay aware that using such is only for the cases where the choice of container has already been evaluated to be optimal, and where one only occasionally needs to access some of the items in a specific way. In other words, where this need is seldom enough to not warrant a temporary copy (or linking) of the items into a more suitable container.

One could of course have foreach actually employ iterators. This gives the ease of coding with foreach combined with the flexibility, decoupling and orthogonality offered by iterators.

But, as in the previous post, even this becomes cumbersome when we're iterating over more than one container simultaneously, especially when they're of different sizes or iterating in non-lockstep.

---

In general, the essence and elegance of the STL has not been too prevalent in D libraries or algorithms. With the current expressive power of D, one might expect to soon start seeing some of this.
November 04, 2006
Bill Baxter wrote:
> There's no way I can see to stop that juggernaut before it's done, short of having it return, and if it returns then it'll lose its state. Conceptually what you need is a 'yield' that keeps the current stack frame alive, but returns to the caller.

But then your object can only have one iterator at a time, and what if you want to have multiple simultaneous iterators for any given object? The 'yield' keyword can only keep track of state for a single iterator, so this kind of code would fail:

  foreach (Item outerItem; collection) {
    foreach (Item innerItem; collection) {
      doSomething(outerItem, innerItem);
    }
  }

This should be semantically identical to:

  for (int i = 0; i < collection.length; i++) {
    Item outerItem = collection[i];
    for (int j = 0; j < collection.length; j++) {
      Item innerItem = collection[j];
      doSomething(outerItem, innerItem);
    }
  }

But in the 'foreach' example, the 'yield' implementation would get confused (as would the opApply method).

If iteration were implemented with iterators (instead of with operator overloading and delegates), and if iterators were implemented as nested classes within the container class definitions, then iterators could access the elements directly, while still keeping their own state information. (Such as a pointer to the next iterable tree-node, in an InOrderTraversalIterator. Or whatever.)

For example:

class ArrayList {
  private Item[] items;
  public Iterable getForwardIterator() { ... }
  class ForwardIterator : Iterable {
    private Item* nextItem; // Keeps track of state
    public bool hasNext() { ... }
    public Item next() { ... }
  }
}
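The elided bodies might be filled in along these lines (my sketch, not part of the proposal: `Item` and `Iterable` are stand-ins, and getForwardIterator is public so callers can reach it):

```d
alias Item = int;   // stand-in for whatever the list stores
interface Iterable { bool hasNext(); Item next(); }

class ArrayList {
    private Item[] items;
    this(Item[] i) { items = i; }

    public Iterable getForwardIterator() {
        return new ForwardIterator(items.ptr);
    }

    class ForwardIterator : Iterable {
        private Item* nextItem;   // per-iterator state
        this(Item* start) { nextItem = start; }
        public bool hasNext() {
            // 'items' is reachable through the nested class's implicit
            // reference to its enclosing ArrayList instance
            return nextItem < items.ptr + items.length;
        }
        public Item next() { return *nextItem++; }
    }
}
```

Each call to getForwardIterator() returns an independent iterator with its own state, which is what makes the nested foreach below work without any 'yield'.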

And then I'd iterate through my items like this:

  foreach (Item outerItem; collection.getForwardIterator()) {
    foreach (Item innerItem; collection.getForwardIterator()) {
      doSomething(outerItem, innerItem);
    }
  }

In most cases, if your iterator can keep track of its own state, and if the iterator is a nested class with access to all of its containing class's members, then the 'yield' keyword is unnecessary.

(The most compelling argument for 'yield' is not for the implementation of iterators, but for building generators.)

But please, please reconsider the foreach / foreach_reverse implementations. Forcing a collection class to be aware of its own iterables is fugly as can be.

--benji
November 04, 2006
Benji Smith wrote:
> Bill Baxter wrote:
> 
>> There's no way I can see to stop that juggernaut before it's done, short of having it return, and if it returns then it'll lose its state. Conceptually what you need is a 'yield' that keeps the current stack frame alive, but returns to the caller.
> 
> 
> But then your object can only have one iterator at a time, and what if you want to have multiple simultaneous iterators for any given object? The 'yield' keyword can only keep track of state for a single iterator, so this kind of code would fail:
> 
>   foreach (Item outerItem; collection) {
>     foreach (Item innerItem; collection) {
>       doSomething(outerItem, innerItem);
>     }
>   }

No it wouldn't.  Not if you had support for real coroutines.  That's the whole point of coroutines.  You can have multiple stack frames active at the same time.   Each stack frame for the opApply's in the above case would have its own local iterator state.

Now whether or not coroutines are realistically implementable in a systems programming language like D is beyond me.  Stackless Python folks did it, but perhaps they're taking advantage of the fact that users of interpreted code are a bit more forgiving about runtime overhead, and the fact that the interpreter has its own notion of the stack that doesn't depend on low-level processor details like EAX registers etc.  But technically I don't see any roadblocks.  I mean I read somewhere that coroutines are an old assembly programming trick, so the hardware is certainly capable of supporting the paradigm at the low level.  It's basically just a jmp plus a manipulation of the stack pointer when it comes right down to it.

--bb
November 04, 2006
Georg Wrede wrote:
> Walter Bright wrote:
>> On the other hand, C++ iterators aren't a general substitute for opApply. For example, it is not at all easy to turn a recursive data structure (i.e. binary tree) into a linearized iterator.
> 
> If I'm not missing the essence here, there are several kinds of iterators, and not every data structure should try to provide all of these. For example, an iterator into a tree structure might be breadth-first or depth-first, but one wouldn't expect a linearized (if I understand the term in this context at all) iterator to exist. Conversely, an array is hardly thinkable without a linear iterator.

Given the structure:

class Node
{	Node left;
	Node right;
	...data...
}

how do you iterate that with an iterator? This is what I mean by 'linearizing' it. The STL sidesteps the problem by avoiding providing such a container.
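(For what it's worth, the standard workaround is exactly the "not at all easy" part: the iterator must carry an explicit stack to replace the recursion that opApply gets for free. A rough sketch of mine, assuming the node holds an int and borrowing the hasNext/next names from earlier in the thread:)

```d
class Node {
    Node left;
    Node right;
    int data;   // stand-in for "...data..."
}

struct InOrderIterator {
    Node[] stack;   // explicit stack replaces the call stack of recursion

    static InOrderIterator opCall(Node root) {
        InOrderIterator it;
        it.pushLeftSpine(root);
        return it;
    }
    void pushLeftSpine(Node n) {
        for (; n !is null; n = n.left)
            stack ~= n;   // remember the path down the left edge
    }
    bool hasNext() { return stack.length > 0; }
    int next() {
        Node n = stack[$ - 1];
        stack = stack[0 .. $ - 1];   // pop
        pushLeftSpine(n.right);      // then descend into the right subtree
        return n.data;
    }
}
```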
November 04, 2006
Oskar Linde wrote:
> Sean Kelly wrote:
> 
>>  From this it seems clear that foreach is not sufficiently adaptable to meet the needs of all sequential algorithms and opIndex is not a reasonable substitute for all cases where foreach may not be used.  What alternatives do we have?  So far, iterator variants are all I've been able to come up with.
> 
> 
> I agree, and I have been saying this countless times. :)
> 
> Why don't we come up with a standard iterator concept for D?
> For a forward iterator, C++ has:
> 
> *i
> i++ / ++i
> i != end

One big concern I have with the current iterators is that the approach supported for the general case is deemed too inefficient for built-in arrays.  So arrays are special-cased by the compiler in the foreach logic to become more like a simple for loop.  But I want that speed for foreach-ing my own classes too!  Why should I be satisfied with less than peak performance for *my* classes if it's not considered good enough for the built-in types?

I have to pay this penalty even if my class is a simple wrapper around an array, one that, say, does nothing more than add an opAdd so I can add two arrays together.


Plain old C++-style iterators don't have that problem.  For simple cases iter++ basically compiles down to simple pointer manipulation.
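For comparison, a sketch of that C++-style loop over a D array (`sumByPointer` is my name; the point is just that the compare and increment are plain pointer operations, with no per-item delegate call):

```d
int sumByPointer(int[] arr) {
    int* it  = arr.ptr;                // begin
    int* end = arr.ptr + arr.length;   // one past the last element
    int sum = 0;
    for (; it != end; ++it)            // iter++ is just pointer arithmetic
        sum += *it;
    return sum;
}
```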

--bb
November 04, 2006
Bill Baxter wrote:
> Benji Smith wrote:
>> Bill Baxter wrote:
>>
>>> There's no way I can see to stop that juggernaut before it's done, short of having it return, and if it returns then it'll lose its state. Conceptually what you need is a 'yield' that keeps the current stack frame alive, but returns to the caller.
>>
>>
>> But then your object can only have one iterator at a time, and what if you want to have multiple simultaneous iterators for any given object? The 'yield' keyword can only keep track of state for a single iterator, so this kind of code would fail:
>>
>>   foreach (Item outerItem; collection) {
>>     foreach (Item innerItem; collection) {
>>       doSomething(outerItem, innerItem);
>>     }
>>   }
> 
> No it wouldn't.  Not if you had support for real coroutines.  That's the whole point of coroutines.  You can have multiple stack frames active at the same time.   Each stack frame for the opApply's in the above case would have its own local iterator state.
> 
> Now whether or not coroutines are realistically implementable in a systems programming language like D is beyond me.  Stackless Python folks did it, but perhaps they're taking advantage of the fact that users of interpreted code are a bit more forgiving about runtime overhead, and the fact that the interpreter has its own notion of the stack that doesn't depend on low-level processor details like EAX registers etc.  But technically I don't see any roadblocks.  I mean I read somewhere that coroutines are an old assembly programming trick, so the hardware is certainly capable of supporting the paradigm at the low level.  It's basically just a jmp plus a manipulation of the stack pointer when it comes right down to it.
> 
> --bb

I would direct your attention to StackThreads:

http://assertfalse.com/articles/stFAQ.shtml

Pyd uses this nifty package to wrap a class's opApply function with Python's iteration interface. (Which uses iterator objects with a .next() method.) The magic may be found in this bit of code:

http://www.dsource.org/projects/pyd/browser/trunk/infrastructure/pyd/iteration.d

-- 
Kirk McDonald
Pyd: Wrapping Python with D
http://pyd.dsource.org
November 04, 2006
Bill Baxter wrote:
> Now whether or not coroutines are realistically implementable in a systems programming language like D is beyond me.  Stackless Python folks did it, but perhaps they're taking advantage of the fact that users of interpreted code are a bit more forgiving about runtime overhead, and the fact that the interpreter has its own notion of the stack that doesn't depend on low-level processor details like EAX registers etc.  But technically I don't see any roadblocks.  I mean I read somewhere that coroutines are an old assembly programming trick, so the hardware is certainly capable of supporting the paradigm at the low level.  It's basically just a jmp plus a manipulation of the stack pointer when it comes right down to it.

It's not so easy to implement in the general case (asm code takes advantage of special, easy cases). To make it work, you need to allocate the stack frame of the coroutine on the heap. Not impossible, but a lot of work involved.
November 04, 2006
You're right about that being a problem with opApply.