July 31, 2009

> Andrei Alexandrescu wrote:
>> It would be great if you could contribute to Phobos. Two things I hope
>> from any replacement (a) works with ranges and ideally outputs ranges,
>> (b) uses alias functions instead of delegates if necessary.

There's really only one sane way to map XML parsing to ranges: pull parsing, which is more or less already a range.  For those unfamiliar with it, this is how you use Tango's pull parser right now:

auto pp = new PullParser!(char)(xmlSource);

for( auto tt = pp.next; tt != XmlTokenType.Done; tt = pp.next )
{
    switch( tt )
    {
        case XmlTokenType.Attribute: ... break;
        case XmlTokenType.CData: ... break;
        case XmlTokenType.Comment: ... break;
        case XmlTokenType.Data: ... break;
        ...
        case XmlTokenType.StartElement: ... break;
        default: assert(false, "wtf?");
    }
}

This would fairly naturally map to a range of parsing events and look something like:

foreach( event ; new PullParser!(char)(xmlSource) )
{
    switch( event.type )
    {
        /* again with the cases */
    }
}
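
(A minimal sketch of the idea in present-day D, not Tango's actual API. The EventRange and Event types and the toy tokenizer are assumptions made for illustration: it only understands <tag>, </tag> and text, and exists purely to show the empty/front/popFront shape that foreach relies on.)

```d
import std.stdio : writeln;
import std.string : indexOf;

enum XmlTokenType { StartElement, EndElement, Data }

struct Event
{
    XmlTokenType type;
    const(char)[] value;   // a slice of the source, nothing is copied
}

// Hypothetical toy pull tokenizer exposed as an input range.
struct EventRange
{
    private const(char)[] rest;
    private Event current;
    private bool done;

    this(const(char)[] source)
    {
        rest = source;
        popFront();        // parse the first event up front
    }

    @property bool empty() { return done; }
    @property Event front() { return current; }

    void popFront()
    {
        if (rest.length == 0) { done = true; return; }

        if (rest[0] == '<')
        {
            auto close = rest.indexOf('>');
            assert(close > 0, "unterminated tag");
            if (rest[1] == '/')
                current = Event(XmlTokenType.EndElement, rest[2 .. close]);
            else
                current = Event(XmlTokenType.StartElement, rest[1 .. close]);
            rest = rest[close + 1 .. $];
        }
        else
        {
            auto open = rest.indexOf('<');
            size_t end = open < 0 ? rest.length : cast(size_t) open;
            current = Event(XmlTokenType.Data, rest[0 .. end]);
            rest = rest[end .. $];
        }
    }
}

void main()
{
    foreach (event; EventRange("<a>hi</a>"))
    {
        final switch (event.type)
        {
            case XmlTokenType.StartElement: writeln("start ", event.value); break;
            case XmlTokenType.EndElement:   writeln("end ", event.value);   break;
            case XmlTokenType.Data:         writeln("data ", event.value);  break;
        }
    }
}
```

The parsing work happens in popFront, so each event is produced only when the loop asks for it.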

Of course, most people HATE this method because it requires you to write mountains of boilerplate code.  Pity, then, it's also the fastest and most flexible.  :P  (It's a pity D doesn't have extension methods since then you could probably do something along the lines of LINQ to make the whole thing utterly painless... but then, I've given up on waiting for that.)

This is basically the only way to map xml parsing to ranges.  As for CONSUMING ranges, I think that'd be a bad idea for the same reason basing IO entirely on ranges is a bad idea.

The only other use for ranges I can think of is one already mentioned by Benji: traversal of a DOM.  Ranges don't apply to SAX because that's what pull parsing is. :D

To Andrei: I sometimes worry that your... enthusiasm for ranges is going
to leave us with range-based APIs that don't make any sense or are
horribly slow (IO in particular has me worried).  But then, I suppose
that also makes you the perfect person to figure out where they CAN be used.

Plus, that way it's your fault if it doesn't work out.  :P
July 31, 2009
== Quote from Daniel Keep (daniel.keep.lists@gmail.com)'s article

> This is basically the only way to map xml parsing to ranges.  As for
> CONSUMING ranges, I think that'd be a bad idea for the same reason
> basing IO entirely on ranges is a bad idea.
> The only other use for ranges I can think of is one already mentioned by
> Benji: traversal of a DOM.  Ranges don't apply to SAX because that's
> what pull parsing is. :D
I agree. Network IO blocks often, and when it does the program pauses until some data arrives. If I want to write a non-blocking HTTP server that runs on a single thread, range-based IO will block it too, so the other sockets that already have data ready will be stalled as well.
July 31, 2009
Daniel Keep wrote:
> 
>> Andrei Alexandrescu wrote:
>>> It would be great if you could contribute to Phobos. Two things I hope
>>> from any replacement (a) works with ranges and ideally outputs ranges,
>>> (b) uses alias functions instead of delegates if necessary.
> 
> There's really only one sane way to map XML parsing to ranges: pull
> parsing, which is more or less already a range.  For those unfamiliar
> with it, this is how you use Tango's pull parser right now:
[snip]
> This would fairly naturally map to a range of parsing events and look
> something like:
> 
> foreach( event ; new PullParser!(char)(xmlSource) )
> {
>     switch( event.type )
>     {
>         /* again with the cases */
>     }
> }

Looks great. The network I/O could run separately too, in a producer/consumer fashion.

> Of course, most people HATE this method because it requires you to write
> mountains of boilerplate code.  Pity, then, it's also the fastest and
> most flexible.  :P  (It's a pity D doesn't have extension methods since
> then you could probably do something along the lines of LINQ to make the
> whole thing utterly painless... but then, I've given up on waiting for
> that.)
> 
> This is basically the only way to map xml parsing to ranges.  As for
> CONSUMING ranges, I think that'd be a bad idea for the same reason
> basing IO entirely on ranges is a bad idea.

Interesting. Could you please give more details about this? Why is range-based I/O a bad idea, and what can we do to make it a better one?

And what's the way that avoids writing boilerplate code but is slower? Is that the method that calls virtual functions (or delegates) upon each element received?

> The only other use for ranges I can think of is one already mentioned by
> Benji: traversal of a DOM.  Ranges don't apply to SAX because that's
> what pull parsing is. :D

Yah, I was thinking of the DOM traversal too. Yum.

> To Andrei: I sometimes worry that your... enthusiasm for ranges is going
> to leave us with range-based APIs that don't make any sense or are
> horribly slow (IO in particular has me worried).  But then, I suppose
> that also makes you the perfect person to figure out where they CAN be used.
> 
> Plus, that way it's your fault if it doesn't work out.  :P

I don't think you need to worry about my doing anything that's inherently slow. Performance is a big concern where I'm coming from (and also where I'm going to; incidentally all job offers I've been getting are in high-performance computing). As for excessive enthusiasm, that's always a risk but I'm sure I'll hear from you all if I lose my bearings.


Andrei
July 31, 2009
Andrei Alexandrescu wrote:
> Daniel Keep wrote:
>> ...

>> Of course, most people HATE this method because it requires you to write mountains of boilerplate code.  Pity, then, it's also the fastest and most flexible.  :P  (It's a pity D doesn't have extension methods since then you could probably do something along the lines of LINQ to make the whole thing utterly painless... but then, I've given up on waiting for that.)
>>
>> This is basically the only way to map xml parsing to ranges.  As for CONSUMING ranges, I think that'd be a bad idea for the same reason basing IO entirely on ranges is a bad idea.
> 
> Interesting. Could you please give more details about this? Why is range-based I/O a bad idea, and what can we do to make it a better one?

(A clarification: I *should* have said "...basing IO entirely on ranges
is -probably- a bad idea".)

<rambling>

My concern is the interface.

Let's take a hypothetical input range that reads from a file.  Since we're parsing XML, we want it to be character data.  So the interface might look something like:

struct Stream(T)
{
    T front();
    bool empty();
    void next();
}

(I realise I probably got at least one name wrong; I can't be bothered digging up the exact names, and it's irrelevant anyway :P)

My concern is that front returns T: a single character.

I wrote an archival tool many, many years ago in VB.  It worked by reading and writing a single byte at a time, and naturally performed shockingly.  I knew there had to be a faster way, since other programs didn't crawl like mine did, and discovered that reading/writing in larger blocks gave significantly better performance. [1]

Much of the performance of Tango's IO system (and of the XML parsing
code, too) comes from operating on big arrays wherever it can.  Hell, the
pull parser is, as far as anyone is able to tell, faster than every
other XML parser in existence specifically because it reads the whole
file in one IO operation and then just deals with slices and array access.

That's one half of my worry with this: that the range interface specifically precludes efficient batch operations.

Another, somewhat smaller concern, is that the range interface is back-to-front for IO.

Consider a stream: you don't know if the stream is empty until you attempt to read past the end of it.  Standard input does this, network sockets do this... probably others.

But the range interface asks "is this empty?", which you can't answer until you attempt to read from it.  So to implement .empty for a hypothetical stdin range, you'd need to try reading past the current location.  If you get a character, you've just modified the underlying stream.

(Actually, this is more of a concern for me in any situation where computing the next element of a range is an expensive operation, or an operation with side-effects.  I had the same issue when attempting to bind coroutines to the opApply interface.  You had to eagerly compute the next value in order to answer the question: is there a next element?)

Maybe these won't turn out to be problems in practice.  But my gut feeling is that IO would be better served by a Tango-style interface (putting the emphasis on efficient block transfers), with ranges wrapping that if you're willing to maybe take a performance hit.

</rambling>

Just my exceedingly verbose AU$0.02.

> And what's the way that avoids writing boilerplate code but is slower? Is that the method that calls virtual functions (or delegates) upon each element received?

(Deleted lots of rambling)

The problem with calling a delegate for every element received is that all the interfaces that do this suck.  SAX is the prime example of this.

Looking at stuff like Rx
(http://themechanicalbride.blogspot.com/2009/07/introducing-rx-linq-to-events.html),
I'm convinced there must be a way of doing it WELL.  I just don't know
what it is yet.

> ...


[1] I learned so much more back then when I had NO idea what I was doing, and thus made lots of mistakes.  Sadly, I have a strong physical aversion to making mistakes, so now I don't take risks.  And because I know I know I don't like taking risks, I can't trick myself into taking them.  Curse my endlessly recursive consciousness!
July 31, 2009
On 2009-07-30 22:42:29 -0400, Benji Smith <dlanguage@benjismith.net> said:

>> Michael Rynn wrote:
>>> I did look at the code for the xml module, and posted a suggested bug
>>> fix to the empty elements problem. I do not have access rights to
>>> updating the source repository, and at the time was too busy for this.
> 
> Andrei Alexandrescu wrote:
>> It would be great if you could contribute to Phobos. Two things I hope from any replacement (a) works with ranges and ideally outputs ranges, (b) uses alias functions instead of delegates if necessary.
> 
> Interesting. Most XML parsers either produce a "Document" object, or they just execute SAX callbacks. If an XML parser returned a range object, how would you use it?
> 
> Usually, I use something like XPath to extract information from an XML doc. Something like this:
> 
>     auto doc = parser.parse(xml);
>     auto nodes = doc.select("/root//whatever[0][@id]");
> 
> I can see how you might do depth-first or breadth-first traversal of the DOM tree, or inorder traversal of the SAX events, with a range. But that's not how most people use XML. Are there other range tricks up your sleeve that would support a DOM or XPath kind of model?

A range is mostly a list of things. In the example above, doc.select could return a range to lazily evaluate the query instead of computing the whole query and returning all the elements. This way, if you only care about the first result you just take the first and don't have to compute them all.
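
(A minimal sketch of that laziness in D. The Node type and the hard-coded node list are assumptions made for illustration; a real doc.select would walk a document instead. The counter only exists to show how little work is done when you ask for just the first match.)

```d
import std.algorithm : filter;

// Hypothetical flattened node record, for illustration only.
struct Node { string name; string id; }

void main()
{
    auto nodes = [Node("a", "1"), Node("whatever", "2"),
                  Node("b", "3"), Node("whatever", "4")];
    int inspected;   // how many nodes the predicate has looked at

    // filter returns a lazy range: nothing past the first match is examined
    // until the caller advances the range.
    auto hits = nodes.filter!((n) { ++inspected; return n.name == "whatever"; });

    assert(hits.front.id == "2");
    assert(inspected == 2);   // nodes "3" and "4" were never touched
}
```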

Ranges can be used everywhere there are lists, and are especially useful for lazy lists that compute things as you go. I made an XML tokenizer (similar to Tango's pull parser) with a range API. Basically, you iterate over various kinds of tokens made available through an Algebraic, and as you advance it parses the document to get you the next token. (It'd be more useful if you could switch on various kinds of tokens with an Algebraic -- right now you need to use "if (token.peek!OpenElementToken)" -- but that's a problem with Algebraic that should get fixed I believe, or else I'll have to use something else.)

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

July 31, 2009
On Fri, 31 Jul 2009 02:26:34 -0400, Daniel Keep <daniel.keep.lists@gmail.com> wrote:

> Another, somewhat smaller concern, is that the range interface is
> back-to-front for IO.
>
> Consider a stream: you don't know if the stream is empty until you
> attempt to read past the end of it.  Standard input does this, network
> sockets do this... probably others.
>
> But the range interface asks "is this empty?", which you can't answer
> until you attempt to read from it.  So to implement .empty for a
> hypothetical stdin range, you'd need to try reading past the current
> location.  If you get a character, you've just modified the underlying
> stream.


These problems have been discussed before, and I hope they can be solved.  I agree with you that streams do not fit the range interface particularly well. I think the best solution might be to NOT use ranges to implement streams, but to allow applying ranges on top of streams for interfacing with other ranges.

Here is one past discussion you may find interesting: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=90971  This is the first posting, but towards the end is where I came to the realization that ranges and streams don't fit together perfectly.

-Steve
July 31, 2009
Daniel Keep wrote:
> Andrei Alexandrescu wrote:
>> Interesting. Could you please give more details about this? Why is
>> range-based I/O a bad idea, and what can we do to make it a better one?
> 
> (A clarification: I *should* have said "...basing IO entirely on ranges
> is -probably- a bad idea".)
> 
> <rambling>
> 
> My concern is the interface.
> 
> Let's take a hypothetical input range that reads from a file.  Since
> we're parsing XML, we want it to be character data.  So the interface
> might look something like:
> 
> struct Stream(T)
> {
>     T front();
>     bool empty();
>     void next();
> }
> 
> (I realise I probably got at least one name wrong; I can't be bothered
> digging up the exact names, and it's irrelevant anyway :P)

Yah, we had to choose popFront instead of the shorter next because there was no obvious corresponding "txen" to extract the last element.

> My concern is that front returns T: a single character.
> 
> I wrote an archival tool many, many years ago in VB.  It worked by
> reading and writing a single byte at a time, and naturally performed
> shockingly.  I knew there had to be a faster way since other programs
> didn't crawl like mine did and discovered that reading/writing in larger
> blocks gave significantly better performance. [1]

I see, and I'm glad to dissipate this concern. There are three interfaces that Phobos will define: byChar, byLine, and byBlock. So you get to choose the transfer unit and transfer mechanism. (byLine allows you to choose the separator too.) Nowadays I use text files often so I use byLine. It's very rare that you want to process input one character at a time, and indeed it would suck if the infrastructure insisted that that's the unit of transfer.

> Much of the performance of Tango's IO system (and of the XML parsing
> code, too) comes from operating on big arrays wherever it can.  Hell, the
> pull parser is, as far as anyone is able to tell, faster than every
> other XML parser in existence specifically because it reads the whole
> file in one IO operation and then just deals with slices and array access.

(That's great, but isn't the file sometimes a socket stream?)

I don't see this approach clashing with ranges: arrays are ranges, so this setup is very natural to implement with ranges.
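
(A minimal sketch of a range whose element is a whole block rather than a single byte. It uses std.range.chunks over an in-memory array as a stand-in for file contents, which is an assumption made to keep the example self-contained. In Phobos as it later shipped, the file-backed equivalents are File.byChunk and File.byLine; the byBlock name above was the plan at the time.)

```d
import std.range : chunks;

void main()
{
    // Stand-in for data that would normally come from a file or socket.
    ubyte[] data = new ubyte[10];
    foreach (i, ref b; data) b = cast(ubyte) i;

    size_t blocks;
    size_t total;
    uint checksum;

    // Each front is a slice of up to 4 bytes: one range step per block,
    // with plain array access inside the block.
    foreach (block; data.chunks(4))
    {
        ++blocks;
        total += block.length;
        foreach (b; block) checksum += b;
    }

    assert(blocks == 3);      // 4 + 4 + 2 bytes
    assert(total == 10);
    assert(checksum == 45);   // 0 + 1 + ... + 9
}
```

The range protocol is only paid once per block, so the inner loop is ordinary array iteration.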

> That's one half of my worry with this: that the range interface
> specifically precludes efficient batch operations.

Hope this went away.

> Another, somewhat smaller concern, is that the range interface is
> back-to-front for IO.
> 
> Consider a stream: you don't know if the stream is empty until you
> attempt to read past the end of it.  Standard input does this, network
> sockets do this... probably others.
>
> But the range interface asks "is this empty?", which you can't answer
> until you attempt to read from it.  So to implement .empty for a
> hypothetical stdin range, you'd need to try reading past the current
> location.  If you get a character, you've just modified the underlying
> stream.

Yah, however note that if you subsequently copy the range, the already-read front is also copied so there's no loss. Problems appear if you create e.g. two input ranges from the same FILE* or socket or whatnot.

Walter and I discussed this problem for a long time. I also discussed the problem in the newsgroup. I argued that the simplest and most natural interface for a pure input stream has only one function getNext which at the same time gets the element and bumps the stream. Unfortunately, since all forward ranges are also input ranges, that interface must also work well for all other ranges (e.g. arrays), in which case it would be contorted. We decided to define what we now have.

> (Actually, this is more of a concern for me in any situation where
> computing the next element of a range is an expensive operation, or an
> operation with side-effects.  I had the same issue when attempting to
> bind coroutines to the opApply interface.  You had to eagerly compute
> the next value in order to answer the question: is there a next element?)

Yah but you can always cache the result of the computation. The remaining annoyance is that the side effect occurs earlier than you'd expect.
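
(A minimal sketch of that caching, under stated assumptions: Source is a hypothetical stand-in for a stream whose only operation is a consuming read, and SourceRange is an illustrative wrapper, not a Phobos type. The asserts show both points: front is served from the cache, and the read happens as soon as empty is asked.)

```d
import std.stdio : writeln;

// Hypothetical stream: the only way to learn it has ended is to try reading.
struct Source
{
    int[] data;
    int reads;   // counts side-effecting read attempts

    bool tryRead(out int value)
    {
        ++reads;
        if (data.length == 0) return false;
        value = data[0];
        data = data[1 .. $];
        return true;
    }
}

// Input range that reads one element ahead, on demand, and caches it.
struct SourceRange
{
    private Source* src;
    private int cached;
    private bool primed, atEnd;

    this(Source* s) { src = s; }

    private void prime()
    {
        if (primed) return;
        atEnd = !src.tryRead(cached);
        primed = true;
    }

    @property bool empty() { prime(); return atEnd; }
    @property int front() { prime(); assert(!atEnd); return cached; }
    void popFront() { prime(); assert(!atEnd); primed = false; }
}

void main()
{
    auto src = Source([10, 20]);
    auto r = SourceRange(&src);

    assert(src.reads == 0);   // constructing the range reads nothing
    assert(!r.empty);         // asking "empty?" forces a read...
    assert(src.reads == 1);   // ...so the stream has already advanced
    assert(r.front == 10);    // front comes from the cache
    assert(src.reads == 1);   // and does not read again

    int sum;
    foreach (x; r) sum += x;
    writeln(sum);
}
```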

> Maybe these won't turn out to be problems in practice.  But my gut
> feeling is that IO would be better served by a Tango-style interface
> (putting the emphasis on efficient block transfers), with ranges
> wrapping that if you're willing to maybe take a performance hit.

I think we can do better by defining a general interface that will work for arrays as well as if hand-written.

The MSB is that ranges and block transfer are not at all in conflict.



Andrei
August 01, 2009
Michel Fortin wrote:
> Benji Smith wrote:
>>
>> Usually, I use something like XPath to extract information from an XML doc. Something like this:
>>
>>     auto doc = parser.parse(xml);
>>     auto nodes = doc.select("/root//whatever[0][@id]");
>>
>> I can see how you might do depth-first or breadth-first traversal of the DOM tree, or inorder traversal of the SAX events, with a range. But that's not how most people use XML. Are there other range tricks up your sleeve that would support a DOM or XPath kind of model?
> 
> A range is mostly a list of things. In the example above, doc.select could return a range to lazily evaluate the query instead of computing the whole query and returning all the elements. This way, if you only care about the first result you just take the first and don't have to compute them all.
> 
> Ranges can be used everywhere there are lists, and are especially useful for lazy lists that compute things as you go. I made an XML tokenizer (similar to Tango's pull parser) with a range API. Basically, you iterate over various kinds of tokens made available through an Algebraic, and as you advance it parses the document to get you the next token. (It'd be more useful if you could switch on various kinds of tokens with an Algebraic -- right now you need to use "if (token.peek!OpenElementToken)" -- but that's a problem with Algebraic that should get fixed I believe, or else I'll have to use something else.)

But XML documents aren't really lists. They're trees.

Do ranges provide an abstraction for working with trees (other than the obvious flattening algorithms, like breadth-first or depth-first traversal)?

--benji
August 01, 2009
On 2009-08-01 00:04:01 -0400, Benji Smith <dlanguage@benjismith.net> said:

> But XML documents aren't really lists. They're trees.
> 
> Do ranges provide an abstraction for working with trees (other than the obvious flattening algorithms, like breadth-first or depth-first traversal)?

Well, it depends on what level you look at. An XML document you read is first a list of bytes, then a list of Unicode characters, then you convert those characters to a list of tokens -- the Tango pull-parser sees each tag and each attribute as a token, SAX defines each tag (including attributes) as a token and calls it an event -- and from that list of tokens you can construct a tree.

The tree isn't a list though, and a range is a unidimensional list of something. You need another interface to work with the tree.

But then, from the tree, you can create a list in one way or another (flattening, or performing an XPath query for instance) and then have a range representing the list of subtrees for the query if you want. That's pretty good since with a range you can lazily iterate over the results.
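
(A minimal sketch of the flattening case in D. The Element class and the DepthFirst range are assumptions made for illustration, not part of any existing XML library. The range keeps an explicit stack and visits one node per popFront, so the tree is walked lazily in document order.)

```d
import std.algorithm : equal, map;

// Hypothetical DOM-like node.
class Element
{
    string name;
    Element[] children;

    this(string name, Element[] children = null)
    {
        this.name = name;
        this.children = children;
    }
}

// Lazy pre-order (depth-first) traversal exposed as an input range.
struct DepthFirst
{
    private Element[] stack;   // nodes still to visit, next one on top

    this(Element root) { if (root !is null) stack = [root]; }

    @property bool empty() { return stack.length == 0; }
    @property Element front() { return stack[$ - 1]; }

    void popFront()
    {
        auto node = stack[$ - 1];
        stack = stack[0 .. $ - 1];
        // Push children in reverse so the first child is visited next.
        foreach_reverse (child; node.children)
            stack ~= child;
    }
}

void main()
{
    auto root = new Element("root", [
        new Element("a", [new Element("a1")]),
        new Element("b"),
    ]);

    assert(DepthFirst(root).map!(e => e.name).equal(["root", "a", "a1", "b"]));
}
```

The tree itself still needs its own interface (children, parent and so on); the range only covers the "list of nodes in some order" view.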


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

August 02, 2009
Michel Fortin wrote:
> On 2009-08-01 00:04:01 -0400, Benji Smith <dlanguage@benjismith.net> said:
> 
>> But XML documents aren't really lists. They're trees.
>>
>> Do ranges provide an abstraction for working with trees (other than the obvious flattening algorithms, like breadth-first or depth-first traversal)?
> 
> Well, it depends on what level you look at. An XML document you read is first a list of bytes, then a list of Unicode characters, then you convert those characters to a list of tokens -- the Tango pull-parser sees each tag and each attribute as a token, SAX defines each tag (including attributes) as a token and calls it an event -- and from that list of tokens you can construct a tree.
> 
> The tree isn't a list though, and a range is a unidimensional list of something. You need another interface to work with the tree.
> 
> But then, from the tree, you can create a list in one way or another (flattening, or performing an XPath query for instance) and then have a range representing the list of subtrees for the query if you want. That's pretty good since with a range you can lazily iterate over the results.

Oh sure. I agree that a range-based way of iterating over tokens is cool. And a range-based API for walking through the results of an XPath query would be great. But the real meat and potatoes of an XML API would need to be something more DOM-like, with a tree structure.

The only reason I chimed in, in the first place, was Andrei's post saying that a replacement XML parser "ideally outputs ranges".

I don't think that's right. Ideally, an XML parser outputs a tree structure.

Though a range-based mechanism for traversing that tree would be nice too.

--benji