stream interfaces - with ranges
May 17, 2012
OK, so I had a couple of partially written replies on the 'deprecating
std.stream etc' thread, then I had to go home.

But I thought about this a lot last night, and some of the things Andrei
and others are saying are starting to make sense (I know!).  Now I've
scrapped those replies and am thinking about redesigning my i/o package
(most of the code can stay intact).

I'm a little undecided on some of the details, but here is what I think
makes sense:

1. We need a buffering input stream type.  This must have additional
methods besides the range primitives, because doing one-at-a-time byte
reads is not going to cut it.
2. I realized that a buffering input stream of type T is actually an input
range of type T[].  Observe:

struct /*or class*/ buffer(T)
{
     T[] buf;
     InputStream input;
     ...
     @property T[] front() { return buf; }
     void popFront() {input.read(buf);} // flush existing buffer, read next.
     @property bool empty() { return buf.length == 0;}
}

Roughly speaking (not all the details are handled), this makes a
feasible input range that will perform quite nicely for things like
std.algorithm.copy.  I haven't checked, but copy should be able to handle
transferring a range of type T[] to an output range with element type T;
if it's not able to, it should be made to work.  I know that, at least, an
output stream with element type T supports putting T or T[].  What I think
really makes sense is to support:

buffer!ubyte b;
outputStream o;

o.put(b); // uses range primitives to put all the data to o, one element
          // (i.e. ubyte[]) of b at a time
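
To make the idea in point 2 concrete, here is a minimal sketch of a buffered input range of ubyte[] chunks. The names `ArrayStream` and `Buffered`, and the convention that `read` returns the number of elements written (0 at end of stream), are assumptions made for illustration; they are not taken from std.io.

```d
import std.range.primitives : isInputRange, ElementType;

/// Hypothetical unbuffered stream: read() fills as much of `dst` as it can
/// and returns the number of elements written (0 means end of stream).
struct ArrayStream
{
    const(ubyte)[] data;

    size_t read(ubyte[] dst)
    {
        import std.algorithm.comparison : min;
        immutable n = min(dst.length, data.length);
        dst[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

/// Input range of ubyte[] chunks layered on any stream with read(ubyte[]).
struct Buffered(Stream)
{
    private Stream input;
    private ubyte[] storage; // fixed-size backing buffer, reused every time
    private ubyte[] chunk;   // valid part of storage; empty at end of stream

    this(Stream input, size_t bufferSize)
    {
        this.input = input;
        storage = new ubyte[bufferSize];
        popFront(); // prime the first chunk
    }

    @property bool empty() const { return chunk.length == 0; }
    @property ubyte[] front() { return chunk; }

    void popFront()
    {
        // Discard the current chunk and read the next one.
        immutable n = input.read(storage);
        chunk = storage[0 .. n];
    }
}

static assert(isInputRange!(Buffered!ArrayStream));
static assert(is(ElementType!(Buffered!ArrayStream) == ubyte[]));

unittest
{
    ubyte[] src = [1, 2, 3, 4, 5];
    ubyte[] collected;
    foreach (chunk; Buffered!ArrayStream(ArrayStream(src), 2))
        collected ~= chunk; // copies out: popFront overwrites the chunk
    assert(collected == src);
}
```

Unlike the struct above, this version reslices the buffer after each read, so a short final read shows up as a short chunk and an empty chunk marks the end.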


3. An ultimate goal of the i/o streaming package should be to be able to
do this:

auto x = new XmlParser("<rootElement></rootElement>");

or at least

auto x = new XmlParser(buffered("<rootElement></rootElement>"));

So I think arrays need to be able to be treated as buffering streams.  I
tried really hard to think of some way to make this work with my existing
system, but I don't think it will without unnecessary baggage and the loss
of interoperability with existing range functions.

Where does this leave us?

1. I think we need, as Andrei says, an unbuffered streaming abstraction.
I think I have this down pretty solidly in my current std.io.
2. A definition of a buffering range, in terms of what additional
primitives the range should have.  The primitives should support buffered
input and buffered output (these are two separate algorithms), but
independently (possibly allowing switching for rw files).
3. An implementation of the above definition hooked to the unbuffered
stream abstraction, to be utilized in more specific ranges.  But by
itself, can be used as an input range or directly by code.
4. Specialization ranges for each type of input you want (i.e. byLine,
byChunk, textStream).
5. Full replacement option of File backend.  File will start out with
C-supported calls, but any "promotion" to using a more D-like range type
will result in switching to a D-based stream using the above mechanisms.
Of course, all existing code should compile that does not try to assume
the File always has a valid FILE *.

What do you all think?  I'm going to work out what the definition of 2
should be, based on what I've written and what makes sense.

Have I started to design something feasible or unworkable? :)

-Steve
May 17, 2012
On 5/17/12 9:02 AM, Steven Schveighoffer wrote:
> 1. We need a buffering input stream type. This must have additional
> methods besides the range primitives, because doing one-at-a-time byte
> reads is not going to cut it.

I was thinking a range of T[] could be enough for a buffered input range.

> 2. I realized, buffering input stream of type T is actually an input range
> of type T[]. Observe:

Ah, there we go :o).

> struct /*or class*/ buffer(T)
> {
> T[] buf;
> InputStream input;
> ...
> @property T[] front() { return buf; }
> void popFront() {input.read(buf);} // flush existing buffer, read next.
> @property bool empty() { return buf.length == 0;}
> }
>
> Roughly speaking, not all the details are handled, but this makes a
> feasible input range that will perform quite nicely for things like
> std.algorithm.copy. I haven't checked, but copy should be able to handle
> transferring a range of type T[] to an output range with element type T,
> if it's not able to, it should be made to work.

We can do this for copy, but if we need to specialize a lot of other algorithms, maybe we didn't strike the best design.

> I know at least, an
> output stream with element type T supports putting T or T[].

Right.

> What I think
> really makes sense is to support:
>
> buffer!ubyte b;
> outputStream o;
>
> o.put(b); // uses range primitives to put all the data to o, one element
> (i.e. ubyte[]) of b at a time

I think that makes sense.

> 3. An ultimate goal of the i/o streaming package should be to be able to
> do this:
>
> auto x = new XmlParser("<rootElement></rootElement>");
>
> or at least
>
> auto x = new XmlParser(buffered("<rootElement></rootElement>"));
>
> So I think arrays need to be able to be treated as a buffering streams. I
> tried really hard to think of some way to make this work with my existing
> system, but I don't think it will without unnecessary baggage, and losing
> interoperability with existing range functions.

I think we can create a generic abstraction buffered() that layers buffering on top of an input range. If the input range has an unbuffered read capability, buffered() would use it. Otherwise, it would loop using empty, front, and popFront.
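
A rough sketch of what such a buffered() could look like, under some assumptions: the bulk primitive is taken to be a member `read(T[])` returning the number of elements written, and `BufferedRange`, `Countdown` and the default buffer size are illustrative names and values only.

```d
import std.range.primitives;
import std.traits : Unqual;

/// Input range of T[] chunks layered over any input range `Source`.
struct BufferedRange(Source)
if (isInputRange!Source)
{
    alias T = Unqual!(ElementType!Source);

    private Source source;
    private T[] storage; // reused for every chunk
    private T[] chunk;   // valid part of storage; empty once source is drained

    this(Source source, size_t bufferSize)
    {
        this.source = source;
        storage = new T[bufferSize];
        popFront(); // prime the first chunk
    }

    @property bool empty() const { return chunk.length == 0; }
    @property T[] front() { return chunk; }

    void popFront()
    {
        size_t n;
        static if (is(typeof(source.read(storage)) : size_t))
        {
            // The source offers a bulk read primitive, so use it.
            n = source.read(storage);
        }
        else
        {
            // Otherwise fall back to the plain range primitives.
            for (; n < storage.length && !source.empty; source.popFront())
                storage[n++] = source.front;
        }
        chunk = storage[0 .. n];
    }
}

/// Convenience wrapper; the default size is an arbitrary choice.
auto buffered(Source)(Source source, size_t bufferSize = 4096)
{
    return BufferedRange!Source(source, bufferSize);
}

/// Illustrative source that is an input range and also offers read().
struct Countdown
{
    int next;
    @property bool empty() const { return next == 0; }
    @property int front() const { return next; }
    void popFront() { --next; }

    size_t read(int[] dst)
    {
        size_t n;
        while (n < dst.length && next > 0)
            dst[n++] = next--;
        return n;
    }
}

unittest
{
    int[][] viaFallback;
    foreach (c; buffered([1, 2, 3, 4, 5], 2)) // arrays have no read()
        viaFallback ~= c.dup;                 // dup: the storage is reused
    assert(viaFallback == [[1, 2], [3, 4], [5]]);

    int[][] viaRead;
    foreach (c; buffered(Countdown(3), 2))    // uses Countdown.read
        viaRead ~= c.dup;
    assert(viaRead == [[3, 2], [1]]);
}
```

The choice between the two paths is made at compile time with `static if`, so the fallback loop costs nothing for sources that can read in bulk.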

> Where does this leave us?
>
> 1. I think we need, as Andrei says, an unbuffered streaming abstraction.
> I think I have this down pretty solidly in my current std.io.

Great. What are the primitives?

> 2. A definition of a buffering range, in terms of what additional
> primitives the range should have. The primitives should support buffered
> input and buffered output (these are two separate algorithms), but
> independently (possibly allowing switching for rw files).

Sounds good.

> 3. An implementation of the above definition hooked to the unbuffered
> stream abstraction, to be utilized in more specific ranges. But by
> itself, can be used as an input range or directly by code.

Hah, I can't believe I wrote about the same thing above (and I swear I didn't read yours).

> 4. Specialization ranges for each type of input you want (i.e. byLine,
> byChunk, textStream).

What is the purpose? To avoid unnecessary double buffering?

> 5. Full replacement option of File backend. File will start out with
> C-supported calls, but any "promotion" to using a more D-like range type
> will result in switching to a D-based stream using the above mechanisms.
> Of course, all existing code should compile that does not try to assume
> the File always has a valid FILE *.

This will be tricky but probably doable.


Andrei
May 17, 2012
On Thu, 17 May 2012 11:46:18 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 5/17/12 9:02 AM, Steven Schveighoffer wrote:
>> Roughly speaking, not all the details are handled, but this makes a
>> feasible input range that will perform quite nicely for things like
>> std.algorithm.copy. I haven't checked, but copy should be able to handle
>> transferring a range of type T[] to an output range with element type T,
>> if it's not able to, it should be made to work.
>
> We can do this for copy, but if we need to specialize a lot of other algorithms, maybe we didn't strike the best design.

Right.  The thing is, buffered streams are good as plain ranges for one thing -- forwarding data.  There probably aren't many algorithms in std.algorithm that are applicable.  And there is always the put idiom: Appender.put(buf) should work to accumulate all data into an array, which can then be used as a normal range.

One thing that worries me: if you did something like array(bufferedStream), it would accumulate N copies of the buffer reference, which wouldn't be what you want at all.  Of course, you could apply map to the buffer range to dup each chunk.
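
The aliasing problem and the map/dup workaround can be shown with a small stand-in for a buffered stream. `ReusingChunks` is a hypothetical range written only for this illustration; the point is that every `front` is a slice of the same storage.

```d
import std.algorithm.iteration : map;
import std.array : array;

/// Illustrative chunk range that reuses one buffer for every front, the way
/// a buffered stream would.
struct ReusingChunks
{
    private const(int)[] remaining;
    private int[] storage;
    private int[] chunk;

    this(const(int)[] data, size_t chunkSize)
    {
        remaining = data;
        storage = new int[chunkSize];
        popFront();
    }

    @property bool empty() const { return chunk.length == 0; }
    @property int[] front() { return chunk; }

    void popFront()
    {
        import std.algorithm.comparison : min;
        immutable n = min(storage.length, remaining.length);
        storage[0 .. n] = remaining[0 .. n]; // overwrites the previous chunk
        remaining = remaining[n .. $];
        chunk = storage[0 .. n];
    }
}

unittest
{
    // array() stores the slices themselves, and they all alias `storage`,
    // so every entry shows whatever was read last.
    auto aliased = ReusingChunks([1, 2, 3, 4], 2).array;
    assert(aliased == [[3, 4], [3, 4]]);

    // Duplicating each chunk before collecting keeps the data intact.
    auto copied = ReusingChunks([1, 2, 3, 4], 2).map!(c => c.dup).array;
    assert(copied == [[1, 2], [3, 4]]);
}
```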

>> 3. An ultimate goal of the i/o streaming package should be to be able to
>> do this:
>>
>> auto x = new XmlParser("<rootElement></rootElement>");
>>
>> or at least
>>
>> auto x = new XmlParser(buffered("<rootElement></rootElement>"));
>>
>> So I think arrays need to be able to be treated as a buffering streams. I
>> tried really hard to think of some way to make this work with my existing
>> system, but I don't think it will without unnecessary baggage, and losing
>> interoperability with existing range functions.
>
> I think we can create a generic abstraction buffered() that layers buffering on top of an input range. If the input range has unbuffered read capability, buffered() would use those. Otherwise, it would use loops using empty, front, and popFront.

Right, this is different from my proposed buffer implementation, which puts a buffer on top of an unbuffered input *stream*.  But of course, we can define it for both, since it will be a compile-time interface.

>> Where does this leave us?
>>
>> 1. I think we need, as Andrei says, an unbuffered streaming abstraction.
>> I think I have this down pretty solidly in my current std.io.
>
> Great. What are the primitives?

See here:
https://github.com/schveiguy/phobos/blob/new-io2/std/io.d#L170

Through IODevice.  The BufferedStream type is going to be redone as a range.

>> 3. An implementation of the above definition hooked to the unbuffered
>> stream abstraction, to be utilized in more specific ranges. But by
>> itself, can be used as an input range or directly by code.
>
> Hah, I can't believe I wrote about the same thing above (and I swear I didn't read yours).

Well, not quite :)  You wrote about it being supported by an underlying range, I need to have it supported by an underlying stream.  We probably need both.  But yeah, I think we are mostly on the same page here.

>> 4. Specialization ranges for each type of input you want (i.e. byLine,
>> byChunk, textStream).
>
> What is the purpose? To avoid unnecessary double buffering?

No, a specialization range *uses* a buffer range as its backing.  A buffer range I think is necessarily going to be a reference type (probably a class). The specialized range won't replace the buffer range, in other words.

Something like byLine is going to do the work of extracting lines from the buffer, it will reference the buffer data directly.  But it won't reimplement buffering.

>> 5. Full replacement option of File backend. File will start out with
>> C-supported calls, but any "promotion" to using a more D-like range type
>> will result in switching to a D-based stream using the above mechanisms.
>> Of course, all existing code should compile that does not try to assume
>> the File always has a valid FILE *.
>
> This will be tricky but probably doable.

Doing this will unify all the i/o packages together into one interface -- File.  I think it's a bad story for D if you have 2 ways of doing i/o (or at least 2 ways of doing the *same thing* with i/o).

-Steve
May 18, 2012
I think the range interface is not useful for *efficient* IO. The IO interface we need will be more *abstract* than the range primitives.

---
If you use the range interface to read bytes from a device, you will always do blocking IO, even if the device is a socket. That is not efficient.

auto sock = new TcpSocketDevice();
if (!sock.empty) { auto e = sock.front; }
  // In the empty primitive, we *must* wait until the socket receives one
  // or more bytes or is really disconnected.
  // If not, what exactly does sock.front return?
  // So using the range interface for socket reading enforces blocking IO.
  // It is *really* inefficient.
---
I think IO primitives must be distinct from range ones for the reasons mentioned above...

I'm designing experimental IO primitives: https://github.com/9rnsr/dio

I call the input stream "source" and the output stream "sink".
A "source" has a 'pull' primitive and a "sink" has a 'push' primitive, and
both can avoid blocking.
If you want to construct an input range interface from a "source", you
should use the 'ranged' helper function in the io.core module. 'ranged'
returns a wrapper object whose front method reads bytes from the
"source" and blocks if not enough bytes have been read yet.

In other words, ranges are not almighty. We should think about distinct primitives for IO.

Kenji Hara

May 18, 2012
On Thursday, 17 May 2012 at 14:02:09 UTC, Steven Schveighoffer wrote:
> 2. I realized, buffering input stream of type T is actually an input range of type T[].

The trouble is, why a slice? Why not an std.container.Array? Why not some other data source?
(Chicken/egg problem...)

Another problem I've noticed is the following:

Say you're tokenizing some input range, and it happens to just be a huge, gigantic string.

It *should* be possible to turn it into tokens with slices referring to the ORIGINAL string, which is VERY efficient because it doesn't require *any* heap allocations whatsoever. (You just tokenize with opApply() as you go, without ever requiring a heap allocation...)

However, this is *only* possible if you don't use the concept of an input range!

Since you can't slice an input range, you'd be forced to use the front() and popFront() properties. But, as soon as you do that, you're gonna have to store the data somewhere... so your next-best option is to append it to some new gigantic array (instead of a bunch of small arrays, which require a lot of heap allocations), but even then, it's not as efficient as possible, because there's O(n) extra memory involved -- which defeats the whole purpose of working on small chunks at a time with no heap allocations.
(If you're going to do that, after all, you might as well read the entire thing into a giant string at the beginning, and work with an array anyway, discarding the whole idea of a range while doing your tokenization.)


Any ideas on how to solve this problem?
May 18, 2012
On Friday, 18 May 2012 at 07:52:57 UTC, Mehrdad wrote:
> Any ideas on how to solve this problem?

Provide slicing if the underlying data source is compatible.

I have the same need in my DCT, and so far I've gone with a custom implementation (not on GitHub yet), but I plan to reuse std.io as soon as it is more or less stable and usable.
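
As a minimal sketch of what slicing buys when the source is a plain string: the tokenizer below is still an input range, but each token it yields is a slice of the original text, so no token data is copied or allocated. `Tokens` and the split-on-spaces rule are assumptions made for the example, not code from DCT or std.io.

```d
/// Splits `text` on ASCII spaces; every token is a slice of `text` itself.
struct Tokens
{
    private string rest;  // unscanned tail of the original string
    private string token; // current token, a view into the original
    private bool done;

    this(string text)
    {
        rest = text;
        popFront(); // find the first token
    }

    @property bool empty() const { return done; }
    @property string front() const { return token; }

    void popFront()
    {
        size_t start = 0;
        while (start < rest.length && rest[start] == ' ')
            ++start;
        if (start == rest.length)
        {
            done = true;
            return;
        }
        size_t end = start;
        while (end < rest.length && rest[end] != ' ')
            ++end;
        token = rest[start .. end]; // slicing copies nothing
        rest = rest[end .. $];
    }
}

unittest
{
    string text = "let x = 42";
    size_t count;
    foreach (tok; Tokens(text))
    {
        // Each token lies inside the memory of `text`.
        assert(tok.ptr >= text.ptr);
        assert(tok.ptr + tok.length <= text.ptr + text.length);
        ++count;
    }
    assert(count == 4);
}
```

A generic version could be constrained with `hasSlicing` from std.range.primitives and fall back to copying only for sources that cannot be sliced.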
May 18, 2012
On 05/18/12 06:19, kenji hara wrote:
> I think range interface is not useful for *efficient* IO. The expected IO interface will be more *abstract* than range primitives.
> 
> ---
> If you use range I/F to read bytes from device, we will always do blocking IO - even if the device is socket. It is not efficient.
> 
> auto sock = new TcpSocketDevice();
> if (sock.empty) { auto e = sock.front; }
>   // In empty primitive, we *must* wait the socket gets one or more
> bytes or really disconnected.

No. 'empty' has to return true only _after_ seeing EOF.

Something like 'available' can return the number of elements known to be fetchable w/o blocking. [1]

>   // If not, what exactly returns sock.front?

EWOULDBLOCK :^)

But, yes, it needs to block, as there's no generic way to return
EAGAIN/EWOULDBLOCK. This is where the primitive returning a slice
comes in - that one /can/ return an empty slice.
So '!r.empty && r.fronts.length==0' is the equivalent of EAGAIN.
(and note i'm oversimplifying -- 'fronts' can return something that
/acts/ as a slice; which is what i'm in fact doing)
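
A small sketch of that convention, with assumptions stated up front: `NonBlockingSource`, `popFronts`, `Status` and `poll` are hypothetical names, and the fields are set by hand here where a real device would be filled by the OS.

```d
/// Illustrative non-blocking source: `pending` is data that has arrived,
/// `closed` is set once end of stream has been signalled.
struct NonBlockingSource
{
    ubyte[] pending;
    bool closed;

    /// True only after EOF has been seen and all data has been consumed.
    @property bool empty() const { return closed && pending.length == 0; }

    /// Everything fetchable right now; an empty slice means "try again".
    @property ubyte[] fronts() { return pending; }

    /// Number of elements fetchable without blocking.
    @property size_t available() const { return pending.length; }

    /// Drops `n` consumed elements from the front.
    void popFronts(size_t n) { pending = pending[n .. $]; }
}

enum Status { data, wouldBlock, eof }

/// Classifies the source without ever blocking.
Status poll(ref NonBlockingSource s)
{
    if (s.empty)
        return Status.eof;
    // Not at EOF but nothing to hand out: the EAGAIN case.
    return s.fronts.length == 0 ? Status.wouldBlock : Status.data;
}

unittest
{
    auto s = NonBlockingSource(null, false);
    assert(poll(s) == Status.wouldBlock); // open, but no data yet

    s.pending = [1, 2];
    assert(poll(s) == Status.data);
    assert(s.available == 2);

    s.popFronts(2);
    s.closed = true;
    assert(poll(s) == Status.eof);
}
```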

>   // Then using range interface for socket reading enforces blocking
> IO. It is *really* inefficient.

> I think IO primitives must be distinct from range ones for the reasons mentioned above...
> 
> I'm designing experimental IO primitives: https://github.com/9rnsr/dio
> 
> I call the input stream "source", and call output stream "sink".
> "source" has a 'pull' primitive, and sink has 'push' primitive, and
> they can avoid blocking.
> If you want to construct input range interface from "source", you
> should use 'ranged' helper function in io.core module. 'ranged'
> returns a wrapper object, and in its front method, It reads bytes from
> "source", and if the read bytes not sufficient, blocks the input.
> 
> In other words, range is not almighty. We should think distinct primitives for the IO.

Well, your 'pull' and 'push' are just different names for my 'fronts' and 'puts' (modulo the data transfer interface, which can be done both ways using a set of overloads, hence it doesn't matter).

I don't see any reason to invent yet another abstraction, when ranges can be made to work with some improvements.

Ranges are just a convention; not a perfect one, but having /one/, not
two or thirteen, is valuable. If you think ranges are flawed the
discussion should be about ripping out every trace of them from the
language and libraries and replacing them with something better. If
you think that would be bad - well, having tens of different incompatible
abstractions isn't good either. (and, yes, you can provide glue so that
they can interact, but that does not scale well)

Hmm, how are 'flush()' and 'commit()' supposed to work? Is data lost
if you omit one or both of them?

artur

[1] Reminds me:

   struct S(T) {
      shared T a;
      @property size_t available()() { return a; }
   }

The compiler infers 'available' as 'pure', which, depending on the
definition of 'shared', is wrong. ('shared' /shouldn't/ imply 'volatile',
but, as it is now, it does - so omitting a call to 'available' would
be wrong)

May 18, 2012
On 18.05.2012 8:19, kenji hara wrote:
> I think range interface is not useful for *efficient* IO. The expected
> IO interface will be more *abstract* than range primitives.
>
> ---
> If you use range I/F to read bytes from device, we will always do
> blocking IO - even if the device is socket. It is not efficient.
>
> auto sock = new TcpSocketDevice();
> if (sock.empty) { auto e = sock.front; }
>    // In empty primitive, we *must* wait the socket gets one or more
> bytes or really disconnected.
>    // If not, what exactly returns sock.front?
>    // Then using range interface for socket reading enforces blocking
> IO. It is *really* inefficient.
> ---

There is no problem with a blocking _interface_. That is the facade. The actual work can happen in a background thread (and in fact it often does).
So while you work with the first chunk, the next one is downloaded behind the scenes.
Just take a look at std.net.curl and all these byChunkAsync ... and then there is vibe.d, which shows that having a blocking interface for asynchronous i/o is alright.

-- 
Dmitry Olshansky
May 18, 2012
On 05/18/12 13:34, Dmitry Olshansky wrote:
> On 18.05.2012 8:19, kenji hara wrote:
>> I think range interface is not useful for *efficient* IO. The expected IO interface will be more *abstract* than range primitives.
>>
>> ---
>> If you use range I/F to read bytes from device, we will always do blocking IO - even if the device is socket. It is not efficient.
>>
>> auto sock = new TcpSocketDevice();
>> if (sock.empty) { auto e = sock.front; }
>>    // In empty primitive, we *must* wait the socket gets one or more
>> bytes or really disconnected.
>>    // If not, what exactly returns sock.front?
>>    // Then using range interface for socket reading enforces blocking
>> IO. It is *really* inefficient.
>> ---
> 
> There is no problem with blocking _interface_. That is the facade. The actual work can happen in background thread (and in fact it often is).
> So while you work with first chunk the next one is downloaded behind the scenes.
> Just take a look at std.net.curl all these asyncByChunk ... and then there is vide.d that shows that having blocking interface for asynchronous i/o is alright.

I just took a look, and yes, that's yet another slightly different implementation of the same thing with a somewhat different interface:

   https://github.com/rejectedsoftware/vibe.d/blob/399b7a9d6eba9b14ea8d2215498daf53bd8d27d8/source/vibe/stream/stream.d

I thought i was exaggerating when i said 'thirteen', but there are already more of them mentioned in this thread than i could count on one hand...

This one has an implicit flush and also this: "Finalize has to be called on certain types of streams.". Not to mention it's class based.

artur
May 18, 2012
On Fri, 18 May 2012 00:19:45 -0400, kenji hara <k.hara.pg@gmail.com> wrote:

> I think range interface is not useful for *efficient* IO. The expected
> IO interface will be more *abstract* than range primitives.

If all you are doing is consuming data and processing it, range interface is efficient.  Most streaming implementations that are synchronous use:

1. read block of data from low-level source into buffer
2. process buffer
3. If there is still data left, go to step 1.

1 is done via popFront, 2 is done via front.

3 is somewhat available via empty, but empty kind of depends on reading data.  I think it can work.

It's not the ideal interface for all aspects of i/o, but it does map to ranges, and for single purpose tasks (such as parse an XML file), it will be most efficient.
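
The three steps map onto a range of chunks like this. It is only a sketch: `sumBytes` is a made-up consumer, and any input range whose elements convert to const(ubyte)[] would do as the source.

```d
import std.range.primitives;

/// Sums every byte of a chunked input, touching each chunk exactly once.
ulong sumBytes(R)(R chunks)
if (isInputRange!R && is(ElementType!R : const(ubyte)[]))
{
    ulong total;
    // `empty` is the "still data left?" test (step 3) and
    // `popFront` fetches the next block (step 1).
    for (; !chunks.empty; chunks.popFront())
    {
        // `front` exposes the current buffer for processing (step 2).
        foreach (b; chunks.front)
            total += b;
    }
    return total;
}

unittest
{
    ubyte[] first = [1, 2, 3];
    ubyte[] second = [4];
    assert(sumBytes([first, second]) == 10);
}
```

The same function should accept what std.stdio's File.byChunk returns, since that is also an input range of ubyte[].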

> ---
> If you use range I/F to read bytes from device, we will always do
> blocking IO - even if the device is socket. It is not efficient.
>
> auto sock = new TcpSocketDevice();
> if (sock.empty) { auto e = sock.front; }
>   // In empty primitive, we *must* wait the socket gets one or more
> bytes or really disconnected.
>   // If not, what exactly returns sock.front?
>   // Then using range interface for socket reading enforces blocking
> IO. It is *really* inefficient.
> ---

sockets do not have to be blocking, and I/O does not have to use the range portion of the interface.

And efficient I/O has little to do with synchronicity and more to do with reading a large amount of data at a time instead of byte by byte.

Using multiple threads or fibers, and using OS primitives such as select or poll, can make I/O quite efficient and allow you to do other things while no I/O is happening.  These will not happen with the range interface, but will be available through other interfaces.

> I think IO primitives must be distinct from range ones for the reasons
> mentioned above...

Yes, I agree.  But ranges can be *mapped* to stream primitives.

> I'm designing experimental IO primitives:
> https://github.com/9rnsr/dio

I'll take a look.

>
> In other words, range is not almighty. We should think distinct
> primitives for the IO.

100% agree.  The main thing I realized that brought me to propose the "range-based" (if you can call it that) version is that:

1. Ranges can be readily mapped to stream primitives *if* you use the concept of a range of T[] vs. a range of T.  So in essence, without changing anything I can slap on a range interface for free.
2. Arrays make very efficient data sources, and are easy to create.  We need a way to hook stream-using code onto an array.

But be clear, I am *not* going to remove the existing stream I/O primitives I had for buffered i/o, I'm rather *adding* range primitives as well.

-Steve