April 07, 2010
On 03/30/2010 06:03 AM, Steve Schveighoffer wrote:
> Can you detail the reader/writer operations?  Because popNext and front ain't gonna cut it for streams.  Most of the time the element size you want is defined by the application (and not easily abstracted), not the range, but the range is in charge of the element size (via front).

I think there's no necessity that a range has a constant-size element type. If the element type of the range is e.g. ubyte[], then it can be free to transfer as much data as it wants at each popFront().

> While I believe ranges can be useful for streams, they are not the best interface for all applications.  For example, if I have a protocol that reads 2 bytes to get a length, and then reads length bytes from the stream, a range of those units is probably a good abstraction.  But I don't want to resort to C calls to create that abstraction -- there should be a nice D layer in between.  I should not have to create my own buffering solution.  I/O performance is more important IMO than interface when it comes to streams.  This does not mean big-O complexity, I'm talking about raw performance.

I think the interface you described can be easily modeled as a range of ubyte[]. Again, a range of ubyte[] doesn't have to return the same number of elements each step.
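
A minimal sketch of that idea, assuming an in-memory source and a 2-byte big-endian length prefix (both are assumptions made for illustration, as is the name ByPacket). Each element is a ubyte slice whose length differs from step to step:

```d
import std.range.primitives : isInputRange;

/// Hypothetical range: each element is one length-prefixed packet.
struct ByPacket
{
    private const(ubyte)[] data;    // bytes not yet consumed
    private const(ubyte)[] current; // payload of the current packet
    private bool done;

    this(const(ubyte)[] source)
    {
        data = source;
        popFront(); // prime the first packet
    }

    @property bool empty() const { return done; }

    @property const(ubyte)[] front() const { return current; }

    void popFront()
    {
        // Stop when there is no room left for a length prefix.
        if (data.length < 2) { done = true; current = null; return; }
        // 2-byte big-endian length (an assumption of this sketch).
        size_t len = (data[0] << 8) | data[1];
        // Stop on a truncated packet as well.
        if (data.length < 2 + len) { done = true; current = null; return; }
        current = data[2 .. 2 + len];
        data = data[2 + len .. $];
    }
}

static assert(isInputRange!ByPacket);
```

The elements are slices of the source, so nothing is copied here; a stream-backed version would need a buffer somewhere.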

> I hope we can see a design before you commit to doing it this way. For example, a zip library uses a range as a source, what does the file range look like that satisfies the range properties and also is efficient?  Just seeing the API should be enough to judge.

Feel free to suggest one. I won't be able to until I study the zip file format.

> And are there plans to make a good abstracted library for streams that custom ranges can be built upon?

Not sure I understand this.


Andrei
April 07, 2010
On 04/07/2010 04:09 AM, Lars Tandle Kyllingstad wrote:
> Sorry, I totally forgot to answer this one.  I don't know if I've earned the right to vote here yet, but if so I vote 'yes'. (Or was it 'yarrrr'?)
>
> Regarding the use of std.stdio.File: Though I agree with Steve that a native D solution for buffered IO would be way better than relying on FILE*, I really like the *interface* of File. (In particular, I like the byLine and byChunk ranges.)
>
> And that's really all we need to worry about at the moment -- find a good interface now, improve the implementation later. A good interface should be as implementation independent as possible, so there will be minimal breakage to user code later.

Agreed.

BTW File will receive a couple more ranges: byDchar which spans the file one dchar at a time, byWchar which is sometimes necessary due to the awkward fwide API (see http://www.opengroup.org/onlinepubs/007908775/xsh/fwide.html), and byChar for getting one narrow character at a time. (Or should that be byUbyte or something?)

As I mentioned in another email, these are implementable with good efficiency using nonstandard APIs. Otherwise each iteration will cost one call to fgetc and one call to ungetc.
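
A rough sketch of the portable fallback, assuming only the standard C calls fgetc and ungetc. The name ByCharPortable is hypothetical and this is not how std.stdio implements anything; it only shows why every look at the next character costs a read plus a push-back:

```d
import core.stdc.stdio : FILE, EOF, fgetc, ungetc;

/// Hypothetical one-character-at-a-time range over a C stream.
struct ByCharPortable
{
    private FILE* fp;

    @property bool empty()
    {
        int c = fgetc(fp);   // read one character to look ahead...
        if (c == EOF) return true;
        ungetc(c, fp);       // ...and push it straight back
        return false;
    }

    @property char front()
    {
        int c = fgetc(fp);   // same peek: read, then push back
        ungetc(c, fp);
        return cast(char) c;
    }

    void popFront()
    {
        fgetc(fp);           // consume the character for real
    }
}
```

Because the character is always pushed back, other code reading the same FILE* still sees it until popFront is called.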


Andrei
April 08, 2010
Forwarding back to the phobos mailing list; my mail client messed up and I accidentally made this a private conversation.



----- Original Message ----
> From: Andrei Alexandrescu <andrei at erdani.com>
> 
> On 04/08/2010 07:14 AM, Steve Schveighoffer wrote:
> > I agree that a range of such packets is possible, but the problem is, how does one build such a range?  If all you have as an interface to a network socket is a range, how do you make *that* range spit out what you need?  In other words, you want a D abstraction to the syscalls, I think we all agree on that.  I don't think a range abstraction is good enough for all purposes.
>
> The network socket is not a range, it's a File, and File does have primitives such as rawWrite and rawRead, which we can add to and improve.
> 
> File offers ranges, but you're not required to use them.

That's not what I read from Walter's comment...  He indicated that something like an e.g. zip library should take a range as input.  This implies that all streams are shoehorned into range form.

Using File is more like what I thought it should be.  If this is the case, then I think we can have a workable solution.

> > If you follow through the logic of how such a system would be built, I think the best abstraction is a layer that abstracts out the read/write functionality (unbuffered), and then build ranges/buffered i/o on top of that.  The abstraction can be compile-time, we don't need to do interfaces here.
> 
> Makes sense. I'm just a bit worried about stdio's poor buffering interface. It only offers setvbuf(), which is quite opaque.

The only reason to use FILE * as the underlying implementation is to be compatible with C's (f)printf.  It makes sense that you only need that compatibility for printing to a standard handle.  I think we can probably come up with an abstraction layer that uses FILE* only when dealing with standard handles.

In that case, we are no longer limited to FILE*'s capabilities for other I/O types (e.g. sockets, IPC).

> > A simple answer here would be: "A x range of type T"  where you substitute x for input, forward, etc. and T for the type returned by front.  And I'm not talking about the zip library, I'm talking about the generic file/network stream.  It can't know that it's being used by a zip function.
> 
> I'd speculate that the zip file interface would need a seekable range - a range that is forward, but can be positioned with an extra primitive seek().
> 
> The element type of the range would be ubyte[]. The number of bytes transferred at a step should be settable via another primitive. So:
> 
> struct SeekableBufRange {
>   // Range primitives
>   @property bool empty();
>   @property ubyte[] front();
>   void popFront();
>   // Extra primitives
>   @property size_t bufsize();
>   @property void bufsize(size_t);
>   @property ulong position();
>   @property void position(ulong);
> }
> 
> How's that sound? This is one range that File could expose directly.

Horrible.  You are replacing a single function (rawRead) with all these functions:

empty()
front()
popFront()
bufsize()
bufsize(size_t)

That doesn't even cover the awkwardness of how the code now has to read N bytes (a single line with rawRead):

// read N bytes
source.bufsize = N;
auto data = source.front();
source.popFront();

And it also doesn't cover how you now have to tack on these functions to standard memory ranges.  Or how the stream-based range has to handle awkward situations where someone might call front several times before calling popFront (not possible with rawRead).  Or how you have no control over the inevitable buffering scheme required to support such awkwardness.

-Steve




April 08, 2010
On 04/08/2010 01:23 PM, Steve Schveighoffer wrote:
>> The network socket is not a range, it's a File, and File does have primitives such as rawWrite and rawRead, which we can add to and improve.
>>
>> File offers ranges, but you're not required to use them.
>
> That's not what I read from Walter's comment...  He indicated that something like an e.g. zip library should take a range as input. This implies that all streams are shoehorned into range form.

If the zip library works with ranges, we can use it for transparently handling in-memory zip manipulation and also zip file manipulation.

>> Makes sense. I'm just a bit worried about stdio's poor buffering interface. It only offers setvbuf(), which is quite opaque.
>
> The only reason to use FILE * as the underlying implementation is to be compatible with C's (f)printf.  It makes sense that you only need that compatibility for printing to a standard handle.  I think we can probably come up with an abstraction layer that uses FILE* only when dealing with standard handles.

It's more than printf. There are several I/O routines in stdio, and all use FILE* for both input and output. If a D application mixes calls to C APIs that do I/O with stdin, stdout, and stderr, we need to take a stance on what should happen.

> In that case, we are no longer limited to FILE*'s capabilities for other I/O types (e.g. sockets, IPC).

I think File does not need to be inextricably linked to FILE*.

>> The element type of the range would be ubyte[]. The number of bytes transferred at a step should be settable via another primitive. So:
>>
>> struct SeekableBufRange {
>>   // Range primitives
>>   @property bool empty();
>>   @property ubyte[] front();
>>   void popFront();
>>   // Extra primitives
>>   @property size_t bufsize();
>>   @property void bufsize(size_t);
>>   @property ulong position();
>>   @property void position(ulong);
>> }
>>
>> How's that sound? This is one range that File could expose directly.
>
> Horrible.  You are replacing a single function (rawRead) with all these functions:
>
> empty()
> front()
> popFront()
> bufsize()
> bufsize(size_t)

I don't think that accurately represents what's going on. rawRead does need a fair amount of paraphernalia to work. For example:

// Consume input using rawRead
auto buffer = new ubyte[1024];
size_t read;
while ((read = input.rawRead(buffer).length) > 0) {
    auto usable = buffer[0 .. read];
    ... use usable ...
}

Not that elegant. Compare and contrast with:

// Consume input using a range
foreach (buffer; input.byChunk(1024)) {
     ... use buffer ...
}

// Consume input straight from a range
input.bufsize = 1024;
foreach (buffer; input) {
     ... use buffer ...
}

> That doesn't even cover the awkwardness of how the code now has to read N bytes (a single line with rawRead):

I think "awkwardness" doesn't describe it.

> // read N bytes
> source.bufsize = N;
> auto data = source.front();
> source.popFront();

I think it's more often to want to consume stuff in a stream manner, as opposed to attempting to read some isolated bits. Ranges are optimized for the former.

> And it also doesn't cover how you now have to tack on these functions to standard memory ranges.  Or how the stream-based range has to handle awkward situations where someone might call front several times before calling popFront (not possible with rawRead).  Or how you have no control over the inevitable buffering scheme required to support such awkwardness.

We need to figure out all this stuff together, but so far I'm not at all convinced that seekable ranges are awkward.
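
To make the discussion concrete, here is a minimal sketch of how the proposed primitives could behave over an in-memory buffer. MemSeekableBufRange and its constructor are assumptions for illustration; nothing like this exists in Phobos, and a file-backed version would also have to decide who owns the buffer:

```d
/// Memory-backed stand-in for the proposed SeekableBufRange.
struct MemSeekableBufRange
{
    private const(ubyte)[] data;
    private size_t pos;        // start of the current chunk
    private size_t chunk = 1;  // bytes handed out per step

    this(const(ubyte)[] source, size_t bufsize)
    {
        data = source;
        chunk = bufsize;
    }

    // Range primitives
    @property bool empty() const { return pos >= data.length; }

    @property const(ubyte)[] front() const
    {
        // The last chunk may be shorter than bufsize.
        auto end = pos + chunk < data.length ? pos + chunk : data.length;
        return data[pos .. end];
    }

    void popFront() { pos += front.length; }

    // Extra primitives
    @property size_t bufsize() const { return chunk; }
    @property void bufsize(size_t n) { assert(n > 0); chunk = n; }
    @property ulong position() const { return pos; }
    @property void position(ulong p)
    {
        assert(p <= data.length);
        pos = cast(size_t) p;
    }
}
```

In this sketch, calling front repeatedly is harmless because it only slices memory, but changing bufsize between front and popFront changes how many bytes popFront skips.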


Andrei
April 08, 2010
On Thu, Apr 08, 2010 at 11:23:29AM -0700, Steve Schveighoffer wrote:
> I think we can probably come up with an abstraction layer that uses FILE* only when dealing with standard handles.

I've got an idea; let me throw it out here and see what you all think.

Let's say File (or some internal component of it) was changed to be a template, taking one of three types: FILE*, int, or ubyte[]/some ubyte returning range.

The one taking FILE* does what we have now. The one taking the int wraps the low-level operating system handle, and the ubyte[] one does it for memory.

Let me give an example:

struct FileImpl(BASE) {
    T[] rawRead(T)(T[] buffer)
    {
        enforce(buffer.length);
        size_t result; // number of elements of T actually read
        static if (is(BASE == FILE*)) {
            // C stdio stream: fread already counts in elements
            result = .fread(buffer.ptr, T.sizeof, buffer.length, p.handle);
        } else static if (is(BASE == int)) {
            // OS-level descriptor: read counts in bytes and returns -1 on error
            auto n = posix.read(p.handle, buffer.ptr, T.sizeof * buffer.length);
            errnoEnforce(n >= 0);
            result = n / T.sizeof;
        } else static if (is(BASE == ubyte[])) {
            // In-memory contents (assumes T is ubyte): copy, then advance
            if (p.contents.length < buffer.length) {
                buffer[0 .. p.contents.length] = p.contents;
                result = p.contents.length;
                p.contents.length = 0;
            } else {
                buffer[] = p.contents[0 .. buffer.length];
                result = buffer.length;
                p.contents = p.contents[buffer.length .. $];
            }
        } else static assert(0, "Unsupported operation on underlying file");

        errnoEnforce(!error);
        return result ? buffer[0 .. result] : null;
    }
}

The list of static ifs is really ugly. I'd prefer to put the primitives for each type together somewhere, but I'm not sure how to best do that.
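
One way to group the primitives per type, sketched under the assumption that each underlying handle gets its own small backend struct with a read() method. The names CStreamBackend, MemoryBackend and FileOver are hypothetical; the front end is then written once and the static if chain goes away:

```d
import core.stdc.stdio : FILE, fread;

/// Hypothetical backend wrapping a C stream.
struct CStreamBackend
{
    FILE* handle;

    size_t read(ubyte[] dest)
    {
        return fread(dest.ptr, 1, dest.length, handle);
    }
}

/// Hypothetical backend reading from memory.
struct MemoryBackend
{
    const(ubyte)[] contents;

    size_t read(ubyte[] dest)
    {
        auto n = dest.length < contents.length ? dest.length : contents.length;
        dest[0 .. n] = contents[0 .. n];   // copy what is available
        contents = contents[n .. $];       // and advance past it
        return n;
    }
}

/// The front end only relies on the backend's read().
struct FileOver(Backend)
{
    Backend backend;

    ubyte[] rawRead(ubyte[] buffer)
    {
        auto n = backend.read(buffer);
        return n ? buffer[0 .. n] : null;
    }
}
```

A descriptor-based backend would follow the same shape, with error handling kept next to the system call it belongs to.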

Anyway, the ideal end result would look like this:

auto stdin = FileImpl!(FILE*)(std.c.stdin);
auto socket = FileImpl!(int)(openSocket("example.com", 80));
auto memory = FileImpl!(ubyte[])(cast(ubyte[]) "hello, world"); // cast needed so
// it matches the template and so it doesn't try to call the filename constructor

Then, they all work the same way to the outside observer.

To keep the easy

       auto file = File("file.txt");

working, we can just:

alias FileImpl!(FILE*) File;

Or whatever underlying implementation ends up working the best for generic use.

File wouldn't be the same as a range, but it can take certain ones and give out a variety of them, so it is still pretty compatible with them while being able to do file specific stuff efficiently as well.

BTW, something I think is important is to have at least some capability of non-blocking calls, but this capability can be limited to just rawRead and rawWrite to be good enough for me.
April 08, 2010



----- Original Message ----
> From: Andrei Alexandrescu <andrei at erdani.com>
> 
> On 04/08/2010 01:23 PM, Steve Schveighoffer wrote:
> The network">
> >> The network socket is not a range, it's a File, and File does have primitives such as rawWrite and rawRead, which we can add to and improve.
> >>
> >> File offers ranges, but you're not required to use them.
> >
> > That's not what I read from Walter's comment...  He indicated that something like an e.g. zip library should take a range as input.  This implies that all streams are shoehorned into range form.
>
> If the zip library works with ranges, we can use it for transparently handling in-memory zip manipulation and also zip file manipulation.

Yes, from a library perspective, everything as a range works well.  The problem is, does the range interface lend itself well to things that need streams, like zip.  Basically, you didn't answer the 'if zip can use ranges' part.  That's the part I'm more concerned about.

> Makes sense. I'm just a bit">
> >> Makes sense. I'm just a bit worried about stdio's poor buffering interface. It only offers setvbuf(), which is quite opaque.
> >
> > The only reason to use FILE * as the underlying implementation is to be compatible with C's (f)printf.  It makes sense that you only need that compatibility for printing to a standard handle.  I think we can probably come up with an abstraction layer that uses FILE* only when dealing with standard handles.
>
> It's more than printf. There are several I/O routines in stdio, and all use FILE* for both input and output. If a D application mixes calls to C APIs that do I/O with stdin, stdout, and stderr, we need to take a stance on what should happen.

But I'm saying, the times where we need to intermingle with C are only for the standard handles, it seems that's what you're saying also, but you worded it in a way that makes it sound like you disagree with me...  Confused.

> I don't think that accurately represents what's going on. rawRead does need a fair amount of paraphernalia to work. For example:

> // Consume input using rawRead
> auto buffer = new ubyte[1024];
> size_t read;
> while ((read = input.rawRead(buffer).length) > 0) {
>     auto usable = buffer[0 .. read];
>     ... use usable ...
> }
>
> Not that elegant. Compare and contrast with:
>
> // Consume input using a range
> foreach (buffer; input.byChunk(1024)) {
>     ... use buffer ...
> }
>
> // Consume input straight from a range
> input.bufsize = 1024;
> foreach (buffer; input) {
>     ... use buffer ...
> }

Yes, if your application processes 1024 bytes at a time, it is easier to use a range.  That's not the application I'm referring to.  The application I'm talking about is when you need to read a different amount of bytes per read, such as a varying length packet.  This is not an uncommon situation.

Let's look at that version with your range:

while(!input.empty())
{
   input.bufsize = numtoread;
   input.popFront();
   auto data = input.front();

   // process data.
}

and with File's rawRead:

ubyte buf[MAXSIZE];
ubyte[] data;
while((data = input.rawRead(buf[0..numtoread])).length)
{
   // process data.
}

And look, we can use the stack for buffering!  Plus, we don't have to worry about whether the data buffer will be overwritten, we control what buffer is used by the input object, so we can manage that less defensively.

Also, let's not forget that you can easily bolt an input range interface on top of a file interface (as evidenced by byChunk), but you can't do the opposite.  For example, reading a packet at a time from a network/file stream given a length can easily be implemented with a range on top of a File struct, but not easily with a range on top of a range.
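
A small sketch of that layering, assuming a source that exposes nothing but a rawRead-style call. MemorySource and ChunkRange are hypothetical names, and MemorySource only stands in for a real file or socket. The range owns no storage of its own: the single buffer is supplied by the caller.

```d
/// Hypothetical source exposing only a rawRead-style primitive.
struct MemorySource
{
    const(ubyte)[] contents;

    ubyte[] rawRead(ubyte[] buffer)
    {
        auto n = buffer.length < contents.length ? buffer.length : contents.length;
        buffer[0 .. n] = contents[0 .. n];
        contents = contents[n .. $];
        return buffer[0 .. n];
    }
}

/// Input range of chunks layered over any Source that has rawRead.
struct ChunkRange(Source)
{
    private Source* source;
    private ubyte[] buffer;   // the only buffer, supplied by the caller
    private ubyte[] current;  // slice of buffer holding the current chunk

    this(Source* source, ubyte[] buffer)
    {
        this.source = source;
        this.buffer = buffer;
        popFront();           // read the first chunk
    }

    @property bool empty() const { return current.length == 0; }
    @property ubyte[] front() { return current; }
    void popFront() { current = source.rawRead(buffer); }
}
```

Each popFront overwrites the previous chunk, so a caller that needs to keep data around has to copy it out first.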

> > // read N bytes
> > source.bufsize = N;
> > auto data = source.front();
> > source.popFront();
>
> I think it's more often to want to consume stuff in a stream manner, as opposed to attempting to read some isolated bits. Ranges are optimized for the former.

So essentially, the idea is to double-buffer the data, once inside the range (to support the front/popFront regime) and once for your application, so you can build up enough "chunks" to read the data correctly?  I don't see how this moves us towards high performance.  One litmus test for this is, if whatever we come up with uses more than one buffer, it is not good enough.

> We need to figure out all this stuff together, but so far I'm not at all convinced that seekable ranges are awkward.

I may not have explained myself well, I don't have a big problem with seekable ranges for certain applications, I just don't think they are the primitive that should be used for all applications.

-Steve




April 08, 2010
On Thu, Apr 08, 2010 at 03:55:43PM -0400, Adam D. Ruppe wrote:
> alias FileImpl!(FILE*) File;

Problem with this: outside functions can't just take a File and expect it to work. It isn't a deal breaker, since these functions could be changed to simple templates, but it doesn't seem ideal.

Files based on different implementations being different types seems like a good
idea, since fclose() won't work on a range (for example), so assigning them to
each other won't work.

I guess it either has to be an interface and classes, or used as templates, to make this idea work.
April 08, 2010
Somehow the quotes got messed up pretty badly during this exchange. I'll trim most of the content; please refer to the original message if you come to this later.

On 04/08/2010 03:01 PM, Steve Schveighoffer wrote:
> But I'm saying, the times where we need to intermingle with C are only for the standard handles, it seems that's what you're saying also, but you worded it in a way that makes it sound like you disagree with me...  Confused.

I worded it in a way that clarifies it's about more than printf.

> Yes, if your application processes 1024 bytes at a time, it is easier to use a range.  That's not the application I'm referring to.  The application I'm talking about is when you need to read a different amount of bytes per read, such as a varying length packet.  This is not an uncommon situation.

I think the best way to make progress is to compare the range I suggested with another one. For example, if having rawRead as a range primitive is necessary, by all means let's make rawRead part of it.

> Let's look at that version with your range:
>
> while(!input.empty())
> {
>    input.bufsize = numtoread;
>    input.popFront();
>    auto data = input.front();
>
>    // process data.
> }
>
> and with File's rawRead:
>
> ubyte buf[MAXSIZE];
> ubyte[] data;
> while((data = input.rawRead(buf[0..numtoread])).length)
> {
>    // process data.
> }
>
> And look, we can use the stack for buffering!  Plus, we don't have to worry about whether the data buffer will be overwritten, we control what buffer is used by the input object, so we can manage that less defensively.

This is a weak argument. Buffer allocation is hardly a bottleneck for streaming applications. The code above would not work in SafeD.

> Also, let's not forget that you can easily bolt an input range interface on top of a file interface (as evidenced by byChunk), but you can't do the opposite.  For example, reading a packet at a time from a network/file stream given a length can easily be implemented with a range on top of a File struct, but not easily with a range on top of a range.

Indeed. A range that adapts variable-length packets to fixed-length packets would need to know some more details. I'm not sure we need such an abstraction, but if we do, we can define it.

> So essentially, the idea is to double-buffer the data, once inside the range (to support the front/popFront regime) and once for your application, so you can build up enough "chunks" to read the data correctly?

We are in agreement that we shouldn't be doing that.

>> We need to figure out all this stuff together, but so far I'm not at all convinced that seekable ranges are awkward.
>
> I may not have explained myself well, I don't have a big problem with seekable ranges for certain applications, I just don't think they are the primitive that should be used for all applications.

I confess I'm not sure what you want and how to get from where we are to where you want us to be.


Andrei
April 08, 2010
On 04/08/2010 03:14 PM, Adam D. Ruppe wrote:
> On Thu, Apr 08, 2010 at 03:55:43PM -0400, Adam D. Ruppe wrote:
>> alias FileImpl!(FILE*) File;
>
> Problem with this: outside functions can't just take a File and expect it to work. It isn't a deal breaker, since these functions could be changed to simple templates, but it doesn't seem ideal.

I was about to write that. We'd be forcing client code to be templated. This is a hindrance that would need substantial benefit to justify it.

> Files based on different implementations being different types seems like a good
> idea, since fclose() won't work on a range (for example), so assigning them to
> each other won't work.

Let's not forget that File isn't a range. Let's call it a "stream handle" that can _offer_ a number of ranges with various capabilities (byLine, byDchar, byWchar, byChar, byChunk). I envision how we could add ranges such as byPacket (variable-length packets returned as they come off the wire) and byWhateverAbstractionIsNeededByZip.

> I guess it either has to be an interface and classes, or used as templates, to make this idea work.

In order to ensure refcounting and timely closing, File needs to be an object. It could use the pimpl idiom inside in conjunction with a class hierarchy.
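
A minimal sketch of that shape, assuming a hand-rolled reference count. StreamImpl, NullStreamImpl and StreamHandle are hypothetical names and this is not the actual File implementation. Client code takes the non-templated handle, while the class hierarchy behind it can vary:

```d
/// Hypothetical implementation hierarchy hidden behind the handle.
abstract class StreamImpl
{
    abstract ubyte[] rawRead(ubyte[] buffer);
    abstract void close();
}

/// Trivial implementation that is always at end of stream.
final class NullStreamImpl : StreamImpl
{
    bool closed;

    override ubyte[] rawRead(ubyte[] buffer) { return buffer[0 .. 0]; }
    override void close() { closed = true; }
}

/// Non-templated, reference-counted handle (pimpl).
struct StreamHandle
{
    private StreamImpl impl;
    private size_t* refs;

    this(StreamImpl impl)
    {
        this.impl = impl;
        refs = new size_t;
        *refs = 1;
    }

    this(this) { if (refs) ++*refs; }  // a copy shares the same stream

    ~this()
    {
        // The last owner going out of scope closes the stream.
        if (refs && --*refs == 0) impl.close();
    }

    ubyte[] rawRead(ubyte[] buffer) { return impl.rawRead(buffer); }
}
```

The cost compared with the template approach is a virtual call per operation, which is small next to the I/O itself when reads are done in chunks.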


Andrei