Jump to page: 1 2
Thread overview
std.file.read implementation contest
Feb 16, 2009
Sean Kelly
Feb 17, 2009
Sean Kelly
Feb 17, 2009
bearophile
Feb 17, 2009
Denis Koroskin
Feb 17, 2009
Sean Kelly
February 16, 2009
Someone mentioned an old bug in std.file.read here:

http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Andrei


// Implementation 1
void[] read(string name)
{
    invariant fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
    cenforce(fd != -1, name);
    scope(exit) std.c.linux.linux.close(fd);

    struct_stat statbuf = void;
    cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

    void[] buf;
    auto size = statbuf.st_size;
    if (size == 0)
    {	/* The size could be 0 if the file is a device or a procFS file,
	 * so we just have to try reading it.
	 */
	int readsize = 1024;
	while (1)
	{
	    buf = GC.realloc(buf.ptr, size + readsize, GC.BlkAttr.NO_SCAN)[0 .. cast(int)size + readsize];
	    enforce(buf, "Out of memory");
	    scope(failure) delete buf;

	    auto toread = readsize;
	    while (toread)
	    {
		auto numread = std.c.linux.linux.read(fd, buf.ptr + size, toread);
		cenforce(numread != -1, name);
		size += numread;
		if (numread == 0)
		{   if (size == 0)			// it really was 0 size
			delete buf;			// don't need the buffer
		    return buf[0 .. size];		// end of file
		}
		toread -= numread;
	    }
	}
    }
    else
    {
	buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
	enforce(buf, "Out of memory");
	scope(failure) delete buf;

	cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name);

	return buf[0 .. size];
    }
}

// Implementation 2
void[] read(string name)
{
    immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
    cenforce(fd != -1, name);
    scope(exit) std.c.linux.linux.close(fd);

    struct_stat statbuf = void;
    cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

    immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;
    void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
        [0 .. initialAlloc];
    scope(failure) delete result;
    size_t size = 0;

    for (;;)
    {
        immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                result.length - size);
        cenforce(actual != actual.max, name);
        size += actual;
        if (size < result.length) break;
        auto newAlloc = size + 1024 * 4;
        result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
            [0 .. newAlloc];
    }

    return result[0 .. size];
}
February 16, 2009
"Andrei Alexandrescu" wrote
> Someone mentioned an old bug in std.file.read here:
>
> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
>
> Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Implementation 2, save 1 bug:

> Andrei
>
>
> // Implementation 1
...

>     else
>     {
> buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
> enforce(buf, "Out of memory");
> scope(failure) delete buf;
>
> cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name);
>
> return buf[0 .. size];
>     }
> }

This doesn't work for /sys files which consistently say they are 4096 bytes large, even though they can be smaller or larger (encountered this with Tango also).

I think you have to kill the check that the size read actually equals the size in stat, AND you have to continue reading even after you reach that size.

>
> // Implementation 2
> void[] read(string name)
> {
>     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
>     cenforce(fd != -1, name);
>     scope(exit) std.c.linux.linux.close(fd);
>
>     struct_stat statbuf = void;
>     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);
>
>     immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;

Excellent idea, this allocates the minimum amount of space to tell if a file is larger than the file size returned in fstat.  This should handle the /sys case and /proc case properly.

>     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
>         [0 .. initialAlloc];
>     scope(failure) delete result;
>     size_t size = 0;
>
>     for (;;)
>     {
>         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
>                 result.length - size);
>         cenforce(actual != actual.max, name);
>         size += actual;
>         if (size < result.length) break;

Only bug:
You need to check if actual > 0.  A driver is allowed to return less data
than requested, even if there is more to read.  You must read until read
returns 0.  In this case you need a loop to do the reading.  Yes the disk
driver never returns less than requested unless EOF is reached, but the
/proc files are run by any driver, which might do something like that.

>         auto newAlloc = size + 1024 * 4;
>         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
>             [0 .. newAlloc];
>     }
>
>     return result[0 .. size];
> }

I would rewrite:

void[] read(string name)
{
    immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
    cenforce(fd != -1, name);
    scope(exit) std.c.linux.linux.close(fd);

    struct_stat statbuf = void;
    cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

    auto allocSize = statbuf.st_size ? statbuf.st_size + 1 : 1024;
    void[] result;
    scope(failure) delete result;
    size_t size = 0;

outer:
    while(true)
    {
        result = GC.realloc(result.ptr, allocSize, GC.BlkAttr.NO_SCAN)
            [0 .. allocSize];
        while(size < allocSize)
        {
            immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                    allocSize - size);
            if(actual == 0)
                break outer;
            cenforce(actual != actual.max, name);
            size += actual;
        }
        allocSize = size + 1024 * 4;
    }

    return result[0 .. size];
}


February 16, 2009
Steven Schveighoffer wrote:
> "Andrei Alexandrescu" wrote
>> Someone mentioned an old bug in std.file.read here:
>>
>> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
>>
>> Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)
> 
> Implementation 2, save 1 bug:
[snip]

Aha, cool. Thanks for the info. I've adapted the code to still only use one loop:

void[] read(string name)
{
    immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
    cenforce(fd != -1, name);
    scope(exit) std.c.linux.linux.close(fd);

    struct_stat statbuf = void;
    cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

    immutable initialAlloc = statbuf.st_size
        ? (statbuf.st_size + 16) & 15
        : 1024;
    void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
        [0 .. initialAlloc];
    scope(failure) delete result;
    size_t size = 0;

    for (;;)
    {
        immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                result.length - size);
        cenforce(actual != -1, name);
        if (actual == 0) break;
        size += actual;
        if (size < result.length) continue;
        auto newAlloc = size + 1024 * 4;
        result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
            [0 .. newAlloc];
    }

    return result[0 .. size];
}

One more improvement suggested by Walter - I allocate a multiple of 16 because that's allocator's granularity.


Andrei
February 16, 2009
"Andrei Alexandrescu" wrote
> Steven Schveighoffer wrote:
>> "Andrei Alexandrescu" wrote
>>> Someone mentioned an old bug in std.file.read here:
>>>
>>> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
>>>
>>> Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)
>>
>> Implementation 2, save 1 bug:
> [snip]
>
> Aha, cool. Thanks for the info. I've adapted the code to still only use one loop:
>
> void[] read(string name)
> {
>     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
>     cenforce(fd != -1, name);
>     scope(exit) std.c.linux.linux.close(fd);
>
>     struct_stat statbuf = void;
>     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);
>
>     immutable initialAlloc = statbuf.st_size
>         ? (statbuf.st_size + 16) & 15
>         : 1024;

Hm... won't this allocate only 15 bytes max if st_size is nonzero?

I think you meant:
(statbuf.st_size + 16) & ~15

Two more points:

if you allocate a size of 16, I think you'll actually get 32 bytes because of the sentinel byte.  Somehow you should account for that, or you automatically double the allocation size each time (or at least a page more than you want).

And the increments are not in 16, they are in powers of 2 *starting* with 16.  For example, if you allocate a size of 80 bytes (16 * 5), you will actually get 128 bytes.

Other than that, it looks good.

-Steve


February 16, 2009
Andrei Alexandrescu wrote:
> Someone mentioned an old bug in std.file.read here:
> 
> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ 
> 
> 
> Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Neither one.  std.read is intended to read the contents of a file in bulk into memory.  If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case).  Streaming operations should be used on unbounded virtual files, not std.read.


Sean
February 16, 2009
"Sean Kelly" wrote
> Andrei Alexandrescu wrote:
>> Someone mentioned an old bug in std.file.read here:
>>
>> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)
>
> Neither one.  std.read is intended to read the contents of a file in bulk into memory.  If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case).  Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :)  I think reading /proc and /sys files is an essential part of writing scripts and useful utilities.  std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once.  If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available.  Complaining that *nix isn't pure enough for the likes of D isn't very practical.

I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity.

-Steve


February 16, 2009
Steven Schveighoffer wrote:
> "Andrei Alexandrescu" wrote
>> Steven Schveighoffer wrote:
>>> "Andrei Alexandrescu" wrote
>>>> Someone mentioned an old bug in std.file.read here:
>>>>
>>>> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
>>>>
>>>> Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)
>>> Implementation 2, save 1 bug:
>> [snip]
>>
>> Aha, cool. Thanks for the info. I've adapted the code to still only use one loop:
>>
>> void[] read(string name)
>> {
>>     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
>>     cenforce(fd != -1, name);
>>     scope(exit) std.c.linux.linux.close(fd);
>>
>>     struct_stat statbuf = void;
>>     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);
>>
>>     immutable initialAlloc = statbuf.st_size
>>         ? (statbuf.st_size + 16) & 15
>>         : 1024;
> 
> Hm... won't this allocate only 15 bytes max if st_size is nonzero?
> 
> I think you meant:
> (statbuf.st_size + 16) & ~15

Stupid!!!

> Two more points:
> 
> if you allocate a size of 16, I think you'll actually get 32 bytes because of the sentinel byte.  Somehow you should account for that, or you automatically double the allocation size each time (or at least a page more than you want).
> 
> And the increments are not in 16, they are in powers of 2 *starting* with 16.  For example, if you allocate a size of 80 bytes (16 * 5), you will actually get 128 bytes.
> 
> Other than that, it looks good.

Guess I'll revert to +1. The effect is likely to be negligible anyway.

Thanks!


Andrei
February 16, 2009
Steven Schveighoffer wrote:
> "Sean Kelly" wrote
>> Andrei Alexandrescu wrote:
>>> Someone mentioned an old bug in std.file.read here:
>>>
>>> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)
>> Neither one.  std.read is intended to read the contents of a file in bulk into memory.  If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case).  Streaming operations should be used on unbounded virtual files, not std.read.
> 
> Then you have just created for yourself a never ending bug report :)  I think reading /proc and /sys files is an essential part of writing scripts and useful utilities.  std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once.  If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available.  Complaining that *nix isn't pure enough for the likes of D isn't very practical.
> 
> I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity.

I totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it", not "if the size according to fstat is the same as the actual file size, /proc stuff and concurrent access against the file notwithstanding, reads exactly that many bytes in a buffer and returns it. Otherwise, leaves you scratching your head." And yes, using fstat is just a little irrelevant implementation trick, not something that conditions the workings of read.

Andrei
February 17, 2009
Steven Schveighoffer wrote:
> "Sean Kelly" wrote
>> Andrei Alexandrescu wrote:
>>> Someone mentioned an old bug in std.file.read here:
>>>
>>> http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/ Two programmers sent in patches for the function. Which is to be committed and why? (Linux versions shown. Apologies for noisy line breaks.)
>> Neither one.  std.read is intended to read the contents of a file in bulk into memory.  If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case).  Streaming operations should be used on unbounded virtual files, not std.read.
> 
> Then you have just created for yourself a never ending bug report :)  I think reading /proc and /sys files is an essential part of writing scripts and useful utilities.  std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once.  If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available.  Complaining that *nix isn't pure enough for the likes of D isn't very practical.

I wasn't saying that at all.  But try using this approach on /dev/random, for example.  One of the great things about *nix is that _anything_ can be represented as a file and therefore operated on via the same scripting techniques, however, I think it's an open issue whether this behavior lies within or outside the design of std.read.

I do think one could argue that std.read should make a best effort and read from such files up to some imposed maximum byte count, but this still seems like stretching its intended purpose to me.


Sean
February 17, 2009
Andrei Alexandrescu wrote:
> 
> I totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it"

Okay, that's a fair definition.  So the correct behavior for reading an unbounded stream should be an out of memory error?  This would be entirely reasonable, but any failure conditions should be described as well.
« First   ‹ Prev
1 2