Thread overview
[phobos] Bug 3763 (How to fix readlnImpl())
Feb 04, 2010
David Simcha
Feb 04, 2010
David Simcha
Feb 04, 2010
David Simcha
February 03, 2010
I recently filed bug 3673 (http://d.puremagic.com/issues/show_bug.cgi?id=3763) and started looking at possible fixes.  This is a bug in std.stdio.readlnImpl().  However, the more I think about it the more I think the function needs to be completely rethought, not just patched.  I'm not sure if this is a high priority given that it has nothing to do with the language spec and TDPL but it's a pretty embarrassing quality of implementation issue.

Anyhow, readlnImpl() takes a ref char[] and tries to recycle the memory to read in another line.  In doing so, it queries GC.capacity for the relevant memory block and resizes the array to the size of the memory block.  This is arguably unsafe in the general case because, if someone passes in a slice that starts at the beginning of a GC block to use as the buffer, everything else in the same GC block can get overwritten. This massively violates the principle of least surprise.  On the other hand, when encapsulated in the higher level API of byLine(), it's both safe and efficient.

A second issue is that the GCC_IO and GENERIC_IO versions use the following idiom:

buf.length = 0;
// append to buf

This, IIRC, isn't going to result in the memory block being recycled anymore once Steve's changes to the way arrays work are checked in.  The best thing to do IMHO is to solve the general case of this problem by adding a useExistingBuffer() function to std.array.Appender.  This would take an existing array of elements that must be mutable, set the Appender's capacity field to the array's length (NOT its capacity as reported by the GC, to avoid stomping on arrays that are not owned by the current scope), and then work like it does now.

Thoughts?
February 03, 2010
David Simcha wrote:
> I recently filed bug 3673 (http://d.puremagic.com/issues/show_bug.cgi?id=3763) and started looking at possible fixes.  This is a bug in std.stdio.readlnImpl().  However, the more I think about it the more I think the function needs to be completely rethought, not just patched.  I'm not sure if this is a high priority given that it has nothing to do with the language spec and TDPL but it's a pretty embarrassing quality of implementation issue.
> 
> Anyhow, readlnImpl() takes a ref char[] and tries to recycle the memory to read in another line.  In doing so, it queries GC.capacity for the relevant memory block and resizes the array to the size of the memory block.  This is arguably unsafe in the general case because, if someone passes in a slice that starts at the beginning of a GC block to use as the buffer, everything else in the same GC block can get overwritten. This massively violates the principle of least surprise.  On the other hand, when encapsulated in the higher level API of byLine(), it's both safe and efficient.

Could you please give an example?

> A second issue is that the GCC_IO and GENERIC_IO versions use the following idiom:
> 
> buf.length = 0;
> // append to buf
> 
> This, IIRC, isn't going to result in the memory block being recycled anymore once Steve's changes to the way arrays work are checked in.  The best thing to do IMHO is to solve the general case of this problem by adding a useExistingBuffer() function to std.array.Appender.  This would take an existing array of elements that must be mutable, set the Appender's capacity field to the array's length (NOT its capacity as reported by the GC, to avoid stomping on arrays that are not owned by the current scope), and then work like it does now.
> 
> Thoughts?

Appender should have had that flag for a while now. Would be great if you added it.


Thanks,

Andrei
February 04, 2010
On Thu, Feb 4, 2010 at 12:10 AM, Andrei Alexandrescu <andrei at erdani.com>wrote:

> David Simcha wrote:
>
>> I recently filed bug 3673 (
>> http://d.puremagic.com/issues/show_bug.cgi?id=3763) and started looking
>> at possible fixes.  This is a bug in std.stdio.readlnImpl().  However, the
>> more I think about it the more I think the function needs to be completely
>> rethought, not just patched.  I'm not sure if this is a high priority given
>> that it has nothing to do with the language spec and TDPL but it's a pretty
>> embarrassing quality of implementation issue.
>>
>> Anyhow, readlnImpl() takes a ref char[] and tries to recycle the memory to read in another line.  In doing so, it queries GC.capacity for the relevant memory block and resizes the array to the size of the memory block.  This is arguably unsafe in the general case because, if someone passes in a slice that starts at the beginning of a GC block to use as the buffer, everything else in the same GC block can get overwritten.  This massively violates the principle of least surprise.  On the other hand, when encapsulated in the higher level API of byLine(), it's both safe and efficient.
>>
>
> Could you please give an example?


import std.stdio;

void main() {
    auto writer = File("foo.txt", "wb");
    foreach(i; 0..1000) {
        writer.write('a');
    }
    writer.writeln();
    writer.close();

    auto chars = new char[500];
    chars[] = 'b';
    auto buf = chars[0..100];

    auto reader = File("foo.txt", "rb");
    reader.readln(buf);
    writeln(chars[$ - 1]);  // a
    assert(chars[$ - 1] == 'b');  // FAILS.
}
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20100204/0fb2c442/attachment.htm>
February 04, 2010
So it seems like due to the low-level APIs that readlnImpl calls, it flies under the radar of Steve's detection.

I think using an Appender with the useExistingBuffer primitive would work, but would be suboptimal - each read can only use as many characters as read the last time, which would trigger quite a few allocation (whenever the line size increases). Please let me know whether this assessment is correct.

A possible solution would be to cache the last buffer and its maximum length in static variables inside readlnImpl. I think it's reasonable to say that readln is the owner of the char[] and you shouldn't take portions of it and expect they won't be changed. However, it will not trump over existing arrays.

Andrei

David Simcha wrote:
> 
> 
> On Thu, Feb 4, 2010 at 12:10 AM, Andrei Alexandrescu <andrei at erdani.com <mailto:andrei at erdani.com>> wrote:
> 
>     David Simcha wrote:
> 
>         I recently filed bug 3673
>         (http://d.puremagic.com/issues/show_bug.cgi?id=3763) and started
>         looking at possible fixes.  This is a bug in
>         std.stdio.readlnImpl().  However, the more I think about it the
>         more I think the function needs to be completely rethought, not
>         just patched.  I'm not sure if this is a high priority given
>         that it has nothing to do with the language spec and TDPL but
>         it's a pretty embarrassing quality of implementation issue.
> 
>         Anyhow, readlnImpl() takes a ref char[] and tries to recycle the
>         memory to read in another line.  In doing so, it queries
>         GC.capacity for the relevant memory block and resizes the array
>         to the size of the memory block.  This is arguably unsafe in the
>         general case because, if someone passes in a slice that starts
>         at the beginning of a GC block to use as the buffer, everything
>         else in the same GC block can get overwritten.  This massively
>         violates the principle of least surprise.  On the other hand,
>         when encapsulated in the higher level API of byLine(), it's both
>         safe and efficient.
> 
> 
>     Could you please give an example?
> 
> 
> import std.stdio;
> 
> void main() {
>     auto writer = File("foo.txt", "wb");
>     foreach(i; 0..1000) {
>         writer.write('a');
>     }
>     writer.writeln();
>     writer.close();
> 
>     auto chars = new char[500];
>     chars[] = 'b';
>     auto buf = chars[0..100];
> 
>     auto reader = File("foo.txt", "rb");
>     reader.readln(buf);
>     writeln(chars[$ - 1]);  // a
>     assert(chars[$ - 1] == 'b');  // FAILS.
> }
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
February 04, 2010
The other thing is, how many people are really going to use readln(char[] buf) directly instead of using the version that returns a newly allocated string or using byLine()?  If they're using the version that returns a newly allocated string, this is a non-issue.  If they're using byLine(), then a lot of the logic for recycling buffers could be moved there, where it is safer because it is well-encapsulated.

On Thu, Feb 4, 2010 at 12:02 PM, Andrei Alexandrescu <andrei at erdani.com>wrote:

> So it seems like due to the low-level APIs that readlnImpl calls, it flies under the radar of Steve's detection.
>
> I think using an Appender with the useExistingBuffer primitive would work, but would be suboptimal - each read can only use as many characters as read the last time, which would trigger quite a few allocation (whenever the line size increases). Please let me know whether this assessment is correct.
>
> A possible solution would be to cache the last buffer and its maximum length in static variables inside readlnImpl. I think it's reasonable to say that readln is the owner of the char[] and you shouldn't take portions of it and expect they won't be changed. However, it will not trump over existing arrays.
>
> Andrei
>
> David Simcha wrote:
>
>>
>>
>> On Thu, Feb 4, 2010 at 12:10 AM, Andrei Alexandrescu <andrei at erdani.com<mailto: andrei at erdani.com>> wrote:
>>
>>    David Simcha wrote:
>>
>>        I recently filed bug 3673
>>        (http://d.puremagic.com/issues/show_bug.cgi?id=3763) and started
>>        looking at possible fixes.  This is a bug in
>>        std.stdio.readlnImpl().  However, the more I think about it the
>>        more I think the function needs to be completely rethought, not
>>        just patched.  I'm not sure if this is a high priority given
>>        that it has nothing to do with the language spec and TDPL but
>>        it's a pretty embarrassing quality of implementation issue.
>>
>>        Anyhow, readlnImpl() takes a ref char[] and tries to recycle the
>>        memory to read in another line.  In doing so, it queries
>>        GC.capacity for the relevant memory block and resizes the array
>>        to the size of the memory block.  This is arguably unsafe in the
>>        general case because, if someone passes in a slice that starts
>>        at the beginning of a GC block to use as the buffer, everything
>>        else in the same GC block can get overwritten.  This massively
>>        violates the principle of least surprise.  On the other hand,
>>        when encapsulated in the higher level API of byLine(), it's both
>>        safe and efficient.
>>
>>
>>    Could you please give an example?
>>
>>  import std.stdio;
>>
>> void main() {
>>    auto writer = File("foo.txt", "wb");
>>    foreach(i; 0..1000) {
>>        writer.write('a');
>>    }
>>    writer.writeln();
>>    writer.close();
>>
>>    auto chars = new char[500];
>>    chars[] = 'b';
>>    auto buf = chars[0..100];
>>
>>    auto reader = File("foo.txt", "rb");
>>    reader.readln(buf);
>>    writeln(chars[$ - 1]);  // a
>>    assert(chars[$ - 1] == 'b');  // FAILS.
>> }
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>> _______________________________________________
>> phobos mailing list
>> phobos at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/phobos
>>
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20100204/3b2ef485/attachment-0001.htm>
February 04, 2010
I think readln(ref char[] buf) is useful for the sake of low-level efficiency (irony of the current state of affairs notwithstanding). If I were to make an executive decision now, I'd say let's fix the efficiency issue on version(DMD_IO) and leave things as they are for the time being.

Andrei

David Simcha wrote:
> The other thing is, how many people are really going to use readln(char[] buf) directly instead of using the version that returns a newly allocated string or using byLine()?  If they're using the version that returns a newly allocated string, this is a non-issue.  If they're using byLine(), then a lot of the logic for recycling buffers could be moved there, where it is safer because it is well-encapsulated.
> 
> On Thu, Feb 4, 2010 at 12:02 PM, Andrei Alexandrescu <andrei at erdani.com <mailto:andrei at erdani.com>> wrote:
> 
>     So it seems like due to the low-level APIs that readlnImpl calls, it
>     flies under the radar of Steve's detection.
> 
>     I think using an Appender with the useExistingBuffer primitive would
>     work, but would be suboptimal - each read can only use as many
>     characters as read the last time, which would trigger quite a few
>     allocation (whenever the line size increases). Please let me know
>     whether this assessment is correct.
> 
>     A possible solution would be to cache the last buffer and its
>     maximum length in static variables inside readlnImpl. I think it's
>     reasonable to say that readln is the owner of the char[] and you
>     shouldn't take portions of it and expect they won't be changed.
>     However, it will not trump over existing arrays.
> 
>     Andrei
> 
>     David Simcha wrote:
> 
> 
> 
>         On Thu, Feb 4, 2010 at 12:10 AM, Andrei Alexandrescu
>         <andrei at erdani.com <mailto:andrei at erdani.com>
>         <mailto:andrei at erdani.com <mailto:andrei at erdani.com>>> wrote:
> 
>            David Simcha wrote:
> 
>                I recently filed bug 3673
>                (http://d.puremagic.com/issues/show_bug.cgi?id=3763) and
>         started
>                looking at possible fixes.  This is a bug in
>                std.stdio.readlnImpl().  However, the more I think about
>         it the
>                more I think the function needs to be completely
>         rethought, not
>                just patched.  I'm not sure if this is a high priority given
>                that it has nothing to do with the language spec and TDPL but
>                it's a pretty embarrassing quality of implementation issue.
> 
>                Anyhow, readlnImpl() takes a ref char[] and tries to
>         recycle the
>                memory to read in another line.  In doing so, it queries
>                GC.capacity for the relevant memory block and resizes the
>         array
>                to the size of the memory block.  This is arguably unsafe
>         in the
>                general case because, if someone passes in a slice that
>         starts
>                at the beginning of a GC block to use as the buffer,
>         everything
>                else in the same GC block can get overwritten.  This
>         massively
>                violates the principle of least surprise.  On the other hand,
>                when encapsulated in the higher level API of byLine(),
>         it's both
>                safe and efficient.
> 
> 
>            Could you please give an example?
> 
>          import std.stdio;
> 
>         void main() {
>            auto writer = File("foo.txt", "wb");
>            foreach(i; 0..1000) {
>                writer.write('a');
>            }
>            writer.writeln();
>            writer.close();
> 
>            auto chars = new char[500];
>            chars[] = 'b';
>            auto buf = chars[0..100];
> 
>            auto reader = File("foo.txt", "rb");
>            reader.readln(buf);
>            writeln(chars[$ - 1]);  // a
>            assert(chars[$ - 1] == 'b');  // FAILS.
>         }
> 
> 
>         ------------------------------------------------------------------------
> 
> 
>         _______________________________________________
>         phobos mailing list
>         phobos at puremagic.com <mailto:phobos at puremagic.com>
>         http://lists.puremagic.com/mailman/listinfo/phobos
> 
>     _______________________________________________
>     phobos mailing list
>     phobos at puremagic.com <mailto:phobos at puremagic.com>
>     http://lists.puremagic.com/mailman/listinfo/phobos
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos