Thread overview
[phobos] readlnImpl (Again)
Apr 19, 2010: David Simcha
Apr 22, 2010: David Simcha
April 19, 2010
I've begun to realize that Steve's array stomping patch breaks some assumptions that std.stdio.readlnImpl made about how array appending works, leading to it being slower than molasses again.  Should I just put some assumeSafeAppend() calls in there to make it behave as it used to, even though the old behavior was unsafe, or do we want to completely redesign this code now that we want to guarantee no stomping?
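For reference, the unsafety at issue can be shown with a small standalone example (hypothetical, not the readlnImpl code itself): assumeSafeAppend tells the runtime that nothing past the slice's end is live, so a subsequent append happens in place and can stomp data another slice still references.

```d
void main()
{
    auto a = new char[10];
    a[] = '.';
    auto b = a[0 .. 5];
    b.assumeSafeAppend();  // promise: nothing past b's end is live
    b ~= 'x';              // appends in place, stomping a[5]
    assert(a[5] == 'x');   // the "unrelated" data was overwritten
}
```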
April 19, 2010
This is a tricky situation.  I don't think assumeSafeAppend is the right call.  Imagine something like this:

char[] data = new char[50];
data[40..45] = "hello";

f.readln(data[0..40]);

now, if the line extends past 40 characters, it will overwrite data[40..45].

I think the safe thing to do is not to use appending until the buffer is exhausted.  However, even in that case, I don't think append is the right move.  Appending one char at a time, while perhaps fast when the buffer can be extended in place, is still extremely slow compared to simply setting a char in an array.  A better solution is to grow the buffer's length directly when needed.  A handy runtime function would be one that extends an array to the full capacity of its memory block.

A quick (uncompiled) version of the relevant part:

                uint used = 0;
                char[4] encbuf;
                uint enclen;
                for (int c; (c = FGETWC(fp)) != -1; )
                {
                    if ((c & ~0x7F) == 0)
                    {
                        // ASCII fits in one char, no UTF-8 encoding needed
                        enclen = 1;
                        encbuf[0] = cast(char)c;
                    }
                    else
                    {
                        enclen = std.utf.encode(encbuf, cast(dchar)c);
                    }
                    if (used + enclen > buf.length)
                    {
                        /* this should be a runtime function */
                        buf.length = used + enclen;
                        buf.length = buf.capacity;
                    }
                    buf[used .. used + enclen] = encbuf[0 .. enclen];
                    used += enclen;   // advance past the bytes just written
                    if (c == terminator)
                        break;
                }
                if (ferror(fps))
                    StdioException();
                return used;


This is where a D-specific buffering solution would be extremely useful.  Instead of using FGETWC, you could simply gain access to the stream's buffer and search it for the terminator using std.algorithm.  Tango does this kind of thing.
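A rough sketch of that idea (lineLength is a made-up helper name, and it assumes readln would then slice or copy the line out of the buffer):

```d
import std.algorithm : countUntil;

// Given a window into the stream's internal buffer, return the number
// of bytes up to and including the terminator, or the whole window if
// the terminator hasn't been buffered yet.
size_t lineLength(const(char)[] window, char terminator)
{
    auto idx = window.countUntil(terminator);
    return idx < 0 ? window.length : cast(size_t)(idx + 1);
}
```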

-Steve


>
>From: David Simcha <dsimcha at gmail.com>
>To: Discuss the phobos library for D <phobos at puremagic.com>
>Sent: Mon, April 19, 2010 11:30:43 AM
>Subject: [phobos] readlnImpl (Again)
>
>I've begun to realize that Steve's array stomping patch breaks some assumptions that std.stdio.readlnImpl made about how array appending works, leading to it being slower than molasses again.  Should I just put some assumeSafeAppend() calls in there to make it behave as it used to, even though the old behavior was unsafe, or do we want to completely redesign this code now that we want to guarantee no stomping?
>



April 22, 2010
Guys, I'll let you make sure this is taken care of; I'm busy with the final touches for TDPL. Once I let it out of my hands, that will really be it.

Just a brief thought - I hope std.array.Appender still works with the new allocation regime. If it does, maybe it could be put to use for readln.
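For concreteness, the Appender-based accumulation might look something like this (a sketch of the idea, not the actual readln code; takeLine is a hypothetical name):

```d
import std.array : appender;

// Accumulate one line from 'input' into a freshly managed buffer.
char[] takeLine(const(char)[] input, char terminator = '\n')
{
    auto app = appender!(char[])();
    foreach (c; input)
    {
        app.put(c);          // Appender manages its own capacity,
                             // so no in-place stomping is possible
        if (c == terminator)
            break;
    }
    return app.data;
}
```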


Andrei

On 04/19/2010 11:32 AM, Steve Schveighoffer wrote:
> [...]
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
April 22, 2010
I was already using std.array.Appender, since I fixed this the first time.  I added the assumeSafeAppend calls in my own source tree, but I didn't commit them, both because I wanted input on whether we still want to allow stomping and because, for some reason, it's still horribly slow.  Whether Appender still does what it's supposed to is a good question.

On Thu, Apr 22, 2010 at 12:44 PM, Andrei Alexandrescu <andrei at erdani.com> wrote:

> [...]