[Issue 4474] New: Safer stdin.byLine()
[Issue 4474] Better stdin.byLine()
July 16, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4474

           Summary: Safer stdin.byLine()
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: bearophile_hugs@eml.cc


--- Comment #0 from bearophile_hugs@eml.cc 2010-07-16 16:21:32 PDT ---
This relates to pages 16-17 of The D Programming Language, which explain stdin.byLine() and the 'rather hard to find' bugs that can result from not duplicating the input data.

When I write 20-line scripts in D, I really don't want to have to remember to dup everything (in D1 code I sometimes end up dupping too much, to be on the safe side). So I suggest a different API for line reading:

- stdin.byLineMutable() (or another similar name, longer than "byLine" that makes it clear it doesn't copy): for the current behaviour that avoids a memory allocation for each line read. This is faster but it's less safe.

- stdin.byLine(): allocates a new string for each line. This is safer and matches what Python does (Python also uses heuristics to speed this up as much as possible, because line reading is a very common and often performance-critical operation in scripts).

D's general design policy is that unsafe-but-faster operations must be asked for explicitly, while the defaults should be the less bug-prone choice. In a small D script I can use byLine() and hope to avoid some bugs. If profiling later shows that it's too slow, I can switch to the other method and carefully optimize the code, removing some heap allocations.

(An alternative design is to keep just the byLine() method but give it an
optional argument with a default value, such as stdin.byLine(bool copy=true)
or the compile-time variant stdin.byLine(bool COPY=true)(), so that by default
each line is copied into newly allocated memory.)
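
As a rough illustration of the copying variant proposed above, here is a minimal sketch written as a free function on top of the existing byLine(). The name byLineSafe is hypothetical and is not a Phobos API; it only shows the intended semantics, where every element owns its own memory.

```d
import std.algorithm : map;
import std.stdio : File, stdin, writeln;

/// Hypothetical helper (not part of Phobos): yields each line as a freshly
/// allocated immutable string, so an element stays valid after the next read.
auto byLineSafe(File f)
{
    // byLine() reuses its internal buffer; idup copies each line out of it.
    return f.byLine().map!(line => line.idup);
}

void main()
{
    string[] kept;
    foreach (line; stdin.byLineSafe())
        kept ~= line;   // safe to store: no element aliases the read buffer
    writeln(kept.length, " lines kept");
}
```

The cost is one allocation per line, which is exactly the overhead the non-copying variant would let performance-sensitive code avoid.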

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 17, 2010

Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrei@metalanguage.com


--- Comment #1 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-07-17 08:00:52 PDT ---
byLine is safe.

July 17, 2010

--- Comment #2 from bearophile_hugs@eml.cc 2010-07-17 08:29:20 PDT ---
OK, I changed the title to "Better" instead of "Safer".

July 17, 2010

--- Comment #3 from bearophile_hugs@eml.cc 2010-07-17 09:10:49 PDT ---
This is a small test program (dmd v2.047):

import std.string, std.stdio;
void main() {
    int[string] aa;
    foreach (line; stdin.byLine())
        foreach (word; line.split())
            aa[word]++;
    foreach (word, freq; aa)
        writeln(freq, " ", word);
}


Running it with its own source as input data:
test.exe < test.d


Prints:
1 eln(fr
1 q, " ", wo
1     writeln
1 }
1 "
1 }
1 }
1 writeln
2     wri
1    wri
1  ", word);
))
1 , w
1 q, " ", word);

1 eln(fr
1 q, "
1 freq,
1 ",
1 eln(freq, "
1  writeln(fr
1 word);
1 writeln(freq,
1 fre
1 e


This shows that byLine() is bug-prone (unsafe).


While this program:

import std.string, std.stdio;
void main() {
    int[string] aa;
    foreach (line; stdin.byLine())
        foreach (word; line.split())
            aa[word.dup]++;
    foreach (word, freq; aa)
        writeln(freq, " ", word);
}


Prints a more correct output:

1 (word,
1 std.stdio;
1 int[string]
1 }
1 "
1 void
1 import
3 foreach
1 main()
1 aa)
1 line.split())
1 stdin.byLine())
1 (line;
1 freq;
1 (word;
1 ",
1 std.string,
1 word);
1 writeln(freq,
1 aa[word.dup]++;
1 aa;
1 {


It's easy to forget to dup/idup.

July 17, 2010

--- Comment #4 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-07-17 11:06:02 PDT ---
That example is the manifestation of another bug:

http://d.puremagic.com/issues/show_bug.cgi?id=2954

July 17, 2010

--- Comment #5 from bearophile_hugs@eml.cc 2010-07-17 11:46:28 PDT ---
If you think this bug report is invalid and byLine() is safe (because the type
system is enough, being able to tell apart char[] and string), then you can
close this bug report.
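
To make the type-system argument concrete, here is a minimal sketch of the word counter with string keys. It assumes a compiler in which issue 2954 is fixed, so that a char[] slice is not silently accepted as a key of an int[string] array. The function name countWords is made up for this example.

```d
import std.array : split;
import std.stdio : File, stdin, writeln;

/// Counts whitespace-separated words; every key is an immutable copy.
int[string] countWords(File f)
{
    int[string] freq;
    foreach (line; f.byLine())          // line is char[]; its buffer is reused
        foreach (word; line.split())    // word is a char[] slice of that buffer
        {
            // Writing freq[word]++ here would pass a char[] where a string
            // key is expected; with issue 2954 fixed, the compiler is
            // expected to reject it instead of storing an aliased key.
            freq[word.idup]++;          // idup yields a string the array can keep
        }
    return freq;
}

void main()
{
    foreach (word, n; countWords(stdin))
        writeln(n, " ", word);
}
```

Under that assumption, forgetting the copy becomes a compile-time error rather than corrupted keys at run time.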

July 25, 2010

bearophile_hugs@eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


--- Comment #6 from bearophile_hugs@eml.cc 2010-07-24 19:07:33 PDT ---
Bug closed because Andrei says byLine() is safe :-)
