Thread overview
Re: [your code here]
Feb 18, 2012
H. S. Teoh
Feb 18, 2012
Timon Gehr
Re: File Api [Was: [your code here]]
Feb 18, 2012
H. S. Teoh
Feb 18, 2012
Brad Anderson
Feb 18, 2012
H. S. Teoh
February 18, 2012
On Sat, Feb 18, 2012 at 02:52:15AM +0100, Alf P. Steinbach wrote:
> On 18.02.2012 02:39, H. S. Teoh wrote:
> >// Outputs a randomly selected line from standard input with equal
> >// likelihood.
> >import std.random;
> >import std.stdio;
> >
> >void main() {
> >	auto n = 0;
> >	string choice;
> >	foreach (line; stdin.byLine()) {
> >		n++;
> >		if (uniform(0,n) == 0)
> >			choice = line.idup;
> >	}
> >	writeln(choice);
> >}
> >
> >
> >P.S. Proof that the probability any line is selected is exactly 1/n (where n is the total number of lines read) is left as an exercise for the reader. ;-)
> 
> Assuming that by "any" you mean "any particular", you would have to read all the lines first. Otherwise, if the code selects the first line with probability 1/K, then I can just input some other number of lines.

But therein lies the trick. The algorithm self-adapts to the number of lines it reads. It does not know in advance how many lines there are, but guarantees that by the end of the input, the probability that any particular line is selected is exactly 1/n, where n is the number of lines read. The key lies in the way uniform() is called. :)
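
For readers who would rather check the exercise than do it, here is one way the argument can go. This is only a sketch: k and j are 1-based line indices introduced here for the derivation, not names from the program. Line k is picked at the moment it is read with probability 1/k, because uniform(0, k) returns one of k equally likely values, and it then has to survive every later draw:

```latex
P(\text{line } k \text{ is the final choice})
  = \frac{1}{k} \prod_{j=k+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{1}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{k+2} \cdots \frac{n-1}{n}
  = \frac{1}{n}
```

The product telescopes, so the result does not depend on k.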

(This has been tested by running the program repeatedly on the same input and analysing the number of occurrences of each line. The proof of the algorithm is left as an exercise for the reader. ;-) )
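
A minimal sketch of that kind of check, done in-process rather than by rerunning the program. The helper names pickLine and tally, the four sample lines, and the trial count are all assumptions made for this illustration; only the selection rule itself comes from the program above.

```d
import std.random : uniform;
import std.stdio : writefln;

// Same selection rule as the program above, applied to an in-memory
// array instead of stdin so that it can be repeated many times.
string pickLine(string[] lines) {
    int n = 0;
    string choice;
    foreach (line; lines) {
        n++;
        if (uniform(0, n) == 0)   // true with probability 1/n
            choice = line;
    }
    return choice;
}

// Counts how often each line wins over the given number of trials.
size_t[string] tally(string[] lines, int trials) {
    size_t[string] counts;
    foreach (i; 0 .. trials)
        counts[pickLine(lines)]++;
    return counts;
}

void main() {
    auto lines = ["a", "b", "c", "d"];   // hypothetical input
    enum trials = 100_000;
    auto counts = tally(lines, trials);
    // Each frequency should come out close to 1/4.
    foreach (line; lines)
        writefln("%s: %.3f", line, cast(double) counts.get(line, 0) / trials);
}
```

An empirical run like this only makes the claim plausible; it is not a substitute for the proof.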


> >P.P.S. The .idup is a bit ugly, but necessary, since apparently byLine()
> >calls readln() with a static buffer, so choice will be silently
> >overwritten if the .idup is omitted.
> 
> That sounds ominous. One should never have to be aware of low level details in order to do simple string assignment or initialization, when the source already is a string. Does one really have to do that in D?
[...]

Well, there are two issues here. One is that stdin.byLine() returns char[], which cannot be assigned to string directly. The other is that the first version of the code didn't have the .idup (I used char[] instead of string for choice), and the output was garbage because the buffer had been overwritten.

Now I agree that one shouldn't need to know how byLine() is implemented in order to be able to use it, but this is the way the current Phobos implementation is, unfortunately.

However, the one upside to all this is that the type system, in a sense, forces you to do the right thing: byLine() returns char[] which cannot be assigned to string, so naturally you have to write .idup, which automatically also avoids the buffer overwriting issue. Using string instead of char[] is a logical choice in this case (pun intended ;-)), because it's something whose value you want to retain until the next time it's modified. So using string for 'choice' naturally leads to needing an .idup in the loop. The system isn't perfect, but it's pretty good.
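
To make the difference concrete, here is a small sketch that stores each line both ways. The scratch file name byline_demo.txt, its contents, and the helper readCopied are assumptions made for this illustration.

```d
import std.file : remove, write;
import std.stdio : File, writeln;

// Reads every line of the file and keeps an independent copy of each.
string[] readCopied(string path) {
    string[] copied;
    foreach (line; File(path).byLine())
        copied ~= line.idup;      // fresh immutable copy, safe to keep
    return copied;
}

void main() {
    // Hypothetical scratch file, created only for this demonstration.
    enum path = "byline_demo.txt";
    write(path, "alpha\nbeta\ngamma\n");
    scope (exit) remove(path);

    // Keeping the char[] slices themselves: they may all point into the
    // buffer that byLine reuses, so their contents after the loop are
    // not something to rely on.
    char[][] aliased;
    foreach (line; File(path).byLine())
        aliased ~= line;

    writeln(readCopied(path));   // the three lines as written
    writeln(aliased);            // unspecified; may not match the file
}
```

Later Phobos releases also provide File.byLineCopy, which hands out a fresh copy of each line, so the .idup can be dropped when every line is going to be stored anyway.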


T

-- 
Political correctness: socially-sanctioned hypocrisy.
February 18, 2012
On 02/18/2012 03:16 AM, H. S. Teoh wrote:
> On Sat, Feb 18, 2012 at 02:52:15AM +0100, Alf P. Steinbach wrote:
>> On 18.02.2012 02:39, H. S. Teoh wrote:
>>> // Outputs a randomly selected line from standard input with equal
>>> // likelihood.
>>> import std.random;
>>> import std.stdio;
>>>
>>> void main() {
>>> 	auto n = 0;
>>> 	string choice;
>>> 	foreach (line; stdin.byLine()) {
>>> 		n++;
>>> 		if (uniform(0,n) == 0)
>>> 			choice = line.idup;
>>> 	}
>>> 	writeln(choice);
>>> }
>>>
>>>
>>> P.S. Proof that the probability any line is selected is exactly 1/n
>>> (where n is the total number of lines read) is left as an exercise
>>> for the reader. ;-)
>>
>> Assuming that by "any" you mean "any particular", you would have to
>> read all the lines first. Otherwise, if the code selects the first
>> line with probability 1/K, then I can just input some other number of
>> lines.
>
> But therein lies the trick. The algorithm self-adapts to the number of
> lines it reads. It does not know in advance how many lines there are,
> but guarantees that by the end of the input, the probability that any
> particular line is selected is exactly 1/n, where n is the number of
> lines read. The key lies in the way uniform() is called. :)
>
> (This has been tested by running the program repeatedly on the same
> input and analysing the number of occurrences of each line. The proof of
> the algorithm is left as an exercise for the reader. ;-) )
>
>
>>> P.P.S. The .idup is a bit ugly, but necessary, since apparently byLine()
>>> calls readln() with a static buffer, so choice will be silently
>>> overwritten if the .idup is omitted.
>>
>> That sounds ominous. One should never have to be aware of low level
>> details in order to do simple string assignment or initialization,
>> when the source already is a string. Does one really have to do that
>> in D?
> [...]
>
> Well, there are two issues here. One is that stdin.byLine() returns
> char[], which cannot be assigned to string directly. Two is that the
> first version of the code didn't have the .idup (I used char[] instead
> of string for choice), and the result was garbage in the output due to
> said overwriting of buffer.
>
> Now I agree that one shouldn't need to know how byLine() is implemented
> in order to be able to use it, but this is the way the current Phobos
> implementation is, unfortunately.
>

You don't need to know how it is implemented. Everything you need to know is stated in its interface + documentation comment.

> However, the one upside to all this is that the type system, in a sense,
> forces you to do the right thing: byLine() returns char[] which cannot
> be assigned to string, so naturally you have to write .idup, which
> automatically also avoids the buffer overwriting issue. Using string
> instead of char[] is a logical choice in this case (pun intended ;-)),
> because it's something whose value you want to retain until the next
> time it's modified. So using string for 'choice' naturally leads to
> needing an .idup in the loop. The system isn't perfect, but it's pretty
> good.
>
>
> T
>

It's pretty perfect. If byLine returned string, it would be horribly inefficient for the common case of processing the input without storing it. Returning char[] even allows in-place modification of the current input line.
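
A minimal sketch of that common case: each line is transformed in place and written out, with no per-line copy. The helper name upperInPlace is an assumption made for this illustration, and it only touches ASCII letters.

```d
import std.ascii : toUpper;
import std.stdio : stdin, writeln;

// Upper-cases the ASCII letters of the line directly in the caller's
// buffer; no new array is allocated.
void upperInPlace(char[] line) {
    foreach (ref c; line)
        c = toUpper(c);
}

void main() {
    foreach (line; stdin.byLine()) {
        upperInPlace(line);   // mutates the current line's buffer
        writeln(line);        // used before the next line is read
    }
}
```

Because the line is consumed before the next iteration, reuse of the buffer is harmless here. A copy is only needed when a line has to outlive the loop body.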
February 18, 2012
On Fri, Feb 17, 2012 at 09:20:59PM -0500, bearophile wrote:
> H. S. Teoh:
> 
> > P.P.S. The .idup is a bit ugly, but necessary, since apparently
> > byLine() calls readln() with a static buffer, so choice will be
> > silently overwritten if the .idup is omitted.
> 
> An alternative File API that to me looks nice. This is a first part, it's for scripting-like or not high performance purposes, it looks essentially like Python code, every line is a newly allocated string:
> 
> import std.stdio;
> void main() {
>     string[] lines;
>     foreach (line; File("data.dat")) {
>         static assert(is(typeof(line) == string));
>         lines ~= line;
>     }
> }

I don't think it's a good idea to have File() automatically iterate by lines. What if data.dat is binary? I think it's still better to have:

	foreach (line; File("...").lines()) { ... }

where File.lines() is a lazy method that reads the file line-by-line.

For binary files, you could have File.byChunk!T() where T can be any type, struct, etc., with the appropriate method for construction on input.
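
File.byChunk!T() as described here is a proposal, not an existing Phobos method. The existing File.rawRead and File.rawWrite can already approximate it for plain-data structs, though. A rough sketch, in which the Sample struct, the helper names writeAll and readAll, and the file name samples.bin are all hypothetical:

```d
import std.file : remove;
import std.stdio : File;

// Hypothetical fixed-size record with no pointers or references.
struct Sample {
    uint id;
    float value;
}

// Writes the records to the file as raw bytes.
void writeAll(T)(string path, T[] records) {
    File(path, "wb").rawWrite(records);
}

// Reads records one at a time until the file is exhausted.
T[] readAll(T)(string path) {
    auto f = File(path, "rb");
    T[] result;
    T[1] slot;
    for (;;) {
        auto got = f.rawRead(slot[]);   // slice of what was actually read
        if (got.length == 0)
            break;                      // end of file
        result ~= got[0];
    }
    return result;
}

void main() {
    enum path = "samples.bin";
    auto data = [Sample(1, 0.5f), Sample(2, 1.5f)];
    writeAll(path, data);
    scope (exit) remove(path);
    assert(readAll!Sample(path) == data);
}
```

This copies bytes verbatim, so it only suits types without indirections, and the resulting file is not portable across differing endianness or struct layouts. A real byChunk!T would presumably want a construction hook for anything more complicated, as suggested above.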


> If you don't want a new buffer every line you use something like this:
> 
> import std.stdio;
> void main() {
>     string[] lines;
>     foreach (line; File("data.dat").fastLines()) {
>         static assert(is(typeof(line) == char[]));
>         lines ~= line.idup;
>     }
> }
> 
> So on default it's handy, short and safe, and with a method you avoid an allocation every line and get a mutable char[].

Makes sense.


> Maybe even this works, downloads a page and scans its lines, but maybe
> it's better to add a bit of extra safety to this:
> foreach (line; File("http://www.dlang.org/faq.html")) {}
[...]

I don't know about this; I would prefer downloading files to be in another class, not File. As long as they provide a Range interface, it's good enough. Perhaps File and NetworkFile (or whatever you want to call it) can implement the same interface: .lines() for safe iteration, .fastLines() for buffered iteration.
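
A sketch of what "as long as they provide a Range interface" buys the caller: code written against an input range of lines does not care where the lines come from. The function countNonEmpty is a hypothetical example, and the in-memory string merely stands in for some other source such as a downloaded page.

```d
import std.range : isInputRange;
import std.stdio : stdin, writeln;
import std.string : splitLines;

// Works with any input range whose elements have a .length, for
// example the char[] lines of byLine or the strings of an array.
size_t countNonEmpty(R)(R lines)
    if (isInputRange!R)
{
    size_t count;
    foreach (line; lines)
        if (line.length > 0)
            count++;
    return count;
}

void main() {
    // In-memory source, standing in for a non-file origin.
    auto inMemory = "first\n\nsecond".splitLines;
    writeln(countNonEmpty(inMemory));

    // The same function applied to a file-backed source.
    writeln(countNonEmpty(stdin.byLine()));
}
```

With this shape, a separate NetworkFile type would only need to expose a range of lines for existing line-processing code to work on it unchanged.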

Anyway, is anyone working on a new file I/O interface? I'd like to see std.stream and std.stdio combined, for one thing. Or replaced with a range-based API.


T

-- 
Frank disagreement binds closer than feigned agreement.
February 18, 2012
On Saturday, 18 February 2012 at 02:46:52 UTC, H. S. Teoh wrote:
> On Fri, Feb 17, 2012 at 09:20:59PM -0500, bearophile wrote:
>> H. S. Teoh:
>> 
>> > P.P.S. The .idup is a bit ugly, but necessary, since apparently
>> > byLine() calls readln() with a static buffer, so choice will be
>> > silently overwritten if the .idup is omitted.
>> 
>> An alternative File API that to me looks nice. This is a first part,
>> it's for scripting-like or not high performance purposes, it looks
>> essentially like Python code, every line is a newly allocated string:
>> 
>> import std.stdio;
>> void main() {
>>     string[] lines;
>>     foreach (line; File("data.dat")) {
>>         static assert(is(typeof(line) == string));
>>         lines ~= line;
>>     }
>> }
>
> I don't think it's a good idea to have File() automatically iterate by
> lines. What if data.dat is binary? I think it's still better to have:
>
> 	foreach (line; File("...").lines()) { ... }
>
> where File.lines() is a lazy method that reads the file line-by-line.
>
> For binary files, you could have File.byChunk!T() where T can be any
> type, struct, etc., with the appropriate method for construction on
> input.
>
>
>> If you don't want a new buffer every line you use something like this:
>> 
>> import std.stdio;
>> void main() {
>>     string[] lines;
>>     foreach (line; File("data.dat").fastLines()) {
>>         static assert(is(typeof(line) == char[]));
>>         lines ~= line.idup;
>>     }
>> }
>> 
>> So on default it's handy, short and safe, and with a method you avoid
>> an allocation every line and get a mutable char[].
>
> Makes sense.
>
>
>> Maybe even this works, downloads a page and scans its lines, but maybe
>> it's better to add a bit of extra safety to this:
>> foreach (line; File("http://www.dlang.org/faq.html")) {}
> [...]
>
> I don't know about this, I would prefer downloading files to be in
> another class, not File. As long as they provide a Range interface it's
> good enough. Perhaps File and NetworkFile (or whatever you want to call
> it) can implement the same interface, .lines() for safe iteration,
> .fastLines() for buffered iteration.
>
> Anyway, is anyone working on a new file I/O interface? I'd like to see
> std.stream and std.stdio combined, for one thing. Or replaced with a
> range-based API.
>
>
> T

Steve Schveighoffer's work in progress on std.io:

https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Regards,
Brad Anderson
February 18, 2012
On Sat, Feb 18, 2012 at 04:41:09AM +0100, Brad Anderson wrote:
> On Saturday, 18 February 2012 at 02:46:52 UTC, H. S. Teoh wrote:
[...]
> >Anyway, is anyone working on a new file I/O interface? I'd like to see std.stream and std.stdio combined, for one thing. Or replaced with a range-based API.
[...]
> Steve Schveighoffer's work in progress on std.io:
> 
> https://github.com/schveiguy/phobos/blob/new-io/std/io.d
[...]

Hmmm, I browsed through the code, and I like what I see. I'd like to see string streams implemented eventually. That would be very useful.


T

-- 
"You are a very disagreeable person." "NO."