Thread overview
persistent byLine
Jul 22, 2013: Nick Treleaven
Jul 22, 2013: Brad Anderson
Jul 22, 2013: Jonathan M Davis
Jul 23, 2013: monarch_dodra
Jul 23, 2013: bearophile
Jul 23, 2013: Nick Treleaven
Jul 24, 2013: Nick Treleaven
Jul 25, 2013: Jonathan M Davis
Jul 26, 2013: Nick Treleaven
Jul 25, 2013: H. S. Teoh
Jul 25, 2013: Brad Anderson
Jul 27, 2013: Nick Treleaven
Jul 24, 2013: Jakob Ovrum
Jul 24, 2013: Peter Alexander
Jul 24, 2013: Jakob Ovrum
Jul 26, 2013: Nick Treleaven
July 22, 2013
Hi,
I remember an example in some slides by Walter which had this snippet (slightly simplified):

stdin.byLine.map!(l => l.idup).array

Someone commented in a reddit post that the idup part was not intuitive (can't find the link now, sorry).

I made a pull request to re-enable using byLine!(char, immutable char). (Note this compiled in the current release, but didn't work properly AFAICT. It did work by commit 97cec33^).

https://github.com/D-Programming-Language/phobos/pull/1418

Using that allows us to drop the map!(l => l.idup) part from the above snippet. The new syntax isn't much better, but it can also be more efficient (as it caches front). I have an idea how to improve the syntax, but I'll omit it for this post.

I've since thought that if most or all lines in a file need to be persistent, it may be more efficient to use readText(filename).splitLines, because that doesn't need to allocate for each line.
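
To make the comparison concrete, here is a minimal sketch of both approaches. The helper names linesViaByLine and linesViaReadText and the scratch file name are made up for illustration; only byLine, idup, readText and splitLines come from Phobos.

```d
import std.algorithm : map;
import std.array : array;
import std.file : readText, remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File, writeln;
import std.string : splitLines;

// byLine reuses its buffer, so every line is copied with idup:
// one allocation per line. (Hypothetical helper name.)
string[] linesViaByLine(string path)
{
    return File(path).byLine.map!(l => l.idup).array;
}

// readText allocates the whole file once; splitLines returns
// slices into that text. (Hypothetical helper name.)
string[] linesViaReadText(string path)
{
    return readText(path).splitLines;
}

void main()
{
    // Scratch file used only for this demonstration (assumed name).
    auto path = buildPath(tempDir, "persistent_byline_demo.txt");
    write(path, "one\ntwo\nthree\n");
    scope (exit) remove(path);

    assert(linesViaByLine(path) == ["one", "two", "three"]);
    assert(linesViaReadText(path) == ["one", "two", "three"]);
    writeln(linesViaReadText(path));
}
```

The trade-off is that the second version keeps the entire file contents alive for as long as any one line is referenced.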

There are two enhancements for that approach:
1. readText should accept a File, not just a filename, so we can use stdin.
2. splitLines makes an array. It would be more flexible to have an input range created from a function e.g. lineSplitter.

With these enhancements, we could change the byLine docs to recommend using File.readText.lineSplitter if most lines need to be persistent.

Thoughts?
July 22, 2013
On Monday, 22 July 2013 at 12:08:06 UTC, Nick Treleaven wrote:
> Hi,
> I remember an example in some slides by Walter which had this snippet (slightly simplified):
>
> stdin.byLine.map!(l => l.idup).array
>
> Someone commented in a reddit post that the idup part was not intuitive (can't find the link now, sorry).
>
> I made a pull request to re-enable using byLine!(char, immutable char). (Note this compiled in the current release, but didn't work properly AFAICT. It did work by commit 97cec33^).
>
> https://github.com/D-Programming-Language/phobos/pull/1418
>
> Using that allows us to drop the map!(l => l.idup) part from the above snippet. The new syntax isn't much better, but it can also be more efficient (as it caches front). I have an idea how to improve the syntax, but I'll omit it for this post.
>
> I've since thought that if most or all lines in a file need to be persistent, it may be more efficient to use readText(filename).splitLines, because that doesn't need to allocate for each line.
>
> There are two enhancements for that approach:
> 1. readText should accept a File, not just a filename, so we can use stdin.
> 2. splitLines makes an array. It would be more flexible to have an input range created from a function e.g. lineSplitter.
>
> With these enhancements, we could change the byLine docs to recommend using File.readText.lineSplitter if most lines need to be persistent.
>
> Thoughts?

I remember seeing Walter's component programming example and feeling that the idup stuck out like a sore thumb.

I like the idea. It makes simple programs even simpler without sacrificing performance by changing byLine. Ideally readText would take generic streams rather than Files, but that will have to wait until std.io gets finished.
July 22, 2013
On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
> Hi,
> I remember an example in some slides by Walter which had this snippet
> (slightly simplified):
> 
> stdin.byLine.map!(l => l.idup).array
> 
> Someone commented in a reddit post that the idup part was not intuitive (can't find the link now, sorry).
> 
> I made a pull request to re-enable using byLine!(char, immutable char).
> (Note this compiled in the current release, but didn't work properly
> AFAICT. It did work by commit 97cec33^).
> 
> https://github.com/D-Programming-Language/phobos/pull/1418
> 
> Using that allows us to drop the map!(l => l.idup) part from the above snippet. The new syntax isn't much better, but it can also be more efficient (as it caches front). I have an idea how to improve the syntax, but I'll omit it for this post.

I agree with monarch in that we really shouldn't try and mess with byLine like this. It would just be cleaner to come up with a new function for this, though I confess that part of me thinks that it's better to just use map!(a => a.idup)(), because it avoids duplicating functionality. It is arguably a bit ugly though.

> I've since thought that if most or all lines in a file need to be persistent, it may be more efficient to use readText(filename).splitLines, because that doesn't need to allocate for each line.
> 
> There are two enhancements for that approach:
> 1. readText should accept a File, not just a filename, so we can use stdin.

I'm opposed to this. I don't think that std.file should be using std.stdio.File at all. What we really need is for std.io to be finished, which will revamp std.stdio and give us streams and the like. And std.file really is not designed around using stdin - and shouldn't be IMHO. It's for operating on actual files.

> 2. splitLines makes an array. It would be more flexible to have an input range created from a function e.g. lineSplitter.

splitter will do the job just fine as long as you don't care about \r\n - though we should arguably come up with a solution that works with \r\n (assuming that something in std.range or std.algorithm doesn't already do it, and I'm just not thinking of it at the moment).

- Jonathan M Davis
July 23, 2013
On Monday, 22 July 2013 at 23:28:46 UTC, Jonathan M Davis wrote:
> On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
>> Hi,
>> I remember an example in some slides by Walter which had this snippet
>> (slightly simplified):
>> 
>> stdin.byLine.map!(l => l.idup).array
>> 
>> Someone commented in a reddit post that the idup part was not intuitive
>> (can't find the link now, sorry).
>> 
>> I made a pull request to re-enable using byLine!(char, immutable char).
>> (Note this compiled in the current release, but didn't work properly
>> AFAICT. It did work by commit 97cec33^).
>> 
>> https://github.com/D-Programming-Language/phobos/pull/1418
>> 
>> Using that allows us to drop the map!(l => l.idup) part from the above
>> snippet. The new syntax isn't much better, but it can also be more
>> efficient (as it caches front). I have an idea how to improve the
>> syntax, but I'll omit it for this post.
>
> I agree with monarch in that we really shouldn't try and mess with byLine like
> this. It would just be cleaner to come up with a new function for this, though
> I confess that part of me thinks that it's better to just use map!(a =>
> a.idup)(), because it avoids duplicating functionality. It is arguably a bit
> ugly though.

It is also inefficient: if you pipe it to an algorithm that calls front more than once on the same element, you'll end up duping more than once. It could be mitigated with my proposed "cache" adaptor, but then we'd have:
file.byLine().map!(a => a.idup)().cache();

It works, but it is horrible to look at.

I'm rather in favor of a named "byDupLine", which is self-documenting, easy to use, and new/casual programmer friendly.
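
The repeated-duplication point can be checked with a small sketch. The function name dupsAfterTwoFrontCalls and the counter are illustrative assumptions; the behaviour relied on is that std.algorithm.map re-runs its function on every call to front.

```d
import std.algorithm : map;
import std.stdio : writeln;

// Returns how many times the idup lambda ran after front was read twice.
// (Hypothetical helper, for illustration only.)
int dupsAfterTwoFrontCalls()
{
    int dups; // incremented each time the lambda runs
    char[][] lines = ["line".dup];
    auto r = lines.map!((char[] l) { ++dups; return l.idup; });

    auto a = r.front;
    auto b = r.front; // map does not cache front, so the lambda runs again
    assert(a == b);   // same contents, but produced by separate idup calls
    return dups;
}

void main()
{
    writeln(dupsAfterTwoFrontCalls()); // one idup per call to front
}
```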
July 23, 2013
monarch_dodra:

> I'm rather in favor of a named "byDupLine", which is self-documenting, easy to use, and new/casual programmer friendly.

See also (a three-year-old discussion):
http://d.puremagic.com/issues/show_bug.cgi?id=4474

Bye,
bearophile
July 23, 2013
(resending from the forum, original didn't arrive for some reason)

On 23/07/2013 00:28, Jonathan M Davis wrote:
> On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
>> I made a pull request to re-enable using byLine!(char, immutable char).
>> (Note this compiled in the current release, but didn't work properly
>> AFAICT. It did work by commit 97cec33^).
>>
>> https://github.com/D-Programming-Language/phobos/pull/1418
>>
>> Using that allows us to drop the map!(l => l.idup) part from the above
>> snippet. The new syntax isn't much better, but it can also be more
>> efficient (as it caches front). I have an idea how to improve the
>> syntax, but I'll omit it for this post.
>
> I agree with monarch in that we really shouldn't try and mess with byLine like
> this. It would just be cleaner to come up with a new function for this, though
> I confess that part of me thinks that it's better to just use map!(a =>
> a.idup)(), because it avoids duplicating functionality. It is arguably a bit
> ugly though.

I think I'll close that PR then. I reiterate that the readText.splitter approach is probably more efficient in most cases than either byLine/map/idup or byLine!(char, immutable char), unless e.g. byLineDup were implemented so that it allocated more than one line at once.

>> I've since thought that if most or all lines in a file need to be
>> persistent, it may be more efficient to use
>> readText(filename).splitLines, because that doesn't need to allocate for
>> each line.
>>
>> There are two enhancements for that approach:
>> 1. readText should accept a File, not just a filename, so we can use stdin.
>
> I'm opposed to this. I don't think that std.file should be using std.stdio.File
> at all. What we really need is for std.io to be finished, which will revamp
> std.stdio and give us streams and the like. And std.file really is not designed
> around using stdin - and shouldn't be IMHO. It's for operating on actual files.

Yes, I meant add std.stdio.File.readText. Would that be OK to add now, or is std.io likely to be added relatively soon?

>> 2. splitLines makes an array. It would be more flexible to have an input
>> range created from a function e.g. lineSplitter.
>
> splitter will do the job just fine as long as you don't care about \r\n -

I currently use Windows ;-)

> though we should arguably come up with a solution that works with \r\n
> (assuming that something in std.range or std.algorithm doesn't already do it,
> and I'm just not thinking of it at the moment).

splitter does work:
"splitting\r\nlines\r\nworks\r\n!".splitter("\r\n").writeln();

I can live with that. However...

The nice thing about splitLines is that it doesn't care what kind of line endings are in the file, i.e. you don't need to tell it in advance. You don't have to pass it std.ascii.newline as a separator in order to get portability.

Portability should be the default. That's what I intend for lineSplitter, which IMO would be better than the readln/byLine specific terminator approach. It would handle all text files portably, even ones that don't have the official system line ending chars, by default.

lineSplitter would be useful in other ways, e.g. for counting lines in a string.
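
For readers on a later Phobos release: std.string has since gained a lineSplitter function along these lines. A brief sketch, assuming a compiler recent enough to ship it; the countLines helper is a made-up name for illustration.

```d
import std.algorithm : equal;
import std.range : walkLength;
import std.stdio : writeln;
import std.string : lineSplitter;

// Counts lines lazily, without building an array of them.
// (Hypothetical helper name.)
size_t countLines(string text)
{
    return text.lineSplitter.walkLength;
}

void main()
{
    // Mixed line endings in one string; no separator is passed in.
    auto text = "one\ntwo\r\n\r\nthree";

    // A lazy range of slices into text; the empty line is preserved.
    assert(text.lineSplitter.equal(["one", "two", "", "three"]));

    writeln(countLines(text));
}
```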

July 24, 2013
On 23/07/2013 17:13, Nick Treleaven wrote:
> On 23/07/2013 00:28, Jonathan M Davis wrote:
>> On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
>>> I made a pull request to re-enable using byLine!(char, immutable char).
>>> (Note this compiled in the current release, but didn't work properly
>>> AFAICT. It did work by commit 97cec33^).
>>>
>>> https://github.com/D-Programming-Language/phobos/pull/1418
>>>
>>> Using that allows us to drop the map!(l => l.idup) part from the above
>>> snippet. The new syntax isn't much better, but it can also be more
>>> efficient (as it caches front). I have an idea how to improve the
>>> syntax, but I'll omit it for this post.
>>
>> I agree with monarch in that we really shouldn't try and mess with
>> byLine like
>> this. It would just be cleaner to come up with a new function for
>> this, though
>> I confess that part of me thinks that it's better to just use map!(a =>
>> a.idup)(), because it avoids duplicating functionality. It is arguably
>> a bit
>> ugly though.
>
> I think I'll close that PR then. I reiterate that the readText.splitter
> approach is perhaps usually more efficient than either byLine/map/idup
> or byLine!(char, immutable char). Unless e.g. byLineDup was implemented
> so it allocated more than one line at once.

Although, the caller might not want all the file contents to hang around just because one line was needed, so I think byLineDup probably is needed too. It's good to have a range of options.

I might make a PR for byLineDup sometime, unless anyone beats me to it.
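
For reference, later Phobos releases include File.byLineCopy, which fills the role discussed here under a different name. A minimal sketch, assuming a release that ships it; the persistentLines helper and the scratch file name are illustrative assumptions.

```d
import std.array : array;
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File, writeln;

// Each element is a freshly allocated string, so it stays valid
// after the range advances. (Hypothetical helper name.)
string[] persistentLines(string path)
{
    return File(path).byLineCopy.array;
}

void main()
{
    // Scratch file used only for this demonstration (assumed name).
    auto path = buildPath(tempDir, "bylinecopy_demo.txt");
    write(path, "one\ntwo\nthree\n");
    scope (exit) remove(path);

    auto lines = persistentLines(path);
    assert(lines == ["one", "two", "three"]);
    writeln(lines);
}
```

Unlike the readText approach, keeping one of these lines does not keep the rest of the file contents alive.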

>>> I've since thought that if most or all lines in a file need to be
>>> persistent, it may be more efficient to use
>>> readText(filename).splitLines, because that doesn't need to allocate for
>>> each line.
>>>
>>> There are two enhancements for that approach:
>>> 1. readText should accept a File, not just a filename, so we can use
>>> stdin.
>>
>> I'm opposed to this. I don't think that std.file should be using
>> std.stdio.File
>> at all. What we really need is for std.io to be finished, which will
>> revamp
>> std.stdio and give us streams and the like. And std.file really is not
>> designed
>> around using stdin - and shouldn't be IMHO. It's for operating on
>> actual files.
>
> Yes, I meant add std.stdio.File.readText. Would that be OK to add now,
> or is std.io likely to be added relatively soon?

Also, is there any info on std.io? I assume it's still worthwhile to update the std.stdio docs and fix issues?
July 24, 2013
On Monday, 22 July 2013 at 23:28:46 UTC, Jonathan M Davis wrote:
> splitter will do the job just fine as long as you don't care about \r\n -
> though we should arguably come up with a solution that works with \r\n
> (assuming that something in std.range or std.algorithm doesn't already do it,
> and I'm just not thinking of it at the moment).
>
> - Jonathan M Davis

std.regex.splitter can handle newlines more flexibly, e.g.:

void main()
{
	import std.algorithm : equal;
	import std.regex : ctRegex, splitter;

	auto text = "one\ntwo\r\nthree";

	auto newlinePattern = ctRegex!"[\r\n]+";

	assert(text.splitter(newlinePattern).equal(["one", "two", "three"]));
}
July 24, 2013
On Wednesday, 24 July 2013 at 17:13:10 UTC, Jakob Ovrum wrote:
> 	auto newlinePattern = ctRegex!"[\r\n]+";

That will swallow empty lines.

July 24, 2013
On Wednesday, 24 July 2013 at 17:26:58 UTC, Peter Alexander wrote:
> On Wednesday, 24 July 2013 at 17:13:10 UTC, Jakob Ovrum wrote:
>> 	auto newlinePattern = ctRegex!"[\r\n]+";
>
> That will swallow empty lines.

Yeah, it's just an example. The specific pattern obviously depends on the exact behaviour you want, but I think any desired behaviour regarding newlines can be trivially expressed with regex.
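
As one possible illustration of that point, the sketch below contrasts the pattern from the earlier example with one that matches a single terminator at a time, so empty lines are kept. The second pattern and the splitKeepingEmpty helper are assumptions chosen for this example, not something proposed in the thread.

```d
import std.algorithm : equal;
import std.array : array;
import std.regex : regex, splitter;
import std.stdio : writeln;

// Splits on exactly one line terminator per match, so empty lines survive.
// (Hypothetical helper name and pattern.)
string[] splitKeepingEmpty(string text)
{
    return text.splitter(regex("\r\n|\r|\n")).array;
}

void main()
{
    auto text = "one\n\ntwo\r\nthree";

    // A run of terminators is treated as one separator: the empty line is lost.
    assert(text.splitter(regex("[\r\n]+")).equal(["one", "two", "three"]));

    // One terminator per match: the empty line is preserved.
    assert(splitKeepingEmpty(text) == ["one", "", "two", "three"]);
    writeln(splitKeepingEmpty(text));
}
```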