Using lazy code to process large files
August 02, 2017
Hi,

I am struggling to use lazy range-based code to process large text files. My task is simple, i.e., I could write non-range-based code for it in a really short time, but I wanted to try a different approach and I keep hitting wall after wall.

Task: read a CSV-like input, take only the lines starting with some string, split them by comma, remove leading and trailing whitespace from the split elements, join them by comma again and write the result to an output.

My attempt so far:

alias stringStripLeft = std.string.stripLeft;

auto input  = File("input.csv");
auto output = File("output.csv");

auto result = input.byLine()
                   .filter!(a => a.startsWith("..."))
                   .map!(a => a.splitter(","))
                   .stringStripleft // <-- errors start here
                   .join(",");

output.write(result);

Needless to say, this does not compile. Basically, I don't know how to feed MapResults to splitter and then sensibly join it.

Thank you for any hint.
Martin
August 02, 2017
On 8/2/17 7:44 AM, Martin Drašar via Digitalmars-d-learn wrote:
> Hi,
> 
> I am struggling to use lazy range-based code to process large text
> files. My task is simple, i.e., I could write non-range-based code for
> it in a really short time, but I wanted to try a different approach and
> I keep hitting wall after wall.
> 
> Task: read a CSV-like input, take only the lines starting with some
> string, split them by comma, remove leading and trailing whitespace from
> the split elements, join them by comma again and write the result to an
> output.
> 
> My attempt so far:
> 
> alias stringStripLeft = std.string.stripLeft;
> 
> auto input  = File("input.csv");
> auto output = File("output.csv");
> 
> auto result = input.byLine()
>                     .filter!(a => a.startsWith("..."))
>                     .map!(a => a.splitter(","))
>                     .stringStripleft // <-- errors start here
>                     .join(",");
> 
> output.write(result);
> 
> Needless to say, this does not compile. Basically, I don't know how to
> feed MapResults to splitter and then sensibly join it.

The problem is that you are 2 ranges deep when you apply splitter. The result of the map is a range of ranges.

Then when you apply stringStripleft, you are applying to the map result, not the splitter result.

What you need is to bury the action on each string into the map:

.map!(a => a.splitter(",").map!(stringStripLeft).join(","))

The internal map is because stripLeft doesn't take a range of strings (the result of splitter), it takes a range of dchar (which is each element of splitter). So you use map to apply the function to every element.

Disclaimer: I haven't tested to see this works, but I think it should.

Note that I have forwarded your call to join, even though this actually is not lazy, it builds a string out of it (and actually probably a dstring). Use joiner to do it truly lazily.

I will also note that the result is not going to look like what you think, as outputting a range looks like this: [element, element, element, ...]

You could potentially output like this:

output.write(result.joiner("\n"));

Which I think will work. Again, no testing.

I wouldn't expect good performance from this, as there is auto-decoding all over the place.

-Steve
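
A minimal sketch of the nested-map idea described above, using joiner at both levels so that no intermediate string is built. The helper name cleanLines, the "keep" prefix and the sample lines are made up for illustration, and strip is used instead of stripLeft because the task asks for both leading and trailing whitespace to be removed.

```d
import std.algorithm : equal, filter, joiner, map, splitter, startsWith;
import std.stdio : writeln;
import std.string : strip;

// Hypothetical helper: lazily cleans every line that starts with `prefix`.
auto cleanLines(R)(R lines, string prefix)
{
    return lines
        .filter!(line => line.startsWith(prefix))
        // The inner map works on the fields of ONE line,
        // so strip sees a single string at a time.
        .map!(line => line.splitter(",")
                          .map!(field => field.strip)
                          .joiner(","))
        // The outer joiner flattens the range of lines into one character range.
        .joiner("\n");
}

void main()
{
    // Stand-in for File("input.csv").byLine(); the data is invented.
    string[] lines = [
        "keep,  a , b",
        "drop, c, d",
        "keep,e ,  f  ",
    ];

    writeln(cleanLines(lines, "keep"));
}
```

With a real file, the same helper could presumably be fed File("input.csv").byLine() and the result passed to output.write; that variant is an assumption and has not been tried here.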
August 02, 2017
On Wednesday, 2 August 2017 at 11:44:30 UTC, Martin Drašar wrote:
> Thank you for any hint.

import std.stdio;
import std.string;
import std.algorithm;
import std.conv;

void main ()
{
   auto input  = File("input.csv");

   auto result = input.byLine()
      .filter!(a => a.startsWith("..."))
      .map!(a => a.splitter(",")
         .map!(b => b.stripLeft)
         .join(","))
      .join("\n");

   auto output = File("output.csv", "w");
   output.write(result);
}

August 02, 2017
On 2.8.2017 at 14:45, Steven Schveighoffer via Digitalmars-d-learn wrote:

> The problem is that you are 2 ranges deep when you apply splitter. The result of the map is a range of ranges.
> 
> Then when you apply stringStripleft, you are applying to the map result, not the splitter result.
> 
> What you need is to bury the action on each string into the map:
> 
> .map!(a => a.splitter(",").map!(stringStripLeft).join(","))
> 
> The internal map is because stripLeft doesn't take a range of strings (the result of splitter), it takes a range of dchar (which is each element of splitter). So you use map to apply the function to every element.
> 
> Disclaimer: I haven't tested to see this works, but I think it should.
> 
> Note that I have forwarded your call to join, even though this actually is not lazy, it builds a string out of it (and actually probably a dstring). Use joiner to do it truly lazily.
> 
> I will also note that the result is not going to look like what you think, as outputting a range looks like this: [element, element, element, ...]
> 
> You could potentially output like this:
> 
> output.write(result.joiner("\n"));
> 
> Which I think will work. Again, no testing.
> 
> I wouldn't expect good performance from this, as there is auto-decoding all over the place.
> 
> -Steve

Thanks Steven for the explanation. Just to clarify - what would be needed to avoid auto-decoding in this case? Process it all as arrays, using byChunk to read it, etc.?

@kdevel: Thank you for your solution as well.

Martin
August 02, 2017
Using http://dlang.org/phobos/std_utf.html#byCodeUnit could help.

August 02, 2017
Something like file.byLine.map!(a => a.byCodeUnit)

August 02, 2017
On 8/2/17 8:59 AM, Martin Drašar via Digitalmars-d-learn wrote:

> Thanks Steven for the explanation. Just to clarify - what would be
> needed to avoid auto-decoding in this case? Process it all as arrays,
> using byChunk to read it, etc.?
> 

As Daniel said, using byCodeUnit will help.

I don't know what the result of this is when outputting, however. I'd be concerned it just integer promoted the data to dchars before outputting. If your file data is all ASCII it should work fine. You'd have to experiment to see how it works.

-Steve
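
To make the byCodeUnit suggestion a bit more concrete, here is a small sketch of what changes when a string is wrapped that way. The sample string is invented; the only claim is about element types and lengths, not about how the output side behaves.

```d
import std.range : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit;

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // A plain string iterates as decoded code points (dchar) ...
    static assert(is(ElementType!string == dchar));
    // ... while byCodeUnit exposes the raw char code units.
    static assert(is(Unqual!(ElementType!(typeof(s.byCodeUnit))) == char));

    assert(s.walkLength == 5);        // code points, requires decoding
    assert(s.byCodeUnit.length == 6); // code units, no decoding involved
}
```

Since the elements are plain char values, range algorithms that only compare elements can run over them without decoding.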
August 02, 2017
On Wednesday, 2 August 2017 at 13:45:01 UTC, Steven Schveighoffer wrote:
> As Daniel said, using byCodeUnit will help.

stripLeft seems to autodecode even when fed with CodeUnits. How do I prevent this?

      1 void main ()
      2 {
      3    import std.stdio;
      4    import std.string;
      5    import std.conv;
      6    import std.utf;
      7    import std.algorithm;
      8
      9    string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
     10    auto result = src
     11       .map!(a => a.byCodeUnit)
     12       .map!(a => a.stripLeft);
     13    result.writeln;
     14 }

Crashes with a C++-like dump.

August 02, 2017
On 8/2/17 11:02 AM, kdevel wrote:
> On Wednesday, 2 August 2017 at 13:45:01 UTC, Steven Schveighoffer wrote:
>> As Daniel said, using byCodeUnit will help.
> 
> stripLeft seems to autodecode even when fed with CodeUnits. How do I prevent this?
> 
>        1 void main ()
>        2 {
>        3    import std.stdio;
>        4    import std.string;
>        5    import std.conv;
>        6    import std.utf;
>        7    import std.algorithm;
>        8
>        9    string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
>       10    auto result = src
>       11       .map!(a => a.byCodeUnit)
>       12       .map!(a => a.stripLeft);
>       13    result.writeln;
>       14 }
> 
> Crashes with a C++-like dump.
> 

First, as a tip, please post either a link to a paste site, or don't put the line numbers. It's much easier to copy-paste your code into an editor if you don't have the line numbers.

What has happened is that you injected a non-encoded code point. In UTF8, any code point above 0x7f must be encoded into a string of several code units. See the table on this page: https://en.wikipedia.org/wiki/%C3%9C

If we use the correct code unit sequence (0xc3 0x9c), then it works: https://run.dlang.io/is/4umQoo

-Steve
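
A short sketch of the encoding point, assuming std.utf.validate is an acceptable way to check a string. A lone 0xFC byte is Latin-1 and is rejected as UTF-8. For reference, 0xC3 0x9C is the UTF-8 encoding of the uppercase Ü (U+00DC, the character on the linked page), while the lowercase ü (U+00FC) from the example encodes as 0xC3 0xBC.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    string latin1 = " \xfc";     // lone 0xFC byte: not valid UTF-8
    string utf8   = " \xc3\xbc"; // UTF-8 encoding of U+00FC (lowercase u umlaut)

    // validate throws UTFException on malformed input.
    assertThrown!UTFException(validate(latin1));
    validate(utf8); // well-formed, does not throw

    assert(utf8 == " \u00FC");
    assert("\xc3\x9c" == "\u00DC"); // uppercase U umlaut
}
```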
August 02, 2017
On Wednesday, 2 August 2017 at 15:52:13 UTC, Steven Schveighoffer wrote:

[...]

> First, as a tip, please post either a link to a paste site, or don't put the line numbers. It's much easier to copy-paste your code into an editor if you don't have the line numbers.

With pleasure.

[...]

> If we use the correct code unit sequence (0xc3 0x9c), then [...]

If I avoid std.string.stripLeft and use std.algorithm.stripLeft(' ') instead it works as expected:

void main ()
{
   import std.stdio;
   import std.utf;
   import std.algorithm;

   string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
   auto result = src
      .map!(a => a.byCodeUnit)
      .map!(a => a.stripLeft(' '));
   result.writeln;
}
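
For comparison, a small sketch of how the two stripLeft functions differ. The aliases stripWs and stripElem are made-up names, introduced only so both functions can be imported side by side.

```d
import std.algorithm.mutation : stripElem = stripLeft; // strips a given element
import std.string : stripWs = stripLeft;               // strips leading whitespace
import std.utf : byCodeUnit;

void main()
{
    // std.string.stripLeft removes all leading whitespace characters.
    assert(stripWs(" \t x") == "x");

    // std.algorithm's stripLeft only removes the element it is given.
    assert(stripElem(" \t x", ' ') == "\t x");

    // On code units it compares char values and never decodes,
    // so the stray Latin-1 byte is passed through untouched.
    auto rest = " \xfc".byCodeUnit.stripElem(' ');
    assert(rest.length == 1 && rest.front == 0xfc);
}
```

The element-based version only handles the one character it is told about, so tabs or other whitespace would need to be stripped separately.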

