Using lazy code to process large files - D Programming Language Discussion Forum


August 02, 2017

Using lazy code to process large files

Posted by Martin Drašar

Hi,

I am struggling to use lazy range-based code to process large text files. My task is simple, i.e., I could write non-range-based code in a really short time, but I wanted to try a different approach and I keep hitting wall after wall.

Task: read CSV-like input, take only lines starting with some string, split by comma, remove leading and trailing whitespace from the split elements, join by comma again, and write to an output.

My attempt so far:

alias stringStripLeft = std.string.stripLeft;

auto input  = File("input.csv");
auto output = File("output.csv");

auto result = input.byLine()
                   .filter!(a => a.startsWith("..."))
                   .map!(a => a.splitter(","))
                   .stringStripleft // <-- errors start here
                   .join(",");

output.write(result);

Needless to say, this does not compile. Basically, I don't know how to feed MapResults to splitter and then sensibly join it.

Thank you for any hint.
Martin

August 02, 2017

Re: Using lazy code to process large files

Posted by Steven Schveighoffer
in reply to Martin Drašar


The problem is that you are 2 ranges deep when you apply splitter. The result of the map is a range of ranges.

Then when you apply stringStripleft, you are applying it to the map result, not the splitter result.

What you need is to bury the action on each string into the map:

.map!(a => a.splitter(",").map!(stringStripLeft).join(","))

The internal map is because stripLeft doesn't take a range of strings (the result of splitter), it takes a range of dchar (which is each element of splitter). So you use map to apply the function to every element.

Disclaimer: I haven't tested to see this works, but I think it should.

Note that I have kept your call to join, even though it is not actually lazy: it builds a string out of the result (and probably a dstring). Use joiner to do it truly lazily.

I will also note that the result is not going to look like what you think, as outputting a range looks like this: [element, element, element, ...]

You could potentially output like this:

output.write(result.joiner("\n"));

Which I think will work. Again, no testing.

I wouldn't expect good performance from this, as there is auto-decoding all over the place.

-Steve
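A minimal sketch of the fully lazy variant described above, assuming the nested map plus joiner approach works as suggested. The helper name `cleanLines` and the sample data are hypothetical, and `strip` is used instead of `stripLeft` so that trailing whitespace is removed as well, as the task asks:

```d
import std.algorithm : filter, joiner, map, splitter, startsWith;
import std.stdio : writeln;
import std.string : strip;

/// Lazily keeps the lines that start with `prefix` and trims every
/// comma-separated field. `cleanLines` is a hypothetical helper name.
auto cleanLines(R)(R lines, string prefix)
{
    return lines
        .filter!(line => line.startsWith(prefix))
        // The inner map runs on each field, not on the range of fields.
        .map!(line => line.splitter(",")
                          .map!(field => field.strip)
                          .joiner(","));
}

void main()
{
    // Sample data standing in for File("input.csv").byLine.
    auto lines = ["keep, a ,b", "drop,c", "keep,  d"];

    // Each row is a lazy range of characters; nothing is joined
    // into a new string before it is written out.
    foreach (row; cleanLines(lines, "keep"))
        writeln(row);
}
```

With `File.byLine` as the input, each row has to be written before the next line is read, because `byLine` reuses its buffer between lines.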

August 02, 2017

Re: Using lazy code to process large files

Posted by kdevel
in reply to Martin Drašar


On Wednesday, 2 August 2017 at 11:44:30 UTC, Martin Drašar wrote:
> Thank you for any hint.

import std.stdio;
import std.string;
import std.algorithm;
import std.conv;

void main ()
{
   auto input  = File("input.csv");

   auto result = input.byLine()
      .filter!(a => a.startsWith("..."))
      .map!(a => a.splitter(",")
         .map!(b => b.stripLeft)
         .join(","))
      .join("\n");

   auto output = File("output.csv", "w");
   output.write(result);
}

August 02, 2017

Re: Using lazy code to process large files

Posted by Martin Drašar
in reply to Steven Schveighoffer


Thanks, Steven, for the explanation. Just to clarify: what would be needed to avoid auto-decoding in this case? Process it all as arrays, using byChunk to read it, etc.?

@kdevel: Thank you for your solution as well.

Martin

August 02, 2017

Re: Using lazy code to process large files

Posted by Daniel Kozak


Using http://dlang.org/phobos/std_utf.html#byCodeUnit could help.

August 02, 2017

Re: Using lazy code to process large files

Posted by Daniel Kozak


Something like file.byLine.map!(a=>a.byCodeUnit)

August 02, 2017

Re: Using lazy code to process large files

Posted by Steven Schveighoffer
in reply to Martin Drašar


On 8/2/17 8:59 AM, Martin Drašar via Digitalmars-d-learn wrote:

> Thanks Steven for the explanation. Just to clarify - what would be
> needed to avoid auto-decoding in this case? Process it all as an arrays,
> using byChunk to read it, etc?
> 

As Daniel said, using byCodeUnit will help.

I don't know what the result of this is when outputting, however. I'd be concerned that it just integer-promotes the data to dchars before outputting. If your file data is all ASCII, it should work fine. You'd have to experiment to see how it works.

-Steve
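To illustrate what byCodeUnit changes, here is a small sketch (the variable name `line` is only for illustration). A plain string iterated as a range yields decoded dchar elements, while the byCodeUnit wrapper exposes the raw UTF-8 code units and keeps length and random access:

```d
import std.range : ElementType;
import std.utf : byCodeUnit;

void main()
{
    string line = "a,b";

    // Auto-decoding: a string treated as a range has dchar elements.
    static assert(is(ElementType!string == dchar));

    // byCodeUnit yields the underlying code units without decoding.
    static assert(is(ElementType!(typeof(line.byCodeUnit)) == immutable(char)));

    // The wrapper still knows its length in code units.
    assert(line.byCodeUnit.length == 3);
}
```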

August 02, 2017

Re: Using lazy code to process large files

Posted by kdevel
in reply to Steven Schveighoffer


On Wednesday, 2 August 2017 at 13:45:01 UTC, Steven Schveighoffer wrote:
> As Daniel said, using byCodeUnit will help.

stripLeft seems to autodecode even when fed with CodeUnits. How do I prevent this?

      1 void main ()
      2 {
      3    import std.stdio;
      4    import std.string;
      5    import std.conv;
      6    import std.utf;
      7    import std.algorithm;
      8
      9    string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
     10    auto result = src
     11       .map!(a => a.byCodeUnit)
     12       .map!(a => a.stripLeft);
     13    result.writeln;
     14 }

Crashes with a C++-like dump.

August 02, 2017

Re: Using lazy code to process large files

Posted by Steven Schveighoffer
in reply to kdevel


First, as a tip, please post either a link to a paste site, or don't put the line numbers. It's much easier to copy-paste your code into an editor if you don't have the line numbers.

What has happened is that you injected a non-encoded code point. In UTF-8, any code point above 0x7f must be encoded into a sequence of several code units. See the table on this page: https://en.wikipedia.org/wiki/%C3%9C

If we use a correct code unit sequence (0xc3 0x9c, which is the UTF-8 encoding of Ü; the lowercase ü from your example would be 0xc3 0xbc), then it works: https://run.dlang.io/is/4umQoo

-Steve
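As a small illustration of the encoding point, the following sketch checks both cases with std.utf.validate. The variable names are made up for the example; the byte values follow from the UTF-8 encoding of U+00FC:

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // U+00FC (ü) is stored in UTF-8 as the two code units 0xC3 0xBC.
    string encoded = "\u00fc";
    assert(encoded.length == 2);
    assert(encoded[0] == 0xc3 && encoded[1] == 0xbc);
    validate(encoded); // well-formed, does not throw

    // A lone 0xFC byte (Latin-1 ü) is not well-formed UTF-8,
    // so anything that decodes it can fail.
    string latin1 = " \xfc";
    assertThrown!UTFException(validate(latin1));
}
```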

August 02, 2017

Re: Using lazy code to process large files

Posted by kdevel
in reply to Steven Schveighoffer


On Wednesday, 2 August 2017 at 15:52:13 UTC, Steven Schveighoffer wrote:

[...]

> First, as a tip, please post either a link to a paste site, or don't put the line numbers. It's much easier to copy-paste your code into an editor if you don't have the line numbers.

With pleasure.

[...]

> If we use the correct code unit sequence (0xc3 0x9c), then [...]

If I avoid std.string.stripLeft and use std.algorithm.stripLeft(' ') instead, it works as expected:

void main ()
{
   import std.stdio;
   import std.utf;
   import std.algorithm;

   string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
   auto result = src
      .map!(a => a.byCodeUnit)
      .map!(a => a.stripLeft(' '));
   result.writeln;
}
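Since the original task asks for both leading and trailing whitespace to be removed, the same idea can be sketched with std.algorithm's `strip`, which takes the element to remove and only compares code units. This is an assumption about how one might extend the snippet above, not something tested in the thread. It strips only the given element (here a space), not tabs or other Unicode whitespace:

```d
import std.algorithm : strip;
import std.utf : byCodeUnit;

void main()
{
    // Space, a raw Latin-1 byte, then two spaces.
    string field = " \xfc  ";

    // Element-wise strip: compares each code unit with ' ' and never decodes.
    auto trimmed = field.byCodeUnit.strip(' ');

    assert(trimmed.length == 1);
    assert(trimmed.front == 0xfc);
}
```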
