June 03, 2017 std.csv Performance Review
Author here: The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task. Now, I know std.csv isn't fast and it allocates. When I wrote my CSV parser, I'd also left around a parser which used slicing instead of allocation[3]. I compared these two (LDC, -O3 -release):

    std.csv:     over 10 seconds
    csv slicing: under 5 seconds

An over 50% improvement isn't bad, but this still wasn't competing with the other implementations. I didn't profile std.csv's implementation, but I did take a look at the one with slicing. The majority of the time was spent in std.algorithm.startsWith, which is called by countUntil. The calls made to empty() also add up from their use in countUntil and startsWith. These functions are by no means slow: startsWith averaged 1 millisecond of execution time while countUntil was up to 5 milliseconds; the thing is, startsWith was called a whopping 384,806,160 times. Keep in mind that the file itself has 10,512,769 rows of data with four columns.

Now, I've talked about std.csv's performance in the past, probably with the author of the fast command line tools. Essentially it came down to this: std.csv is restricted to parsing with only the input range API, and you can't correctly parse CSV without allocation. But now I'm working outside those restrictions, and so I offer an additional point. Both of these parsers do something none of the other implementations do: they validate that the CSV is well formed. If one finds that the file no longer conforms to the correct CSV layout, it makes a choice: either throw an exception, or guess and continue on (based on what the user requested). While the Nim implementation does handle escaped quotes (and newlines, unlike fast csv), its parsing assumes the file is well formed, and std.csv was quite prompt to point out that this file is indeed not well formed. Even though the issue can be ignored, the overhead of parsing to identify issues still remains.
I haven't attempted to write the algorithm assuming a properly structured file, so I don't know what the performance would look like, but I suspect the difference isn't negligible. There is also likely some overhead from providing the tokens through range interfaces.

On another note, this slicing version of the CSV parser can and should be included in std.csv as a specialization. But it is by no means ready. The feature requirements need to be spelled out better (hasSlicing!Range fails for strings, yet strings are the primary use-case for the optimization), and escaped quotes remain in the returned data (like I said, proper parsing requires allocation).

1. http://forum.dlang.org/post/chvukhbscgamxecvpwlw@forum.dlang.org
2. https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
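[For readers unfamiliar with the slicing approach, here is a rough, hypothetical sketch of a slicing tokenizer in the spirit of the above; it is not the actual std.csv or csvslicing code. Each field is a slice of the input, so no per-field allocation happens, but the scan still has to look at every byte, which is where the countUntil time in the profile comes from:

```d
// Hypothetical sketch of a slicing tokenizer; not the actual std.csv
// or csvslicing implementation. Fields are slices of the input string,
// so no field data is copied or allocated.
import std.algorithm.searching : countUntil;
import std.string : representation;

string[] splitRecord(string line, char sep = ',')
{
    string[] fields;
    auto rest = line;
    for (;;)
    {
        // Scan bytes until the separator (representation avoids
        // auto-decoding). This per-byte scan is the hot loop that a
        // profiler attributes to countUntil/startsWith.
        immutable i = rest.representation.countUntil(cast(ubyte) sep);
        if (i < 0)
        {
            fields ~= rest;           // last field: a slice, not a copy
            return fields;
        }
        fields ~= rest[0 .. i];
        rest = rest[i + 1 .. $];
    }
}
```

For example, splitRecord("a,b,c") yields the slices "a", "b", "c". Note the sketch does no quote handling or validation, which is exactly the extra work std.csv performs.]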
June 03, 2017 Re: std.csv Performance Review
Posted in reply to Jesse Phillips

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> I compared these two: LDC -O3 -release

Quick note: keep in mind that LDC does not yet do cross-module inlining (of non-template functions) by default. It's good to check whether you see big differences with `-enable-cross-module-inlining`.

-Johan
June 03, 2017 Re: std.csv Performance Review
Posted in reply to Jesse Phillips

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> Author here:
>
> The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task.
>
> Now I know std.csv isn't fast and it allocates. When I wrote my CSV parser, I'd also left around a parser which used slicing instead of allocation[3].

Do you know what happened with fastcsv [0]? Original thread: [1].

[0] https://github.com/quickfur/fastcsv
[1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
June 04, 2017 Re: std.csv Performance Review
Posted in reply to bachmeier

On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> Do you know what happened with fastcsv [0], original thread [1].
>
> [0] https://github.com/quickfur/fastcsv
> [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
I do not. Rereading that in light of this new article, I'm a little sceptical of the 51 times faster, since I'm seeing only 10x against these other implementations.

In other news, I'm not sure the validation std.csv is doing is worth it.
June 03, 2017 Re: std.csv Performance Review
Posted in reply to Jesse Phillips

On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d wrote:
> On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> > Do you know what happened with fastcsv [0], original thread [1].
> >
> > [0] https://github.com/quickfur/fastcsv
> > [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
>
> I do not. Rereading that in light of this new article I'm a little sceptical of the 51 times faster, since I'm seeing only 10x against these other implications.
[...]

You don't have to be skeptical, nor do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no validation is done, etc.), which are also documented up-front in the README file. I wrote the code targeting a specific use case mentioned by the OP of the original thread, so I do not expect or claim you will see the same kind of performance for other use cases. If you want validation, then it's a given that you won't get maximum performance, simply because there's just more work to do.

For data sets that don't fit into memory, I already have some ideas about how to extend my algorithm to handle them, so some of the performance may still be retained. But obviously it's not going to be as fast as when you can just read the entire file into memory first.

(Note that this is much less of a limitation than it seems; for example, you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged from disk, but IME this is often much faster than if you read the data into memory yourself. Again, you don't have to believe me: the fastcsv code is there; just import std.mmfile, mmap the largest CSV file you can find, call fastcsv on it, and measure the performance yourself. If your code performs better, great, tell us all about it. I'm interested to learn how you did it.)

Note that besides slicing, another major part of the performance boost in fastcsv is minimizing GC allocations. If you allocate a string for each field in a row, it will be much slower than if you either sliced the original string, or allocated a large buffer for holding the data and just took slices for each field. Furthermore, if you allocate a new array per row to hold the list of fields, it will be much slower than if you allocate a large array for holding all the fields of all the rows, and merely slice this array to get your rows.

Of course, you cannot know ahead of time exactly how many rows there will be, so the next best thing is to allocate a series of large arrays, each capable of holding the field slices of k rows, for sufficiently large k. Once the current array runs out of space, copy the (partial) slices of the last row to the beginning of a new large array, and continue from there. This way, you will be making n/k allocations, where n is the number of rows and k is the number of rows that fit into each buffer, as opposed to n allocations. For large values of k, this greatly reduces the GC load and significantly speeds things up. Again, don't take my word for it. Run a profiler on the fastcsv code and see for yourself.

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
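[The buffer-chunking strategy described above can be sketched roughly as follows; this is a hypothetical illustration with made-up names, not the actual fastcsv code:

```d
// Hypothetical sketch of the chunked row buffer described above, not
// the actual fastcsv implementation. Field slices for many rows share
// one large backing array, so the GC sees roughly n/k allocations
// instead of n (one per row). Assumes a row has fewer fields than
// chunkSize.
struct RowBuffer
{
    enum size_t chunkSize = 64 * 1024;  // slots per chunk (the "k" factor)
    string[] chunk;                     // current backing array
    size_t pos;                         // next free slot in chunk
    size_t rowStart;                    // where the current row began

    void put(string field)
    {
        if (pos == chunk.length)
        {
            // Chunk full (or first use): start a new chunk, copying only
            // the partial current row so its slices stay contiguous.
            auto next = new string[chunkSize];
            next[0 .. pos - rowStart] = chunk[rowStart .. pos];
            pos -= rowStart;
            rowStart = 0;
            chunk = next;
        }
        chunk[pos++] = field;
    }

    // Finish the current row; the returned row is a slice of the chunk.
    string[] endRow()
    {
        auto row = chunk[rowStart .. pos];
        rowStart = pos;
        return row;
    }
}
```

A caller would put() each field slice as it is scanned and call endRow() at each record boundary; only when a chunk fills does a new allocation happen.]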
June 04, 2017 Re: std.csv Performance Review
Posted in reply to H. S. Teoh

On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
> (Note that this is much less of a limitation than it seems; for example you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged from disk, but IME this is often much faster than if you read the data into memory yourself.)

If the file is in the kernel's file cache, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() in general only changes the sharing of the pages. That's where most of the performance win from memory mapping comes from.

This stackoverflow discussion [1] links to a realworldtech discussion with Linus Torvalds explaining it in detail. On Windows and Solaris the mechanism is the same.

[1] https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
June 04, 2017 Re: std.csv Performance Review
Posted in reply to Patrick Schluter

On Sunday, 4 June 2017 at 06:54:46 UTC, Patrick Schluter wrote:
> If the file is in the file cache of the kernel, memory mapping does not need to reload the file as it is already in memory. In fact, calling mmap() changes only the sharing of the pages in general. That's where most of the performance win from memory mapping comes from.

To be precise, it's the copying of data that is spared by mmap. If the file is in the file cache, the open/read/write/close syscalls will also be fed from the memory-mapped cache entry, but this requires that the data be copied from the kernel's memory space to the process's buffer space. So each call to read() has to do this copying. The gain from mmap thus comes from avoiding that memory copy and avoiding the read/write/seek syscalls. The loading of the physical file into memory is the same in both cases.

> This stackoverflow [1] discussion links to a realworldtech discussion with Linus Torvalds explaining it in detail. On Windows and Solaris the mechanism is the same.
>
> [1] https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
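[As a concrete illustration of the std.mmfile approach discussed above, here is a minimal sketch; std.mmfile.MmFile is real, but csvToSlices is a made-up stand-in for any fastcsv-style slicing parser:

```d
// Sketch of memory-mapping a CSV file and slicing it in place, as
// suggested in the discussion above. The kernel pages the file in on
// demand, so it need not fit in physical memory all at once.
import std.mmfile : MmFile;

void main()
{
    auto mm = new MmFile("data.csv");        // read-only mapping
    auto text = cast(const(char)[]) mm[];    // zero-copy view of the file

    // A slicing parser can now take sub-slices of `text` without
    // copying any field data. (csvToSlices is hypothetical, standing
    // in for a fastcsv-style parser.)
    // auto rows = csvToSlices(text);
}
```

Because the slices point into the mapping, they stay valid only as long as the MmFile object is alive, which is the design trade-off of this approach.]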
June 04, 2017 Re: std.csv Performance Review
Posted in reply to Jesse Phillips

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> Even though the issue can be ignored, the overhead of parsing to identify issues still remains. I haven't attempted write the algorithm assuming proper data structure so I don't know what the performance would look like, but I suspect it isn't negligible. There is also likely some overhead for providing the tokens through range interfaces.
The immediate idea is to have your cake and eat it too with parseXML!(Yes.validate)(...), but that's more work.
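[That compile-time toggle could look roughly like this; a hypothetical sketch using std.typecons.Flag, since std.csv does not currently expose such a parameter:

```d
// Hypothetical sketch of a compile-time validation switch, in the
// spirit of the parseXML!(Yes.validate) idea above. std.csv does not
// actually offer such a flag; parseCsv and its body are made up.
import std.typecons : Flag, Yes, No;

auto parseCsv(Flag!"validate" validate = Yes.validate)(string input)
{
    static if (validate)
    {
        // Full well-formedness checking: quoting, field counts, etc.
        // (validation logic elided in this sketch)
    }
    else
    {
        // Fast path: assume well-formed input and slice without checks.
    }
    return input;  // placeholder result for the sketch
}
```

Callers who trust their data would then opt out at compile time with parseCsv!(No.validate)(text), paying nothing for the validation branch.]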
June 04, 2017 Re: std.csv Performance Review
Posted in reply to H. S. Teoh

On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
> On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d wrote:
>> On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
>> > Do you know what happened with fastcsv [0], original thread [1].
>> >
>> > [0] https://github.com/quickfur/fastcsv
>> > [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
>>
>> I do not. Rereading that in light of this new article I'm a little sceptical of the 51 times faster, since I'm seeing only 10x against these other implications.
> [...]
>
> You don't have to be skeptical, neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.
Ok, I took you up on that, I'm still skeptical:
LDC2 -O3 -release -enable-cross-module-inlining
std.csv: 12487 msecs
fastcsv (no gc): 1376 msecs
csvslicing: 3039 msecs
That looks like about 10 times faster to me. Using the slicing version failed because of \r\n line endings (I guess multi-part separators are broken), so I changed the data file to get the execution time.
Anyway, I'm not trying to claim fastcsv isn't good at what it does, all I'm trying to point out is std.csv is doing more work than these faster csv parsers. And I don't even want to claim that std.csv is better because of that work, it actually appears that it was a mistake to do validation.
June 04, 2017 Re: std.csv Performance Review
Posted in reply to Jesse Phillips

On Sunday, 4 June 2017 at 15:59:03 UTC, Jesse Phillips wrote:
> On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
>> [...]
>
> Ok, I took you up on that, I'm still skeptical:
>
> LDC2 -O3 -release -enable-cross-module-inlining
>
> std.csv: 12487 msecs
> fastcsv (no gc): 1376 msecs
> csvslicing: 3039 msecs
>
> That looks like about 10 times faster to me. Using the slicing version failed because of \r\n line endings (guess multi-part separators is broken) I changed the data file so I could get the execution time.
>
> Anyway, I'm not trying to claim fastcsv isn't good at what it does, all I'm trying to point out is std.csv is doing more work than these faster csv parsers. And I don't even want to claim that std.csv is better because of that work, it actually appears that it was a mistake to do validation.

In case you have time, it would be very interesting to compare it with other state-of-the-art tools like ParaText: http://www.wise.io/tech/paratext
Copyright © 1999-2021 by the D Language Foundation