February 03, 2014 Performant method for reading huge text files
I'm running into a problem I've come across before but never found a satisfactory solution for. There's a pretty large ASCII file I need to process, currently about 3 GB, but the size will increase in the future. D's ranges in combination with std.algorithm are simply perfect for what I'm doing, and it's trivial to write nice code which doesn't load the entire file into memory. The problem is speed. I'm using LockingTextReader in std.stdio, but it's not nearly fast enough. On my system it only reads about 3 MB/s with one core spending all its time in I/O calls. Sadly I need to support 32 bit, so memory-mapped files aren't an option. Does anyone know of a way to increase throughput while still allowing me to use a range API?
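For reference, the kind of lazy pipeline described here might look roughly like this. This is only a sketch: the file name is made up and the actual processing isn't shown in the thread; it just illustrates LockingTextReader composing with std.algorithm without loading the file into memory.

```d
import std.algorithm : count, filter;
import std.stdio : File, LockingTextReader, writeln;

void main()
{
    // LockingTextReader exposes the file as a lazy input range of
    // characters; nothing below pulls the whole 3 GB file into memory.
    auto chars = LockingTextReader(File("data.txt"));

    // Any std.algorithm chain composes with it, e.g. counting newlines:
    writeln(chars.filter!(c => c == '\n').count);
}
```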
February 03, 2014 Re: Performant method for reading huge text files
Posted in reply to Rene Zwanenburg

Rene Zwanenburg:

> The problem is speed. I'm using LockingTextReader in std.stdio, but it's not nearly fast enough. On my system it only reads about 3 MB/s with one core spending all its time in I/O calls.

Are you reading the text by lines? In Bugzilla there is a byLineFast:
https://d.puremagic.com/issues/show_bug.cgi?id=11810

Bye,
bearophile
February 04, 2014 Re: Performant method for reading huge text files
Posted in reply to bearophile

On Monday, 3 February 2014 at 23:50:54 UTC, bearophile wrote:
> Rene Zwanenburg:
>
>> The problem is speed. I'm using LockingTextReader in std.stdio, but it's not nearly fast enough. On my system it only reads about 3 MB/s with one core spending all its time in I/O calls.
>
> Are you reading the text by lines? In Bugzilla there is a byLineFast:
> https://d.puremagic.com/issues/show_bug.cgi?id=11810
>
> Bye,
> bearophile
Nope, I'm feeding it to csvReader, which uses an input range of characters. Come to think of it...

Well, this is embarrassing, I've been sloppy with my profiling :). It appears the time is actually spent converting strings to doubles, done by csvReader to read a row into my Record struct. No way to speed that up, I suppose. Still, I find it surprising that parsing doubles is so slow.
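One way to confirm where the time goes is to benchmark the string-to-double conversion in isolation. A minimal sketch (the literal and the iteration count are arbitrary, and this uses the std.datetime.stopwatch module from newer Phobos):

```d
import std.conv : to;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

void main()
{
    enum n = 10_000_000;
    double sum = 0;

    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        sum += to!double("12345.6789"); // the same work csvReader does per field
    sw.stop();

    // Printing the sum keeps the compiler from eliding the loop.
    writeln(sum, " -- parsed ", n, " doubles in ",
            sw.peek.total!"msecs", " ms");
}
```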
February 05, 2014 Re: Performant method for reading huge text files
Posted in reply to Rene Zwanenburg

On Tuesday, 4 February 2014 at 00:04:23 UTC, Rene Zwanenburg wrote:
> On Monday, 3 February 2014 at 23:50:54 UTC, bearophile wrote:
>> Rene Zwanenburg:
>>
>>> The problem is speed. I'm using LockingTextReader in std.stdio, but it's not nearly fast enough. On my system it only reads about 3 MB/s with one core spending all its time in I/O calls.
>>
>> Are you reading the text by lines? In Bugzilla there is a byLineFast:
>> https://d.puremagic.com/issues/show_bug.cgi?id=11810
>>
>> Bye,
>> bearophile
>
> Nope, I'm feeding it to csvReader, which uses an input range of characters. Come to think of it...
>
> Well, this is embarrassing, I've been sloppy with my profiling :). It appears the time is actually spent converting strings to doubles, done by csvReader to read a row into my Record struct. No way to speed that up, I suppose. Still, I find it surprising that parsing doubles is so slow.
Parsing should be faster than I/O. Set up two buffers and have one thread reading into buffer A while you parse buffer B with a second thread.
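Phobos can provide this kind of overlap without hand-written thread code: taskPool.asyncBuf in std.parallelism reads ahead on a worker thread while the main thread consumes. A sketch along those lines (the file name and Record layout are hypothetical; byLineCopy is used so the background thread never recycles a buffer the parser still holds):

```d
import std.algorithm : joiner;
import std.csv : csvReader;
import std.parallelism : taskPool;
import std.stdio : File, writeln;

// Hypothetical record layout; the real fields aren't shown in the thread.
struct Record
{
    string name;
    double value;
}

void main()
{
    // asyncBuf keeps up to 100 lines buffered ahead of the consumer,
    // reading on a worker thread while the main thread parses.
    auto lines = taskPool.asyncBuf(File("data.csv").byLineCopy, 100);

    // Glue the lines back into one character range and parse as CSV.
    foreach (record; lines.joiner("\n").csvReader!Record)
        writeln(record.value);
}
```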
February 05, 2014 Re: Performant method for reading huge text files
Posted in reply to Chris Williams

> Parsing should be faster than I/O. Set up two buffers and have one thread reading into buffer A while you parse buffer B with a second thread.
...and then flip buffers whenever the slower of the two has completed.
February 05, 2014 Re: Performant method for reading huge text files
Posted in reply to Rene Zwanenburg

You can also try a BufferedRange:
http://forum.dlang.org/thread/l9q66g$2he3$1@digitalmars.com
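The BufferedRange in that thread isn't part of Phobos. A crude stand-in with the same intent is to read the file in large chunks and flatten them into one lazy range, so there is one read call per chunk instead of per character (a sketch; the file name and chunk size are arbitrary):

```d
import std.algorithm : joiner;
import std.stdio : File, writeln;

void main()
{
    // byChunk issues one read per 64 KiB instead of per character;
    // joiner then flattens the chunks into a single lazy byte range.
    auto bytes = File("data.csv")
        .byChunk(64 * 1024)
        .joiner;

    size_t count;
    foreach (b; bytes)
        ++count;
    writeln(count, " bytes");
}
```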
February 07, 2014 Re: Performant method for reading huge text files
Posted in reply to Rene Zwanenburg

On Tue, 04 Feb 2014 00:04:22 +0000, "Rene Zwanenburg" <renezwanenburg@gmail.com> wrote:

> On Monday, 3 February 2014 at 23:50:54 UTC, bearophile wrote:
>> Rene Zwanenburg:
>>
>>> The problem is speed. I'm using LockingTextReader in std.stdio, but it's not nearly fast enough. On my system it only reads about 3 MB/s with one core spending all its time in I/O calls.
>>
>> Are you reading the text by lines? In Bugzilla there is a byLineFast:
>> https://d.puremagic.com/issues/show_bug.cgi?id=11810
>>
>> Bye,
>> bearophile
>
> Nope, I'm feeding it to csvReader, which uses an input range of characters. Come to think of it...
>
> Well, this is embarrassing, I've been sloppy with my profiling :). It appears the time is actually spent converting strings to doubles, done by csvReader to read a row into my Record struct. No way to speed that up, I suppose. Still, I find it surprising that parsing doubles is so slow.

Parsing textual representations of numbers is slow; the other way around is faster. You have to check all kinds of things: a preceding +/-, whether it starts with a dot, whether all characters are '0' to '9', whether there is an exponent, whether it is "NaN" or "nan". Floating point math is slow, but if you store the intermediate results in an integer while parsing, you may run out of digits when the number string is long. On the other hand, repeated floating point math introduces some error as you append digits.

Here is the ~400-line version in Phobos:
https://github.com/D-Programming-Language/phobos/blob/master/std/conv.d#L2250

-- Marco
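To make the integer-accumulator trade-off concrete, here is a deliberately naive parser that handles only plain "digits.digits" input, with none of the sign/exponent/NaN checks listed above. It is a sketch only: a long digit string overflows the long accumulator, which is exactly the limitation described.

```d
import std.stdio : writeln;

/// Parses a plain "123.456" string: no sign, no exponent, no NaN/inf,
/// no validation. Digits accumulate in a long and are scaled once at
/// the end, so no rounding error creeps in per digit -- but a long run
/// of digits overflows the accumulator, as the post points out.
double naiveParse(const(char)[] s)
{
    long mantissa = 0;
    long scale = 1;
    bool afterDot = false;

    foreach (c; s)
    {
        if (c == '.')
        {
            afterDot = true;
            continue;
        }
        mantissa = mantissa * 10 + (c - '0');
        if (afterDot)
            scale *= 10;
    }
    return cast(double) mantissa / scale;
}

void main()
{
    writeln(naiveParse("12345.6789")); // prints 12345.6789
}
```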