Thread overview | ||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
December 24, 2013 Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Hello, I want to load a large text file containing two numeric fields into an associative array. The file looks like: 1 40 4 2 42 11 ... And has 11M lines. My code looks like this: === void main() { size_t[size_t] unions; auto f = File("input.txt"); foreach ( line ; f.byLine() ) { auto fields = line.split(); size_t i = to!size_t(fields[0]); size_t j = to!size_t(fields[1]); unions[i] = j; // <-- here be question } } === This is just a test code to illustrate my question (though general comments are welcomed - I'm new to D). Commenting out the highlighted line (not populating the hash), the program completes in 25 seconds. Compiling with the highlighted line, the program takes ~3.5 minutes. Is there a way to speed the loading? perhaps reserving memory in the hash before populating it? Or another trick? Many thanks, -gordon |
December 24, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gordon | Gordon: > void main() > { > size_t[size_t] unions; > auto f = File("input.txt"); > foreach ( line ; f.byLine() ) { > auto fields = line.split(); > size_t i = to!size_t(fields[0]); > size_t j = to!size_t(fields[1]); > unions[i] = j; // <-- here be question > } > } > ... > Is there a way to speed the loading? perhaps reserving memory in the hash before populating it? Or another trick? Built-in associative arrays don't yet have a "reserve". Some ides to speed up your code: - Try to use parse instead of split + to!size_t - Try to use byLineFast, in the attach here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305 - Try to disable the GC before the associative array creation and re-enable it when it's built. (This could double your max memory usage or more). To disable it import GC from core.memory and call GC.disable; and GC.enable; Bye, bearophile |
December 24, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gordon | Gordon:
> The file looks like:
> 1 40
> 4 2
> 42 11
> ...
>
> And has 11M lines.
What's the interval of the numbers of the first column?
Bye,
bearophile
|
December 24, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | > Some ides to speed up your code:
> - Try to use parse instead of split + to!size_t
> - Try to use byLineFast, in the attach here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
> - Try to disable the GC before the associative array creation and re-enable it when it's built. (This could double your max memory usage or more). To disable it import GC from core.memory and call GC.disable; and GC.enable;
This code is almost 2.5X faster on my PC (but I am using only 300_000 lines of text):
void main() {
import std.stdio, std.conv, std.string, core.memory;
import bylinefast;
GC.disable;
size_t[size_t] unions;
foreach (line; "input.txt".File.byLineFast) {
line.munch(" \t"); // skip ws
immutable i = line.parse!size_t;
line.munch(" \t"); // skip ws
immutable j = line.parse!size_t;
unions[i] = j;
}
GC.enable;
}
Bye,
bearophile
|
December 24, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | Hello Bearophine,
On Tuesday, 24 December 2013 at 23:13:59 UTC, bearophile wrote:
>> Some ides to speed up your code:
>> - Try to use parse instead of split + to!size_t
>> - Try to use byLineFast, in the attach here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
>> - Try to disable the GC before the associative array creation and re-enable it when it's built. (This could double your max memory usage or more). To disable it import GC from core.memory and call GC.disable; and GC.enable;
>
>
> This code is almost 2.5X faster on my PC (but I am using only 300_000 lines of text):
>
You're just too fast for me...
After incorporating your three suggestions, the entire file (11M lines) loads in just 25 seconds (down from 3.5 minutes).
AMAZING!
Many thanks,
-gordon
|
December 24, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gordon | On 12/24/13 2:28 PM, Gordon wrote:
> Hello,
>
> I want to load a large text file containing two numeric fields into an
> associative array.
> The file looks like:
> 1 40
> 4 2
> 42 11
> ...
>
> And has 11M lines.
>
> My code looks like this:
> ===
> void main()
> {
> size_t[size_t] unions;
> auto f = File("input.txt");
> foreach ( line ; f.byLine() ) {
> auto fields = line.split();
> size_t i = to!size_t(fields[0]);
> size_t j = to!size_t(fields[1]);
> unions[i] = j; // <-- here be question
> }
> }
> ===
>
> This is just a test code to illustrate my question (though general
> comments are welcomed - I'm new to D).
>
> Commenting out the highlighted line (not populating the hash), the
> program completes in 25 seconds.
> Compiling with the highlighted line, the program takes ~3.5 minutes.
>
> Is there a way to speed the loading? perhaps reserving memory in the
> hash before populating it? Or another trick?
void main()
{
size_t[size_t] unions;
foreach (e; "input.txt"
.slurp!(size_t, size_t)("%s %s").sort.uniq ) {
unions[e[0]] = e[1];
}
}
Andrei
|
December 25, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gordon Attachments:
| Out of curiosity, do you know which one of his 3 suggestions brought you the highest speed boost? What happens if you do not disable the GC, for example? |
December 25, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Philippe Sigaud | On Wednesday, 25 December 2013 at 08:27:53 UTC, Philippe Sigaud wrote:
> Out of curiosity, do you know which one of his 3 suggestions brought you
> the highest speed boost? What happens if you do not disable the GC, for
> example?
Good question, I did test each one incrementally:
1. Switching from "split+to!size_t" to "parse" (and commenting out the "union") reduced running time from ~26s to ~20s (on average).
2. Switching from "byLine" to "byLineFast" (and commenting out the "union") reduced running time from ~20s to ~14s (on average).
3. With "parse" and "byLineFast", and with GC still on, and populating the "union" the program took about 2m45s .
4. Adding "GC.disable" brought it down to 25s.
HTH,
-gordon
|
December 25, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote: > > void main() > { > size_t[size_t] unions; > foreach (e; "input.txt" > .slurp!(size_t, size_t)("%s %s").sort.uniq ) { > unions[e[0]] = e[1]; > } > } > Thanks for this very elegant solution! For completeness (since we're dealing with timing): 1. Running the above code with Garbage-collection enabled, takes 1m45s. 2. Running it with GC disabled takes 50s . 3. Running with GC disabled and without sort+uniq (my data is already uniq'd), takes 40s. This is a beautiful idiomatic D code, but for now I think I'll stick with the more verbose code above. Ine reason is that I want to provide helpful and verbose error messages for invalid input (e.g. "Invalid numeric value found in field 3 line 1000") - and with the "slurp" method, any input error will result in a not-so-helpful exception (e.g. "std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(2009): Unexpected 'H' when converting from type char[] to type ulong). Many thanks, -gordon |
December 25, 2013 Re: Reserving/Preallocating associative array? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gordon | On 12/25/13 8:29 AM, Gordon wrote: > On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote: >> >> void main() >> { >> size_t[size_t] unions; >> foreach (e; "input.txt" >> .slurp!(size_t, size_t)("%s %s").sort.uniq ) { >> unions[e[0]] = e[1]; >> } >> } >> > > Thanks for this very elegant solution! > > For completeness (since we're dealing with timing): > > 1. Running the above code with Garbage-collection enabled, takes 1m45s. > > 2. Running it with GC disabled takes 50s . > > 3. Running with GC disabled and without sort+uniq (my data is already > uniq'd), takes 40s. Thanks for the numbers! > This is a beautiful idiomatic D code, but for now I think I'll stick > with the more verbose code above. > > Ine reason is that I want to provide helpful and verbose error messages > for invalid input (e.g. "Invalid numeric value found in field 3 line > 1000") - and with the "slurp" method, any input error will result in a > not-so-helpful exception (e.g. > "std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(2009): > Unexpected 'H' when converting from type char[] to type ulong). Thanks, I added https://d.puremagic.com/issues/show_bug.cgi?id=11816 to keep track of this. Andrei |
Copyright © 1999-2021 by the D Language Foundation