Reserving/Preallocating associative array?
December 24, 2013
Hello,

I want to load a large text file containing two numeric fields into an associative array.
The file looks like:
   1   40
   4   2
   42  11
   ...

And has 11M lines.

My code looks like this:
===
import std.stdio, std.conv, std.string;

void main()
{
        size_t[size_t] unions;
        auto f = File("input.txt");
        foreach (line; f.byLine()) {
                auto fields = line.split();
                size_t i = to!size_t(fields[0]);
                size_t j = to!size_t(fields[1]);
                unions[i] = j; // <-- here be question
        }
}
===

This is just test code to illustrate my question (though general comments are welcome - I'm new to D).

With the highlighted line commented out (not populating the hash), the program completes in 25 seconds.
With the line included, it takes ~3.5 minutes.

Is there a way to speed up the loading? Perhaps by reserving memory in the hash before populating it? Or another trick?

Many thanks,
 -gordon
December 24, 2013
Gordon:

> void main()
> {
>         size_t[size_t] unions;
>         auto f = File("input.txt");
>         foreach ( line ; f.byLine() ) {
>                 auto fields = line.split();
>                 size_t i = to!size_t(fields[0]);
>                 size_t j = to!size_t(fields[1]);
>                 unions[i] = j; // <-- here be question
>         }
> }
> ...
> Is there a way to speed the loading? perhaps reserving memory in the hash before populating it? Or another trick?

Built-in associative arrays don't yet have a "reserve".

Some ideas to speed up your code:
- Try to use parse instead of split + to!size_t.
- Try to use byLineFast, attached here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
- Try to disable the GC before building the associative array and re-enable it once it's built (this could double your max memory usage or more). To do so, import GC from core.memory and call GC.disable and GC.enable.
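For the third point, a minimal sketch of the GC toggle (file name and key/value types assumed from the original post; scope(exit) is my addition, so the collector is re-enabled even if parsing throws):

```d
import core.memory : GC;
import std.stdio : File;

void main()
{
    size_t[size_t] unions;

    GC.disable();
    scope(exit) GC.enable(); // runs on normal exit and on exception

    foreach (line; File("input.txt").byLine())
    {
        // ... parse the two fields and insert into unions, as in the original loop ...
    }
}
```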

Bye,
bearophile
December 24, 2013
Gordon:

> The file looks like:
>    1   40
>    4   2
>    42  11
>    ...
>
> And has 11M lines.

What's the interval of the numbers of the first column?

Bye,
bearophile
December 24, 2013
> Some ideas to speed up your code:
> - Try to use parse instead of split + to!size_t.
> - Try to use byLineFast, attached here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
> - Try to disable the GC before building the associative array and re-enable it once it's built (this could double your max memory usage or more). To do so, import GC from core.memory and call GC.disable and GC.enable.


This code is almost 2.5X faster on my PC (but I am using only 300_000 lines of text):

void main() {
    import std.stdio, std.conv, std.string, core.memory;
    import bylinefast;

    GC.disable;
    size_t[size_t] unions;
    foreach (line; "input.txt".File.byLineFast) {
        line.munch(" \t"); // skip ws
        immutable i = line.parse!size_t;
        line.munch(" \t"); // skip ws
        immutable j = line.parse!size_t;
        unions[i] = j;
    }
    GC.enable;
}


Bye,
bearophile
December 24, 2013
Hello bearophile,

On Tuesday, 24 December 2013 at 23:13:59 UTC, bearophile wrote:
>> Some ideas to speed up your code:
>> - Try to use parse instead of split + to!size_t.
>> - Try to use byLineFast, attached here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
>> - Try to disable the GC before building the associative array and re-enable it once it's built (this could double your max memory usage or more). To do so, import GC from core.memory and call GC.disable and GC.enable.
>
>
> This code is almost 2.5X faster on my PC (but I am using only 300_000 lines of text):
>

You're just too fast for me...

After incorporating your three suggestions, the entire file (11M lines) loads in just 25 seconds (down from 3.5 minutes).

AMAZING!

Many thanks,
 -gordon
December 24, 2013
On 12/24/13 2:28 PM, Gordon wrote:
> Hello,
>
> I want to load a large text file containing two numeric fields into an
> associative array.
> The file looks like:
>     1   40
>     4   2
>     42  11
>     ...
>
> And has 11M lines.
>
> My code looks like this:
> ===
> void main()
> {
>          size_t[size_t] unions;
>          auto f = File("input.txt");
>          foreach ( line ; f.byLine() ) {
>                  auto fields = line.split();
>                  size_t i = to!size_t(fields[0]);
>                  size_t j = to!size_t(fields[1]);
>                  unions[i] = j; // <-- here be question
>          }
> }
> ===
>
> This is just a test code to illustrate my question (though general
> comments are welcomed - I'm new to D).
>
> Commenting out the highlighted line (not populating the hash), the
> program completes in 25 seconds.
> Compiling with the highlighted line, the program takes ~3.5 minutes.
>
> Is there a way to speed the loading? perhaps reserving memory in the
> hash before populating it? Or another trick?

void main()
{
    import std.algorithm : sort, uniq;
    import std.file : slurp;

    size_t[size_t] unions;
    foreach (e; "input.txt"
            .slurp!(size_t, size_t)("%s %s").sort.uniq)
    {
        unions[e[0]] = e[1];
    }
}


Andrei

December 25, 2013
Out of curiosity, do you know which of his three suggestions brought you the highest speed boost? What happens if you do not disable the GC, for example?


December 25, 2013
On Wednesday, 25 December 2013 at 08:27:53 UTC, Philippe Sigaud wrote:
> Out of curiosity, do you know which one of his 3 suggestions brought you
> the highest speed boost? What happens if you do not disable the GC, for
> example?

Good question; I tested each one incrementally:

1. Switching from "split+to!size_t" to "parse" (and commenting out the "union") reduced running time from ~26s to ~20s (on average).

2. Switching from "byLine" to "byLineFast" (and commenting out the "union") reduced running time from ~20s to ~14s (on average).

3. With "parse" and "byLineFast", the GC still on, and populating the "union", the program took about 2m45s.

4. Adding "GC.disable" brought it down to 25s.

HTH,
 -gordon
December 25, 2013
On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote:
>
> void main()
> {
>     size_t[size_t] unions;
>     foreach (e; "input.txt"
>             .slurp!(size_t, size_t)("%s %s").sort.uniq ) {
>         unions[e[0]] = e[1];
>     }
> }
>

Thanks for this very elegant solution!

For completeness (since we're dealing with timing):

1. Running the above code with garbage collection enabled takes 1m45s.

2. Running it with GC disabled takes 50s.

3. Running with GC disabled and without sort+uniq (my data is already uniq'd) takes 40s.


This is beautiful, idiomatic D code, but for now I think I'll stick with the more verbose code above.

One reason is that I want to provide helpful, verbose error messages for invalid input (e.g. "Invalid numeric value found in field 3, line 1000") - and with the "slurp" method, any input error results in a not-so-helpful exception (e.g. "std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(2009): Unexpected 'H' when converting from type char[] to type ulong").
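For reference, that kind of per-line error reporting can be layered onto the manual loop by tracking a line counter and catching ConvException; a hedged sketch (the message wording and variable names are mine, not from the thread):

```d
import std.conv : to, ConvException;
import std.stdio : File, stderr;
import std.string : split;

void main()
{
    size_t[size_t] unions;
    size_t lineNo = 0;

    foreach (line; File("input.txt").byLine())
    {
        ++lineNo;
        auto fields = line.split();
        if (fields.length < 2)
        {
            stderr.writefln("Expected 2 fields on line %s, got %s",
                            lineNo, fields.length);
            continue;
        }
        try
        {
            immutable i = fields[0].to!size_t;
            immutable j = fields[1].to!size_t;
            unions[i] = j;
        }
        catch (ConvException)
        {
            // Report the offending line instead of dying on a bare ConvException.
            stderr.writefln("Invalid numeric value on line %s: '%s'", lineNo, line);
        }
    }
}
```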

Many thanks,
 -gordon
December 25, 2013
On 12/25/13 8:29 AM, Gordon wrote:
> On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote:
>>
>> void main()
>> {
>>     size_t[size_t] unions;
>>     foreach (e; "input.txt"
>>             .slurp!(size_t, size_t)("%s %s").sort.uniq ) {
>>         unions[e[0]] = e[1];
>>     }
>> }
>>
>
> Thanks for this very elegant solution!
>
> For completeness (since we're dealing with timing):
>
> 1. Running the above code with Garbage-collection enabled, takes 1m45s.
>
> 2. Running it with GC disabled takes 50s .
>
> 3. Running with GC disabled and without sort+uniq (my data is already
> uniq'd), takes 40s.

Thanks for the numbers!

> This is a beautiful idiomatic D code, but for now I think I'll stick
> with the more verbose code above.
>
> One reason is that I want to provide helpful, verbose error messages
> for invalid input (e.g. "Invalid numeric value found in field 3, line
> 1000") - and with the "slurp" method, any input error results in a
> not-so-helpful exception (e.g.
> "std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(2009):
> Unexpected 'H' when converting from type char[] to type ulong").

Thanks, I added https://d.puremagic.com/issues/show_bug.cgi?id=11816 to keep track of this.


Andrei

