Reserving/Preallocating associative array?
December 24, 2013
Hello,

I want to load a large text file containing two numeric fields into an associative array.
The file looks like:
   1   40
   4   2
   42  11
   ...

And has 11M lines.

My code looks like this:
===
import std.stdio, std.conv, std.string;

void main()
{
        size_t[size_t] unions;
        auto f = File("input.txt");
        foreach (line; f.byLine()) {
                auto fields = line.split();
                size_t i = to!size_t(fields[0]);
                size_t j = to!size_t(fields[1]);
                unions[i] = j; // <-- here be question
        }
}
===

This is just test code to illustrate my question (though general comments are welcome - I'm new to D).

With the highlighted line commented out (not populating the hash), the program completes in 25 seconds.
With the line included, it takes ~3.5 minutes.

Is there a way to speed up the loading? Perhaps by reserving memory in the hash before populating it? Or another trick?

Many thanks,
 -gordon
December 24, 2013
Gordon:

> void main()
> {
>         size_t[size_t] unions;
>         auto f = File("input.txt");
>         foreach ( line ; f.byLine() ) {
>                 auto fields = line.split();
>                 size_t i = to!size_t(fields[0]);
>                 size_t j = to!size_t(fields[1]);
>                 unions[i] = j; // <-- here be question
>         }
> }
> ...
> Is there a way to speed the loading? perhaps reserving memory in the hash before populating it? Or another trick?

Built-in associative arrays don't yet have a "reserve".

Some ideas to speed up your code:
- Try to use parse instead of split + to!size_t.
- Try to use byLineFast, attached here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
- Try to disable the GC before building the associative array and re-enable it once it's built (this could double your max memory usage or more). To do so, import GC from core.memory and call GC.disable and GC.enable.
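For the third point, a minimal sketch of the GC toggle (file name and key/value types assumed from the original post; scope(exit) is my addition, so the collector is re-enabled even if parsing throws):

```d
import core.memory : GC;
import std.stdio : File;

void main()
{
    size_t[size_t] unions;

    GC.disable();
    scope(exit) GC.enable(); // runs on normal exit and on exception

    foreach (line; File("input.txt").byLine())
    {
        // ... parse the two fields and insert into unions, as in the original loop ...
    }
}
```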

Bye,
bearophile
December 24, 2013
Gordon:

> The file looks like:
>    1   40
>    4   2
>    42  11
>    ...
>
> And has 11M lines.

What's the interval of the numbers of the first column?

Bye,
bearophile
December 24, 2013
> Some ideas to speed up your code:
> - Try to use parse instead of split + to!size_t.
> - Try to use byLineFast, attached here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
> - Try to disable the GC before building the associative array and re-enable it once it's built (this could double your max memory usage or more). To do so, import GC from core.memory and call GC.disable and GC.enable.


This code is almost 2.5X faster on my PC (but I am using only 300_000 lines of text):

void main() {
    import std.stdio, std.conv, std.string, core.memory;
    import bylinefast;

    GC.disable;
    size_t[size_t] unions;
    foreach (line; "input.txt".File.byLineFast) {
        line.munch(" \t"); // skip ws
        immutable i = line.parse!size_t;
        line.munch(" \t"); // skip ws
        immutable j = line.parse!size_t;
        unions[i] = j;
    }
    GC.enable;
}


Bye,
bearophile
December 24, 2013
Hello bearophile,

On Tuesday, 24 December 2013 at 23:13:59 UTC, bearophile wrote:
>> Some ideas to speed up your code:
>> - Try to use parse instead of split + to!size_t.
>> - Try to use byLineFast, attached here (the code is not mine): https://d.puremagic.com/issues/attachment.cgi?id=1305
>> - Try to disable the GC before building the associative array and re-enable it once it's built (this could double your max memory usage or more). To do so, import GC from core.memory and call GC.disable and GC.enable.
>
>
> This code is almost 2.5X faster on my PC (but I am using only 300_000 lines of text):
>

You're just too fast for me...

After incorporating your three suggestions, the entire file (11M lines) loads in just 25 seconds (down from 3.5 minutes).

AMAZING!

Many thanks,
 -gordon
December 24, 2013
On 12/24/13 2:28 PM, Gordon wrote:
> Hello,
>
> I want to load a large text file containing two numeric fields into an
> associative array.
> The file looks like:
>     1   40
>     4   2
>     42  11
>     ...
>
> And has 11M lines.
>
> My code looks like this:
> ===
> void main()
> {
>          size_t[size_t] unions;
>          auto f = File("input.txt");
>          foreach ( line ; f.byLine() ) {
>                  auto fields = line.split();
>                  size_t i = to!size_t(fields[0]);
>                  size_t j = to!size_t(fields[1]);
>                  unions[i] = j; // <-- here be question
>          }
> }
> ===
>
> This is just a test code to illustrate my question (though general
> comments are welcomed - I'm new to D).
>
> Commenting out the highlighted line (not populating the hash), the
> program completes in 25 seconds.
> Compiling with the highlighted line, the program takes ~3.5 minutes.
>
> Is there a way to speed the loading? perhaps reserving memory in the
> hash before populating it? Or another trick?

void main()
{
    import std.algorithm : sort, uniq;
    import std.file : slurp;

    size_t[size_t] unions;
    foreach (e; "input.txt"
            .slurp!(size_t, size_t)("%s %s").sort.uniq)
    {
        unions[e[0]] = e[1];
    }
}


Andrei

December 25, 2013
Out of curiosity, do you know which of his three suggestions brought you the highest speed boost? What happens if you do not disable the GC, for example?


December 25, 2013
On Wednesday, 25 December 2013 at 08:27:53 UTC, Philippe Sigaud wrote:
> Out of curiosity, do you know which one of his 3 suggestions brought you
> the highest speed boost? What happens if you do not disable the GC, for
> example?

Good question; I tested each one incrementally:

1. Switching from "split+to!size_t" to "parse" (and commenting out the "union") reduced running time from ~26s to ~20s (on average).

2. Switching from "byLine" to "byLineFast" (and commenting out the "union") reduced running time from ~20s to ~14s (on average).

3. With "parse" and "byLineFast", the GC still on, and populating the "union", the program took about 2m45s.

4. Adding "GC.disable" brought it down to 25s.

HTH,
 -gordon
December 25, 2013
On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote:
>
> void main()
> {
>     size_t[size_t] unions;
>     foreach (e; "input.txt"
>             .slurp!(size_t, size_t)("%s %s").sort.uniq ) {
>         unions[e[0]] = e[1];
>     }
> }
>

Thanks for this very elegant solution!

For completeness (since we're dealing with timing):

1. Running the above code with garbage collection enabled takes 1m45s.

2. Running it with GC disabled takes 50s.

3. Running with GC disabled and without sort+uniq (my data is already uniq'd) takes 40s.


This is beautiful, idiomatic D code, but for now I think I'll stick with the more verbose code above.

One reason is that I want to provide helpful, verbose error messages for invalid input (e.g. "Invalid numeric value found in field 3, line 1000") - and with the "slurp" method, any input error results in a not-so-helpful exception (e.g. "std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(2009): Unexpected 'H' when converting from type char[] to type ulong").
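For reference, that kind of per-line error reporting can be layered onto the manual loop by tracking a line counter and catching ConvException; a hedged sketch (the message wording and variable names are mine, not from the thread):

```d
import std.conv : to, ConvException;
import std.stdio : File, stderr;
import std.string : split;

void main()
{
    size_t[size_t] unions;
    size_t lineNo = 0;

    foreach (line; File("input.txt").byLine())
    {
        ++lineNo;
        auto fields = line.split();
        if (fields.length < 2)
        {
            stderr.writefln("Expected 2 fields on line %s, got %s",
                            lineNo, fields.length);
            continue;
        }
        try
        {
            immutable i = fields[0].to!size_t;
            immutable j = fields[1].to!size_t;
            unions[i] = j;
        }
        catch (ConvException)
        {
            // Report the offending line instead of dying on a bare ConvException.
            stderr.writefln("Invalid numeric value on line %s: '%s'", lineNo, line);
        }
    }
}
```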

Many thanks,
 -gordon
December 25, 2013
On 12/25/13 8:29 AM, Gordon wrote:
> On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote:
>>
>> void main()
>> {
>>     size_t[size_t] unions;
>>     foreach (e; "input.txt"
>>             .slurp!(size_t, size_t)("%s %s").sort.uniq ) {
>>         unions[e[0]] = e[1];
>>     }
>> }
>>
>
> Thanks for this very elegant solution!
>
> For completeness (since we're dealing with timing):
>
> 1. Running the above code with Garbage-collection enabled, takes 1m45s.
>
> 2. Running it with GC disabled takes 50s .
>
> 3. Running with GC disabled and without sort+uniq (my data is already
> uniq'd), takes 40s.

Thanks for the numbers!

> This is a beautiful idiomatic D code, but for now I think I'll stick
> with the more verbose code above.
>
> One reason is that I want to provide helpful, verbose error messages
> for invalid input (e.g. "Invalid numeric value found in field 3, line
> 1000") - and with the "slurp" method, any input error results in a
> not-so-helpful exception (e.g.
> "std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(2009):
> Unexpected 'H' when converting from type char[] to type ulong").

Thanks, I added https://d.puremagic.com/issues/show_bug.cgi?id=11816 to keep track of this.


Andrei

