December 27, 2013
Am Wed, 25 Dec 2013 16:12:09 +0000
schrieb "Gordon" <me@home.com>:

> On Wednesday, 25 December 2013 at 08:27:53 UTC, Philippe Sigaud wrote:
> > Out of curiosity, do you know which one of his 3 suggestions
> > brought you
> > the highest speed boost? What happens if you do not disable the
> > GC, for
> > example?
> 
> Good question, I did test each one incrementally:
> 
> 1. Switching from "split+to!size_t" to "parse" (and commenting out the "union") reduced running time from ~26s to ~20s (on average).
> 
> 2. Switching from "byLine" to "byLineFast" (and commenting out the "union") reduced running time from ~20s to ~14s (on average).
> 
> 3. With "parse" and "byLineFast", and with GC still on, and populating the "union" the program took about 2m45s .
> 
> 4. Adding "GC.disable" brought it down to 25s.
> 
> HTH,
>   -gordon

So for the record: the garbage collector slowed down this piece of code by a factor of 6.6.

-- 
Marco

December 27, 2013
Am 25.12.2013 17:12, schrieb Gordon:
> Good question, I did test each one incrementally:
>
> 1. Switching from "split+to!size_t" to "parse" (and commenting out the
> "union") reduced running time from ~26s to ~20s (on average).
>
> 2. Switching from "byLine" to "byLineFast" (and commenting out the
> "union") reduced running time from ~20s to ~14s (on average).
>
> 3. With "parse" and "byLineFast", and with GC still on, and populating
> the "union" the program took about 2m45s .
>
> 4. Adding "GC.disable" brought it down to 25s.
>
> HTH,
>   -gordon


So I was interested in how far you can take this. The version bearophile posted runs in 14 seconds (48420527 ticks) on my machine.

I created a new version using my own hashmap implementation and manual memory management (no GC present at all).
This version runs in 12 seconds (41614375 ticks).

Preallocating all the entries for the hashmap brings quite a boost: it brings the time down to 9 seconds (32926550 ticks).

If you replace the default hash function D uses with the MurMur hash function, it brings the time down even more, to 8 seconds (29425713 ticks).
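For reference, here is a minimal sketch of the public-domain MurmurHash3 (x86, 32-bit) routine in D. The exact MurMur variant used for the measurements above isn't shown in this thread, so treat this as an illustration rather than the code that was benchmarked:

// Sketch of MurmurHash3 (x86, 32-bit); not necessarily the exact
// variant used in the measurements above.
uint murmur3(const(ubyte)[] data, uint seed = 0) pure nothrow {
    enum uint c1 = 0xcc9e2d51, c2 = 0x1b873593;
    uint h = seed;
    size_t i = 0;

    // Mix the input four bytes at a time.
    for (; i + 4 <= data.length; i += 4) {
        uint k = data[i] | (data[i+1] << 8) | (data[i+2] << 16) | (data[i+3] << 24);
        k *= c1;  k = (k << 15) | (k >> 17);  k *= c2;
        h ^= k;   h = (h << 13) | (h >> 19);  h = h * 5 + 0xe6546b64;
    }

    // Mix the trailing 1 to 3 bytes, if any.
    uint k = 0;
    switch (data.length & 3) {
        case 3: k ^= data[i+2] << 16; goto case;
        case 2: k ^= data[i+1] << 8;  goto case;
        case 1: k ^= data[i];
                k *= c1;  k = (k << 15) | (k >> 17);  k *= c2;
                h ^= k;
                break;
        default: break;
    }

    // Final avalanche.
    h ^= cast(uint) data.length;
    h ^= h >> 16;  h *= 0x85ebca6b;
    h ^= h >> 13;  h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}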

I also tried not using any hash function at all, but that made the time skyrocket.

I stopped at that point, because now most of the time is spent looking for a free entry in the hashmap, which pretty much comes down to cache misses.

At this point the profiler looked something like this:
50%  find free entry in hashmap
21%  parse uint
9%   find end of line + tab
3.5% read data from disk


The ticks in my measurements have been obtained via QueryPerformanceCounter.
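For comparison, std.datetime's StopWatch is a portable way to take similar measurements; a minimal sketch:

import std.datetime : StopWatch;
import std.stdio : writeln;

void main() {
    StopWatch sw;
    sw.start();
    // ... code under measurement ...
    sw.stop();
    // .length is the raw tick count, .msecs a derived value
    writeln(sw.peek().length, " ticks (", sw.peek().msecs, " ms)");
}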

Kind Regards
Benjamin Thaut
December 27, 2013
Benjamin Thaut:

> If you replace the default hash function D uses with the MurMur hash function it brings it down even more to 8 seconds (29425713 ticks)

What compiler are you using? ldc2?

And is it a good idea to put MurMur in D?


> 50%  find free entry in hashmap
> 21%  parse uint

Perhaps the parse uint times can be improved. A lightly tested function to try:


uint toUint(const(char)[] txt) pure nothrow {
    auto p = txt.ptr;
    const q = p + txt.length;

    // Skip leading not-digits.
    while (p < q && (*p < '0' || *p > '9'))
        p++;

    uint result = 0;
    while (p < q && *p >= '0' && *p <= '9') {
        result = (result * 10) + (*p - '0');
        p++;
    }
    return result;
}

void main() {
    import std.stdio;
    "".toUint.writeln;
    "x".toUint.writeln;
    "-1".toUint.writeln;
    " 125 ".toUint.writeln;
    " 10000".toUint.writeln;
    " 1000000 ".toUint.writeln;
    " 10000000000000 ".toUint.writeln;
}


Its output (it's not a safe conversion):

0
0
1
125
10000
1000000
1316134912

Bye,
bearophile
December 27, 2013
Am 27.12.2013 12:02, schrieb bearophile:
> Benjamin Thaut:
>
>> If you replace the default hash function D uses with the MurMur hash
>> function it brings it down even more to 8 seconds (29425713 ticks)
>
> What compiler are you using? ldc2?

dmd 2.064.2

>
> And is it a good idea to put MurMur in D?

I don't know about the properties of the MurMur hash function. I only know that it is cheaper to compute than the hash function D uses. To decide whether it would be better than the currently chosen hash function, an in-depth analysis would be required.

>
>
>> 50%  find free entry in hashmap
>> 21%  parse uint
>
> Perhaps the parse uint times can be improved. A lightly tested function
> to try:
>
>
> uint toUint(const(char)[] txt) pure nothrow {
>      auto p = txt.ptr;
>      const q = p + txt.length;
>
>      // Skip leading not-digits.
>      while (p < q && (*p < '0' || *p > '9'))
>          p++;
>
>      uint result = 0;
>      while (p < q && *p >= '0' && *p <= '9') {
>          result = (result * 10) + (*p - '0');
>          p++;
>      }
>      return result;
> }

The way I parse the uint already looks like that. I'm not using std.conv:
https://github.com/Ingrater/thBase/blob/master/src/thBase/conv.d#L22

Kind Regards
Benjamin Thaut


December 27, 2013
On Friday, 27 December 2013 at 09:37:09 UTC, Benjamin Thaut wrote:
> Am 25.12.2013 17:12, schrieb Gordon:
>
> So I was interested in how far you can take this. The version bearophile posted runs in 14 seconds (48420527 ticks) on my machine.
>
<...>
>
> At this point the profiler looked something like this:
> 50%  find free entry in hashmap
> 21%  parse uint
> 9%   find end of line + tab
> 3.5% read data from disk
>

This is turning into a very interesting (and useful) discussion. Thank you!

I've created a generic "FileSlurp" library that loads a file (using "byLineFast") and uses a delegate to store the data (so I can later use any improved methods you suggest).
It's available here; comments are welcome:
  http://code.dlang.org/packages/fileslurp

Benjamin,
Can you point me to your Hashmap implementation? I could perhaps use it to improve the timings even more.

Bearophile,
Who is "JM" who wrote "byLineFast" ? I'm using his code and would like to properly acknowledge him/her.

Thanks!
 -gordon

December 27, 2013
Gordon:

> Who is "JM" who wrote "byLineFast" ? I'm using his code and would like to properly acknowledge him/her.

It should be Juan Manuel Cabo:
http://permalink.gmane.org/gmane.comp.lang.d.general/117750

Bye,
bearophile
December 27, 2013
Am 27.12.2013 17:49, schrieb Gordon:
>
> Benjamin,
> Can you point me to your Hashmap implementation? I could perhaps use it
> to improve the timings even more.

https://github.com/Ingrater/druntime/blob/merge64/src/core/hashmap.d

This implementation depends on my own allocator design, but it should be possible to remove that dependency quite easily by replacing all allocations/frees with malloc/free or GC.malloc/nothing. Just make sure that the memory is not initialized beforehand (as "new ubyte[sizeInBytes]" would do), because that also has a considerable performance impact. Also, when allocating with GC.malloc you should specify the GC.BlkAttr.NO_SCAN flag, because you know that your data does not contain any pointers. That way the GC will not scan that huge memory block, and it should speed up collections a lot.
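A minimal sketch of such an allocation; the Entry struct is a hypothetical stand-in for the hashmap's entry type:

import core.memory : GC;

struct Entry { size_t key, value; } // hypothetical entry type

Entry[] allocEntries(size_t n) {
    // GC.malloc leaves the block uninitialized (unlike new Entry[n]),
    // and NO_SCAN tells the collector the block holds no pointers,
    // so it is never scanned during a collection.
    auto p = cast(Entry*) GC.malloc(n * Entry.sizeof, GC.BlkAttr.NO_SCAN);
    return p[0 .. n];
}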

To improve the performance of the hashmap you can try:
- Specifying different hashing functions (via the hash policy). See https://github.com/Ingrater/thBase/blob/master/src/thBase/policies/hashing.d for more examples of hashing policies.
- Changing the number of entries that are always kept free in the hashmap (currently 25%); to do that, change lines 233, 207 and 210 (not factored into a constant yet). Reducing the free entries might result in fewer cache misses but more linear search, as this is a linear-probing hashmap. Increasing the free entries might reduce the amount of linear search, but increases cache misses and memory usage.
- Computing a series of prime numbers, in which each prime is at least twice as big as the previous one, and using those as the possible sizes for the hashmap. Prime-number sizes give a better distribution of the items within the hashmap, and therefore reduce the amount of linear searching necessary, improving the hashmap's performance.
- When searching for the next free spot in the hashmap, it currently just adds 1 to the previous index found (line 301). It is also possible to rehash the previous index (just put it into the hashing function again), which gives a new index that is more distant in the hashmap from the current one. This again improves the distribution of the items in the hashmap, and thus reduces linear search time, but may increase the amount of cache misses as it no longer does linear memory access (see the sketch below).
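A minimal sketch of the two probing strategies from the last point; the multiplicative constant is just one possible rehash choice, not the one from the linked code:

// Linear probing: cache-friendly, but entries cluster together.
size_t probeLinear(size_t index, size_t tableSize) {
    return (index + 1) % tableSize;
}

// Rehash probing: map the index to a more distant slot. Better
// distribution, fewer linear-search steps, but more cache misses.
size_t probeRehash(size_t index, size_t tableSize) {
    // Knuth's multiplicative constant; +1 so index 0 moves too.
    return (index * 2654435761UL + 1) % tableSize;
}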

In case you are interested in my implementation of your problem, see here:
http://dpaste.dzfl.pl/8d0d174e

You won't be able to compile the code unless you set up my custom D environment, which I would not recommend unless you really hate the D garbage collector ;-)

Kind Regards
Benjamin Thaut
December 27, 2013
Also, Gordon, what is keeping you from converting the textual representation to a binary one? I mean, if you need to read that file in often, you can easily preprocess it into a binary blob. You could even modify my hashmap implementation to support binary serialization, by just writing the entire array to a binary file and loading it again.
That would be fast...
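A minimal sketch of that preprocessing, assuming a plain key/value pair layout; the Pair struct and function names are made up for illustration, and the blob is only valid for the same word size and endianness it was written with:

import std.stdio : File;

struct Pair { size_t key, value; } // hypothetical on-disk layout

// One-time preprocessing: dump the parsed pairs as raw bytes.
void savePairs(string path, const(Pair)[] pairs) {
    File(path, "wb").rawWrite(pairs);
}

// Later runs: load the whole blob with a single read, no parsing.
Pair[] loadPairs(string path) {
    auto f = File(path, "rb");
    auto pairs = new Pair[cast(size_t)(f.size / Pair.sizeof)];
    f.rawRead(pairs);
    return pairs;
}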

Kind Regards
Benjamin Thaut
December 27, 2013
On Friday, 27 December 2013 at 09:37:09 UTC, Benjamin Thaut wrote:
> At this point the profiler looked something like this:
> 50%  find free entry in hashmap

When I'm building large immutable hash tables I do this:

Read in all the unordered key-value pairs into an array of Tuple!(Key, Value)

Determine the number of buckets you want (B)

Schwartz sort the array by hash(key) % B (a radix sort will probably be efficient here, but any will do)

This groups all items with the same bucket index together. You can then easily generate the array of bucket pointers by iterating the array.

Of course, you need your own hash table implementation to do this. You cannot do it using the built-in hash tables.
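A minimal sketch of that construction for the integer pairs in this thread, using the key itself as the hash; schwartzSort caches the computed bucket index, so it is evaluated only once per element:

import std.algorithm : schwartzSort;
import std.typecons : Tuple;

alias Pair = Tuple!(size_t, "key", size_t, "value");

// Sort pairs by bucket index, then record where each bucket starts.
// bucketStart[b] .. bucketStart[b+1] is the slice of pairs in bucket b.
size_t[] buildBuckets(Pair[] pairs, size_t B) {
    pairs.schwartzSort!(p => p.key % B);

    auto bucketStart = new size_t[B + 1];
    size_t b = 0;
    foreach (i, p; pairs) {
        immutable idx = p.key % B;
        while (b <= idx)
            bucketStart[b++] = i;
    }
    while (b <= B)
        bucketStart[b++] = pairs.length;
    return bucketStart;
}

A lookup then hashes the key and linearly scans the slice pairs[bucketStart[idx] .. bucketStart[idx + 1]].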
December 27, 2013
On Tuesday, 24 December 2013 at 22:28:21 UTC, Gordon wrote:
> Hello,
>
> I want to load a large text file containing two numeric fields into an associative array.
> The file looks like:
>    1   40
>    4   2
>    42  11
>    ...
>
> And has 11M lines.
>
> My code looks like this:
> ===
> void main()
> {
>         size_t[size_t] unions;
>         auto f = File("input.txt");
>         foreach ( line ; f.byLine() ) {
>                 auto fields = line.split();
>                 size_t i = to!size_t(fields[0]);
>                 size_t j = to!size_t(fields[1]);
>                 unions[i] = j; // <-- here be question
>         }
> }
> ===
>
> This is just a test code to illustrate my question (though general comments are welcomed - I'm new to D).
>
> Commenting out the highlighted line (not populating the hash), the program completes in 25 seconds.
> Compiling with the highlighted line, the program takes ~3.5 minutes.
>
> Is there a way to speed the loading? perhaps reserving memory in the hash before populating it? Or another trick?
>
> Many thanks,
>  -gordon

Using OrderedAA improves speed about 3x:
https://github.com/Kozzi11/Trash/tree/master/util

import util.orderedaa;

int main(string[] args)
{
    import std.stdio, std.conv, std.string, core.memory;
    import bylinefast;

    GC.disable;
    OrderedAA!(size_t, size_t, 1_000_007) unions;
    //size_t[size_t] unions;
    foreach (line; "input.txt".File.byLineFast) {
        line.munch(" \t"); // skip ws
        immutable i = line.parse!size_t;
        line.munch(" \t"); // skip ws
        immutable j = line.parse!size_t;
        unions[i] = j;
    }
    GC.enable;

    return 0;
}