December 25, 2013
On Tuesday, 24 December 2013 at 23:52:49 UTC, Andrei Alexandrescu wrote:
> On 12/24/13 2:28 PM, Gordon wrote:
>> Hello,
>>
>> I want to load a large text file containing two numeric fields into an
>> associative array.
>> The file looks like:
>>    1   40
>>    4   2
>>    42  11
>>    ...
>>
>> And has 11M lines.
>>
>> My code looks like this:
>> ===
>> import std.array : split;
>> import std.conv : to;
>> import std.stdio : File;
>>
>> void main()
>> {
>>         size_t[size_t] unions;
>>         auto f = File("input.txt");
>>         foreach ( line ; f.byLine() ) {
>>                 auto fields = line.split();
>>                 size_t i = to!size_t(fields[0]);
>>                 size_t j = to!size_t(fields[1]);
>>                 unions[i] = j; // <-- here be question
>>         }
>> }
>> ===
>>
>> This is just a test code to illustrate my question (though general
>> comments are welcomed - I'm new to D).
>>
>> Commenting out the highlighted line (not populating the hash), the
>> program completes in 25 seconds.
>> With the highlighted line included, the program takes ~3.5 minutes.
>>
>> Is there a way to speed the loading? perhaps reserving memory in the
>> hash before populating it? Or another trick?
>
> void main()
> {
>     size_t[size_t] unions;
>     foreach (e; "input.txt"
>             .slurp!(size_t, size_t)("%s %s").sort.uniq ) {
>         unions[e[0]] = e[1];
>     }
> }
>
>
> Andrei

Watch out for the parentheses on sort. As bearophile likes to point out frequently, without parentheses you are calling the built-in sort, not the std.algorithm one.


Gordon, you may find this has better performance if you add () to sort.
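For anyone unfamiliar with this gotcha, here is a minimal sketch of the difference (the built-in array `.sort` property was still live in 2013-era D):

```d
import std.algorithm : sort;
import std.stdio : writeln;

void main()
{
    auto a = [3, 1, 2];
    a.sort;    // no parens: the old built-in array sort property
    auto b = [3, 1, 2];
    b.sort();  // parens: UFCS call to std.algorithm.sort, generally faster
    writeln(b); // [1, 2, 3]
}
```

Both calls produce a sorted array; the difference is which implementation runs, and the built-in one is known to be slower.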
December 25, 2013
On 12/25/13 10:25 AM, John Colvin wrote:
> Watch out for the parentheses on sort. As bearophile likes to point out
> frequently, without parentheses you are calling the built-in sort, not
> the std.algorithm one.

Ew, good point. Thanks.

> Gordon, you may find this has better performance if you add () to sort.

Also, can you share your data somewhere if it's not confidential?


Andrei

December 25, 2013
On Wed, Dec 25, 2013 at 10:51:23AM -0800, Andrei Alexandrescu wrote:
> On 12/25/13 10:25 AM, John Colvin wrote:
> >Watch out for the parentheses on sort. As bearophile likes to point out frequently, without parentheses you are calling the built-in sort, not the std.algorithm one.
> 
> Ew, good point. Thanks.
[...]

Is the built-in sort deprecated yet? If it is, when can we finally say bye-bye to it?


T

-- 
Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.
December 25, 2013
On Wednesday, 25 December 2013 at 18:51:22 UTC, Andrei Alexandrescu wrote:
> On 12/25/13 10:25 AM, John Colvin wrote:
>
>> Gordon, you may find this has better performance if you add () to sort.
>
> Also, can you share your data somewhere if it's not confidential?
>

It will be related to this research: http://www.familinx.org .
The data is available in the "download" section as an SQL dump (which can be easily converted to just tabular files).

I've also added verbose errors to "std.file.slurp()", which (IMHO) provide all the necessary information when an error occurs.
Available here:
 https://github.com/agordon/phobos/compare/file.slurp.verbose

But this has led me to detect some "formattedRead" quirks - I'll start a new thread about those.

Thanks,
 -gordon

December 25, 2013
On Wed, Dec 25, 2013 at 5:47 PM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> On 12/25/13 8:29 AM, Gordon wrote:
>> For completeness (since we're dealing with timing):
>>
>> 1. Running the above code with Garbage-collection enabled, takes 1m45s.
>>
>> 2. Running it with GC disabled takes 50s .
>>
>> 3. Running with GC disabled and without sort+uniq (my data is already
>> uniq'd), takes 40s.
>
>
> Thanks for the numbers!

Yes, indeed, thanks. I would have thought sorting (particularly the built-in sort) + uniq would 'cost' more.
December 25, 2013
H. S. Teoh:

> Is the built-in sort deprecated yet? If it is, when can we finally say bye-bye to it?

https://d.puremagic.com/issues/show_bug.cgi?id=10318

Bye,
bearophile
December 26, 2013
> import core.memory : GC;
> import std.conv : parse;
> import std.stdio : File;
> import std.string : munch;
>
> void main()
> {
>     GC.disable;
>     size_t[size_t] unions;
>     // byLineFast is not in Phobos; assumes a custom buffered line reader is in scope
>     foreach (line; "input.txt".File.byLineFast) {
>         line.munch(" \t"); // skip whitespace
>         immutable i = line.parse!size_t;
>         line.munch(" \t"); // skip whitespace
>         immutable j = line.parse!size_t;
>         unions[i] = j;
>     }
>     GC.enable;
> }

A question for people who know something about the D garbage collector: could some logic be added to the GC to detect situations where the program allocates very frequently, and disable collection or reduce its frequency, to increase performance a lot? (I think the Python garbage collector "recently" added such an optimization.) That way, manually enabling and disabling the GC would no longer be needed.

Bye,
bearophile
December 26, 2013
On Thursday, 26 December 2013 at 14:09:29 UTC, bearophile wrote:

> A question for people that know something about the D garbage collector: can it be added some logic to the GC to detect the situations where you have to allocate many times very frequently, and disable or rarefy the collection frequency, to increase the code performance a lot? (I think "recently" the Python garbage collector has added such optimization). This way that enabling and disabling the GC becomes not needed any more.

I think it's possible. During the sweep phase, where the GC returns unmarked memory blocks (garbage) to the free lists, it can calculate the total amount of memory freed at that stage. Probably this stat is already being calculated. It also knows the total size of its heap, so after a collection it knows what percentage of the heap was freed. When the program is in a state where it mostly allocates, this percentage will be very low compared to the usual case; when the GC sees it is that low, it can raise the effective threshold for the next collection, making collections less frequent.

I think the actual tuning parameter would be the size of the pool allocated from system memory. AFAIR, the GC is currently triggered when there are not enough free blocks in the current pool, and if after a collection there is still not enough free space, a new pool gets allocated. Its size is currently constant, making long allocating loops run in O(n^2); but if the pool size grew when the percentage of collected memory is low, and shrank back when the percentage of freed space is higher again, then the GC would adapt to the allocation rate and such scenarios would run faster.
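The heuristic described above could be sketched roughly as follows. All names and numbers here are made up for illustration; this is not the actual druntime GC code:

```d
// Hypothetical sketch of an adaptive pool-growth heuristic.
// Names, thresholds, and growth factors are invented for illustration only.
struct AdaptivePoolPolicy
{
    size_t poolSize   = 1 << 20;  // current pool size in bytes (assumed default)
    size_t basePool   = 1 << 20;  // size to shrink back to
    double growFactor = 2.0;      // how fast to grow when collections yield little
    double lowYield   = 0.05;     // below 5% freed => program is "mostly allocating"

    // Called after each collection with the fraction of the heap that was freed.
    void afterCollection(double freedFraction)
    {
        if (freedFraction < lowYield)
            poolSize = cast(size_t)(poolSize * growFactor); // collect less often
        else
            poolSize = basePool; // normal behavior: shrink back
    }
}
```

Under this policy, a long allocation-heavy loop would see its pool grow geometrically, turning the O(n^2) collect-per-pool behavior into amortized linear work, at the cost of a temporarily larger heap.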
December 26, 2013
Am 25.12.2013 21:26, schrieb Gordon:
> On Wednesday, 25 December 2013 at 18:51:22 UTC, Andrei Alexandrescu wrote:
>> On 12/25/13 10:25 AM, John Colvin wrote:
>>
>>> Gordon, you may find this has better performance if you add () to sort.
>>
>> Also, can you share your data somewhere if it's not confidential?
>>
>
> It will be related to this research: http://www.familinx.org .
> The data is available in the "download" section as an SQL dump (which
> can be easily converted to just tabular files).
>
> I've also added verbose errors to "std.file.slurp()", which (IMHO)
> provide all the necessary information when an error occurs.
> Available here:
>   https://github.com/agordon/phobos/compare/file.slurp.verbose
>
> But this has led me to detect some "formattedRead" quirks - I'll start
> a new thread about those.
>
> Thanks,
>   -gordon
>

Did you use D to convert that SQL dump to tabular data? If so, could you post the small snippet? If not, how did you convert the "relationship" table?

Kind Regards
Benjamin Thaut
December 26, 2013
On Thursday, 26 December 2013 at 18:07:38 UTC, Benjamin Thaut wrote:
>
> Did you use D to convert that SQL dump to tabular data? If so, could you post the small snippet? If not did you convert the "relationship" table?

No, I only use D for some internal experimentation.
To get the data in tabular form, I loaded the SQL dump into a MySQL server, then used "mysqldump --tab" to export the tables I needed.

HTH,
 -gordon