Thread overview
D is for Data Science
November 24, 2014
Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
"The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer."

Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
November 24, 2014
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
> Just browsing reddit and found this article posted about D.
> Written by Andrew Pascoe of AdRoll.
>
> From the article:
> "The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer."
>
> Article:
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>
> Reddit:
> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/

Why is File.byLine so slow? Having to work around the standard library defeats the point of a standard library.
November 24, 2014
25-Nov-2014 00:34, weaselcat wrote:
> On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
>> Just browsing reddit and found this article posted about D.
>> Written by Andrew Pascoe of AdRoll.
>>
>> From the article:
>> "The D programming language has quickly become our language of choice
>> on the Data Science team for any task that requires efficiency, and is
>> now the keystone language for our critical infrastructure. Why?
>> Because D has a lot to offer."
>>
>> Article:
>> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>>

Quoting the article:

> One of the best things we can do is minimize the amount of memory we’re allocating; we allocate a new char[] every time we read a line.

This is wrong: byLine reuses the buffer if it's mutable, which is the case with char[]. I recommend authors always double-check a hypothesis before stating it in an article, especially about performance.

Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652

And notice a warning about reusing the buffer here:

https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741
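For illustration, a minimal sketch of the pitfall that warning is about (the retained-slices bug here is made up for the example): the char[] slices byLine yields can alias its internal buffer, so they must be copied before being kept around.

```d
// Sketch of the pitfall the warning describes: byLine may reuse its
// internal buffer, so retained char[] slices can be overwritten later.
import std.stdio;

void main(string[] args)
{
    char[][] kept;
    auto f = File(args[1], "r");
    foreach (char[] line; f.byLine())
    {
        kept ~= line;        // bug: may alias the reused buffer
        // kept ~= line.dup; // correct: copy before retaining
    }
    writeln(kept.length);
}
```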

>> Reddit:
>> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
>>
>
> Why is File.byLine so slow?

Seems to have been mostly fixed some time ago. It's slower than straight fgets, but it's not that bad.

Also, a nearly optimal solution using C's fgets with a growable buffer is way simpler than the code outlined in the article. Or we can mmap the file too.
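Something like this sketch (buffer size and doubling policy are arbitrary choices, not from the article): a single buffer grows whenever a line doesn't fit, which amortizes the cost without juggling multiple buffers.

```d
// Growable-buffer fgets loop (sketch): grow one buffer when a line
// doesn't fit, instead of maintaining several buffers at once.
import core.stdc.stdio;
import core.stdc.string : strlen;
import std.string : toStringz;

void main(string[] args)
{
    auto buf = new char[256];
    FILE* fp = fopen(args[1].toStringz, "r");
    size_t cnt = 0;
    size_t pos = 0; // length of the partial line already in buf
    while (fgets(buf.ptr + pos, cast(int)(buf.length - pos), fp) !is null)
    {
        pos += strlen(buf.ptr + pos);
        if (pos + 1 == buf.length && buf[pos - 1] != '\n')
        {
            buf.length *= 2; // line longer than buffer: grow, keep reading
            continue;
        }
        cnt++; // complete line now in buf[0 .. pos]
        pos = 0;
    }
    if (pos) cnt++; // final line without a trailing newline
    fclose(fp);
    printf("%zu\n", cnt);
}
```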

> Having to work around the standard library
> defeats the point of a standard library.

Truth be told, most of the slowdown should be in the eager split, notably the GC allocation per line. It may also trigger a GC collection after splitting many lines, maybe even many collections.

The easy way out is to use the standard _splitter_, which is lazy and non-allocating. That is a tiny change, and it still uses a nice, clean standard function.
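Concretely, the change looks like this (summing tab-separated numbers is a made-up stand-in workload, not the article's actual task):

```d
// The lazy fix: std.algorithm.splitter yields slices on demand,
// instead of allocating an array per line like std.array.split does.
import std.stdio;
import std.algorithm : splitter;
import std.conv : to;

void main(string[] args)
{
    auto file = File(args[1], "r");
    double total = 0;
    foreach (char[] line; file.byLine())
    {
        foreach (field; line.splitter('\t')) // was: line.split('\t')
            total += to!double(field);
    }
    writeln(total);
}
```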

The article was really disappointing to me because I expected to see the single-line change outlined above fix 80% of the problem elegantly. Instead I see 100+ spooky lines that needlessly maintain 3 buffers at the same time (how scientific) instead of growing a single one to amortize the cost. And then a claim that it's nice to be able to improve speed so easily.


-- 
Dmitry Olshansky
November 24, 2014
Dmitry Olshansky:

>> Why is File.byLine so slow?
>
> Seems to be mostly fixed sometime ago.

Really? I am not so sure.

Bye,
bearophile
November 24, 2014
On 11/24/2014 2:25 PM, Dmitry Olshansky wrote:
> [...]

Excellent comments. Please post them on the reddit page!

November 24, 2014
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
> Just browsing reddit and found this article posted about D.
> Written by Andrew Pascoe of AdRoll.
>
> From the article:
> "The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer."
>
> Article:
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>
> Reddit:
> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/

Is this related?

https://github.com/dscience-developers/dscience


November 24, 2014
On Monday, 24 November 2014 at 23:32:14 UTC, Jay Norwood wrote:

> Is this related?
>
> https://github.com/dscience-developers/dscience

This seems good too. Why then the comments in the discussion about a lack of libraries?

https://github.com/kyllingstad/scid/wiki


November 24, 2014
25-Nov-2014 01:28, bearophile wrote:
> Dmitry Olshansky:
>
>>> Why is File.byLine so slow?
>>
>> Seems to be mostly fixed sometime ago.
>
> Really? I am not so sure.
>
> Bye,
> bearophile

I too had suspected it in the past, and then I tested it.
Now I'll test it again; it's always easier to check than to argue.

Two minimal programs
//my.d:
import std.stdio;

void main(string[] args) {
    auto file = File(args[1], "r");
    size_t cnt=0;
    foreach(char[] line; file.byLine()) {
        cnt++;
    }
}
//my2.d
import core.stdc.stdio;
import std.string : toStringz; // args[1] is not guaranteed NUL-terminated

void main(string[] args) {
    char[] buf = new char[32768];
    size_t cnt;
    FILE* file = fopen(args[1].toStringz, "r");
    while(fgets(buf.ptr, cast(int)buf.length, file) != null){
        cnt++;
    }
    fclose(file);
}

In the console session below, the log file is my dmesg log replicated many times (34 MB total).

dmitry@Ubu64 ~ $ wc -l log
522240 log
dmitry@Ubu64 ~ $ du -hs log
34M	log

# touch it, to have it in disk cache:
dmitry@Ubu64 ~ $ cat log > /dev/null

dmitry@Ubu64 ~ $ dmd my
dmitry@Ubu64 ~ $ dmd my2

dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.062s
user	0m0.039s
sys	0m0.023s
dmitry@Ubu64 ~ $ time ./my log

real	0m0.181s
user	0m0.155s
sys	0m0.025s

~4 times slower in user mode, okay...
Now with full optimizations; ranges are very sensitive to optimizations:

dmitry@Ubu64 ~ $ dmd -O -release -inline  my
dmitry@Ubu64 ~ $ dmd -O -release -inline  my2
dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.065s
user	0m0.042s
sys	0m0.023s
dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.063s
user	0m0.040s
sys	0m0.023s

Which is 1:1 parity. Another myth busted? ;)

-- 
Dmitry Olshansky
November 24, 2014
Dmitry Olshansky:

> Which is 1:1 parity. Another myth busted? ;)

There is still an open bug report:
https://issues.dlang.org/show_bug.cgi?id=11810

Do you want also to benchmark that byLineFast that for me is usually significantly faster than the byLine?

Bye,
bearophile
November 25, 2014
25-Nov-2014 02:43, bearophile wrote:
> Dmitry Olshansky:
>
>> Which is 1:1 parity. Another myth busted? ;)

> dmitry@Ubu64 ~ $ time ./my2 log
>
> real    0m0.065s
> user    0m0.042s
> sys    0m0.023s
> dmitry@Ubu64 ~ $ time ./my2 log
>
> real    0m0.063s
> user    0m0.040s
> sys    0m0.023s
>

> Read the above more carefully.

OMG. I really need to watch my fingers, and double-check :)

dmitry@Ubu64 ~ $ time ./my log

real	0m0.156s
user	0m0.130s
sys	0m0.026s

dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.063s
user    0m0.040s
sys    0m0.023s

Which is quite bad. Optimizations do help but not much.

>
> There is still an open bug report:
> https://issues.dlang.org/show_bug.cgi?id=11810
>
> Do you want also to benchmark that byLineFast that for me is usually
> significantly faster than the byLine?
>

And it seems like byLineFast is indeed fast.

dmitry@Ubu64 ~ $ time ./my3 log

real	0m0.056s
user	0m0.031s
sys	0m0.025s
dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.065s
user	0m0.041s
sys	0m0.024s


Now that I've been destroyed, the question is: who is going to make a PR for this?

-- 
Dmitry Olshansky