Thread overview
D is for Data Science
November 24, 2014
Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
"The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer."

Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
November 24, 2014
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
> Just browsing reddit and found this article posted about D.
> Written by Andrew Pascoe of AdRoll.
>
> From the article:
> "The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer."
>
> Article:
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>
> Reddit:
> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/

Why is File.byLine so slow? Having to work around the standard library defeats the point of a standard library.
November 24, 2014
25-Nov-2014 00:34, weaselcat wrote:
> On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
>> Just browsing reddit and found this article posted about D.
>> Written by Andrew Pascoe of AdRoll.
>>
>> From the article:
>> "The D programming language has quickly become our language of choice
>> on the Data Science team for any task that requires efficiency, and is
>> now the keystone language for our critical infrastructure. Why?
>> Because D has a lot to offer."
>>
>> Article:
>> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>>

Quoting the article:

> One of the best things we can do is minimize the amount of memory we’re allocating; we allocate a new char[] every time we read a line.

This is wrong: byLine reuses the buffer if it's mutable, which is the case with char[]. I recommend authors always double-check a hypothesis before stating it in an article, especially about performance.

Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652

And notice a warning about reusing the buffer here:

https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741
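For illustration, a minimal sketch of the pitfall that warning is about (the retained-slices bug here is made up for the example): the char[] slices byLine yields can alias its internal buffer, so they must be copied before being kept around.

```d
// Sketch of the pitfall the warning describes: byLine may reuse its
// internal buffer, so retained char[] slices can be overwritten later.
import std.stdio;

void main(string[] args)
{
    char[][] kept;
    auto f = File(args[1], "r");
    foreach (char[] line; f.byLine())
    {
        kept ~= line;        // bug: may alias the reused buffer
        // kept ~= line.dup; // correct: copy before retaining
    }
    writeln(kept.length);
}
```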

>> Reddit:
>> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
>>
>
> Why is File.byLine so slow?

Seems to have been mostly fixed some time ago. It's slower than straight fgets, but it's not that bad.

Also, a nearly optimal solution using C's fgets with a growable buffer is way simpler than the code outlined in the article. Or we can mmap the file too.
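Something like this sketch (buffer size and doubling policy are arbitrary choices, not from the article): a single buffer grows whenever a line doesn't fit, which amortizes the cost without juggling multiple buffers.

```d
// Growable-buffer fgets loop (sketch): grow one buffer when a line
// doesn't fit, instead of maintaining several buffers at once.
import core.stdc.stdio;
import core.stdc.string : strlen;
import std.string : toStringz;

void main(string[] args)
{
    auto buf = new char[256];
    FILE* fp = fopen(args[1].toStringz, "r");
    size_t cnt = 0;
    size_t pos = 0; // length of the partial line already in buf
    while (fgets(buf.ptr + pos, cast(int)(buf.length - pos), fp) !is null)
    {
        pos += strlen(buf.ptr + pos);
        if (pos + 1 == buf.length && buf[pos - 1] != '\n')
        {
            buf.length *= 2; // line longer than buffer: grow, keep reading
            continue;
        }
        cnt++; // complete line now in buf[0 .. pos]
        pos = 0;
    }
    if (pos) cnt++; // final line without a trailing newline
    fclose(fp);
    printf("%zu\n", cnt);
}
```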

> Having to work around the standard library
> defeats the point of a standard library.

Truth be told, most of the slowdown should be in the eager split, notably the GC allocation per line. It may also trigger a GC collection after splitting many lines, maybe even many collections.

The easy way out is to use the standard _splitter_, which is lazy and non-allocating. That is a tiny change, and it still uses a nice, clean standard function.
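Concretely, the change looks like this (summing tab-separated numbers is a made-up stand-in workload, not the article's actual task):

```d
// The lazy fix: std.algorithm.splitter yields slices on demand,
// instead of allocating an array per line like std.array.split does.
import std.stdio;
import std.algorithm : splitter;
import std.conv : to;

void main(string[] args)
{
    auto file = File(args[1], "r");
    double total = 0;
    foreach (char[] line; file.byLine())
    {
        foreach (field; line.splitter('\t')) // was: line.split('\t')
            total += to!double(field);
    }
    writeln(total);
}
```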

The article was really disappointing to me because I expected to see the single-line change outlined above fix 80% of the problem elegantly. Instead I see 100+ spooky lines that needlessly maintain 3 buffers at the same time (how scientific) instead of growing a single one to amortize the cost. And then a claim that it's nice to be able to improve speed so easily.


-- 
Dmitry Olshansky
November 24, 2014
Dmitry Olshansky:

>> Why is File.byLine so slow?
>
> Seems to be mostly fixed sometime ago.

Really? I am not so sure.

Bye,
bearophile
November 24, 2014
On 11/24/2014 2:25 PM, Dmitry Olshansky wrote:
> [...]

Excellent comments. Please post them on the reddit page!

November 24, 2014
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
> Just browsing reddit and found this article posted about D.
> Written by Andrew Pascoe of AdRoll.
>
> From the article:
> "The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer."
>
> Article:
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>
> Reddit:
> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/

Is this related?

https://github.com/dscience-developers/dscience


November 24, 2014
On Monday, 24 November 2014 at 23:32:14 UTC, Jay Norwood wrote:

> Is this related?
>
> https://github.com/dscience-developers/dscience

This seems good too. Why then the comments in the discussion about a lack of libraries?

https://github.com/kyllingstad/scid/wiki


November 24, 2014
25-Nov-2014 01:28, bearophile wrote:
> Dmitry Olshansky:
>
>>> Why is File.byLine so slow?
>>
>> Seems to be mostly fixed sometime ago.
>
> Really? I am not so sure.
>
> Bye,
> bearophile

I too had suspected it in the past, and then I tested it.
Now I'll test it again; it's always easier to check than to argue.

Two minimal programs
//my.d:
import std.stdio;

void main(string[] args) {
    auto file = File(args[1], "r");
    size_t cnt=0;
    foreach(char[] line; file.byLine()) {
        cnt++;
    }
}
//my2.d
import core.stdc.stdio;
import std.string : toStringz; // args[1] is not guaranteed NUL-terminated

void main(string[] args) {
    char[] buf = new char[32768];
    size_t cnt;
    FILE* file = fopen(args[1].toStringz, "r");
    while(fgets(buf.ptr, cast(int)buf.length, file) != null){
        cnt++;
    }
    fclose(file);
}

In the console session below, the log file is my dmesg log replicated many times (34 MB total).

dmitry@Ubu64 ~ $ wc -l log
522240 log
dmitry@Ubu64 ~ $ du -hs log
34M	log

# touch it, to have it in disk cache:
dmitry@Ubu64 ~ $ cat log > /dev/null

dmitry@Ubu64 ~ $ dmd my
dmitry@Ubu64 ~ $ dmd my2

dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.062s
user	0m0.039s
sys	0m0.023s
dmitry@Ubu64 ~ $ time ./my log

real	0m0.181s
user	0m0.155s
sys	0m0.025s

~4 times slower in user mode, okay...
Now with full optimizations; ranges are very sensitive to optimizations:

dmitry@Ubu64 ~ $ dmd -O -release -inline  my
dmitry@Ubu64 ~ $ dmd -O -release -inline  my2
dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.065s
user	0m0.042s
sys	0m0.023s
dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.063s
user	0m0.040s
sys	0m0.023s

Which is 1:1 parity. Another myth busted? ;)

-- 
Dmitry Olshansky
November 24, 2014
Dmitry Olshansky:

> Which is 1:1 parity. Another myth busted? ;)

There is still an open bug report:
https://issues.dlang.org/show_bug.cgi?id=11810

Do you want also to benchmark that byLineFast that for me is usually significantly faster than the byLine?

Bye,
bearophile
November 25, 2014
25-Nov-2014 02:43, bearophile wrote:
> Dmitry Olshansky:
>
>> Which is 1:1 parity. Another myth busted? ;)

> dmitry@Ubu64 ~ $ time ./my2 log
>
> real    0m0.065s
> user    0m0.042s
> sys    0m0.023s
> dmitry@Ubu64 ~ $ time ./my2 log
>
> real    0m0.063s
> user    0m0.040s
> sys    0m0.023s
>

> Read the above more carefully.

OMG. I really need to watch my fingers, and double-check :)

dmitry@Ubu64 ~ $ time ./my log

real	0m0.156s
user	0m0.130s
sys	0m0.026s

dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.063s
user    0m0.040s
sys    0m0.023s

Which is quite bad. Optimizations do help but not much.

>
> There is still an open bug report:
> https://issues.dlang.org/show_bug.cgi?id=11810
>
> Do you want also to benchmark that byLineFast that for me is usually
> significantly faster than the byLine?
>

And it seems like byLineFast is indeed fast.

dmitry@Ubu64 ~ $ time ./my3 log

real	0m0.056s
user	0m0.031s
sys	0m0.025s
dmitry@Ubu64 ~ $ time ./my2 log

real	0m0.065s
user	0m0.041s
sys	0m0.024s


Now that I've been destroyed, the question is: who is going to make a PR for this?

-- 
Dmitry Olshansky