May 30, 2017 howto count lines - fast
After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , I was wondering how fast one can do a simple "wc -l" in D. So I made a couple of short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l". What would a fast implementation of this look like in D?
May 30, 2017 Re: howto count lines - fast
Posted in reply to Nitram

On Tuesday, 30 May 2017 at 20:02:38 UTC, Nitram wrote:
> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was wondering how fast one can do a simple "wc -l" in D.
>
> So i made a couple short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>
> How would a implementation look like in D, which is fast?

Not sure if this is the fastest, but anyway:

    void main(){
        import std.file : read;
        import std.conv : to;
        import std.algorithm : count;

        auto data = cast(ubyte[])read("somefile.txt");
        auto lc = data.count('\n'.to!ubyte);
    }

Jordan
May 30, 2017 Re: howto count lines - fast
Posted in reply to Jordan Wilson

On Tuesday, 30 May 2017 at 20:37:44 UTC, Jordan Wilson wrote:
> On Tuesday, 30 May 2017 at 20:02:38 UTC, Nitram wrote:
>> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was wondering how fast one can do a simple "wc -l" in D.
>>
>> So i made a couple short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>>
>> How would a implementation look like in D, which is fast?
>
> Not sure if this is the fastest, but anyway
>
>     void main(){
>         import std.file : read;
>         import std.conv : to;
>         import std.algorithm : count;
>
>         auto data = cast(ubyte[])read("somefile.txt");
>         auto lc = data.count('\n'.to!ubyte);
>     }
>
> Jordan

I should say, if you don't care about storing the data,

    File("somefile.txt","r").byLine.count

is probably more idiomatic, and probably just as fast.

Jordan
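[Editor's note: as a complete program, the byLine variant mentioned in this post might look like the following sketch; the filename is a placeholder.]

```d
// Minimal sketch of the byLine approach; "somefile.txt" is a placeholder.
import std.stdio : File, writeln;
import std.algorithm.searching : count;

void main()
{
    // byLine lazily iterates over lines without loading the whole file,
    // and count consumes the range to tally how many lines it produced.
    auto lc = File("somefile.txt", "r").byLine.count;
    writeln(lc);
}
```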
May 30, 2017 Re: howto count lines - fast
Posted in reply to Nitram

I do not know; this is my first attempt, and it is almost as fast as wc on my PC:

    int main(string[] args)
    {
        import std.stdio : writeln, writefln, File;
        import std.array : uninitializedArray;

        auto f = File("data");
        size_t c = 0;
        auto buffer = uninitializedArray!(ubyte[])(1024);
        foreach (chunk; f.byChunk(buffer))
        {
            foreach (ch; chunk)
                if (ch == '\n')
                    ++c;
        }
        writeln(c);
        return 0;
    }

And I believe it could be faster if iopipe were used.

On 30.5.2017 at 22:02, Nitram via Digitalmars-d-learn wrote:
> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was wondering how fast one can do a simple "wc -l" in D.
>
> So i made a couple short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>
> How would a implementation look like in D, which is fast?
May 30, 2017 Re: howto count lines - fast
Posted in reply to Nitram

On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via Digitalmars-d-learn wrote:
> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was wondering how fast one can do a simple "wc -l" in D.
>
> So i made a couple short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>
> How would a implementation look like in D, which is fast?

This little challenge piqued my interest. So I decided to take a shot at seeing if I could beat my system's /usr/bin/wc -l.

First order of business: whenever it comes to performance, always choose the right compiler for the job. While dmd is the reference compiler and has the latest and greatest features, it doesn't do too well in the performance department. So the first thing I did was to use gdc instead. Specifically, gdc -O3, to get the most aggressive optimizations the gcc backend can give me.

Second order of business: the naïve algorithm of reading 1 character at a time from the file is obviously suboptimal, so I didn't even consider it.

Third order of business: I constructed a large text file containing 10634420 lines by concatenating 5 copies of a large CSV file I obtained online. The sample had to be large enough so that overhead like program startup times and background system noise don't bias the test results too much.

With this setup, I measured the performance of my system's wc -l as the baseline for comparing the performance of the D solutions. Here's a typical output of my shell's `time wc -l` command:

    real    0m0.344s
    user    0m0.153s
    sys     0m0.191s

Since we're talking about beating wc, and others have already posted the obvious solutions of loading into a buffer and scanning the buffer, the most obvious next step up is to use std.mmfile to memory-map the file so that we don't waste effort managing file buffers ourselves: let the OS do it for us.

So here are the algorithms I tried:

1) lineCount1(): use std.mmfile to memory-map the file so that we can scan it as a single, contiguous array without any fussy buffer-management details. Result: pretty fast, but still loses to wc. Typical measurements:

    real    0m0.422s
    user    0m0.366s
    sys     0m0.054s

2) lineCount2(): just as a comparison, I did a read-into-buffer algorithm so that we have a reference as to how it performs. As expected, it's worse than the std.mmfile solution. Typical measurements:

    real    0m0.519s
    user    0m0.320s
    sys     0m0.198s

3) lineCount3(): in lineCount1(), I used std.algorithm.searching.count for counting newlines; so I thought, what if the range abstraction is introducing too much overhead? So I decided to write a simple foreach loop instead, in the hope that the simpler code would allow the gcc optimizer to do a better job. Sadly, the result is pretty much the same as lineCount1:

    real    0m0.421s
    user    0m0.379s
    sys     0m0.042s

(The good news, though, is that this shows that using std.algorithm does not introduce excessive overhead compared to a hand-written loop. This proves that the high-level range abstraction does not necessarily equate to slower performance.)

4) lineCount4(): having failed to beat wc thus far, it's time to bring out the big guns. Since we're essentially counting newlines in the input, who says we have to process the data sequentially from start to end? Counting from front to back or back to front gives the same answer, as does counting from middle to end, then front to middle. In particular, counting from middle to end *in parallel* with counting from front to middle, then summing the subtotals, gives the same answer. I.e., time to bust out std.parallelism. :-)

So, in my 4th algorithm, I split the input into 16KB chunks, and then scan them for newlines in parallel, producing file_size/16384 subtotals. Then I sum the subtotals at the end. Here's the result, tested on my AMD Phenom 6-core CPU:

    real    0m0.242s
    user    0m1.151s
    sys     0m0.057s

Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair margin, too. Eat your heart out, wc!!! (The large user time is because we're using all 6 cores at once. But the actual elapsed time is shorter.)

Here's the code for all of the above:

---------snip---------
import std.stdio;

// real    0m0.422s
// user    0m0.366s
// sys     0m0.054s
size_t lineCount1(string filename)
{
    import std.algorithm.searching : count;
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    return data.count('\n');
}

// real    0m0.519s
// user    0m0.320s
// sys     0m0.198s
size_t lineCount2(string filename)
{
    import std.algorithm.searching : count;

    auto f = File(filename);
    ubyte[] buf;
    size_t c = 0;

    buf.length = 32768;
    do
    {
        auto d = f.rawRead(buf);
        c += d.count('\n');
    } while (!f.eof());
    return c;
}

// real    0m0.421s
// user    0m0.379s
// sys     0m0.042s
size_t lineCount3(string filename)
{
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t c;
    foreach (i; 0 .. data.length)
    {
        if (data[i] == '\n') c++;
    }
    return c;
}

// real    0m0.242s
// user    0m1.151s
// sys     0m0.057s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;
    enum blockSize = 16384;

    if (data.length == 0) return 0;
    subtotals.length = 1 + (data.length - 1) / blockSize;
    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') subtotal++;
        }
    }
    return subtotals.sum;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("Please specify target file.");
        return 1;
    }
    auto filename = args[1];
    //auto count = lineCount1(filename);
    //auto count = lineCount2(filename);
    //auto count = lineCount3(filename);
    auto count = lineCount4(filename);
    writeln(count);
    return 0;
}

// vim:set sw=4 ts=4 et ai:
---------snip---------

T

--
BREAKFAST.COM halted... Cereal Port Not Responding. -- YHL
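[Editor's note: the build-and-time sequence implied by this post might look like the following sketch; the source and data file names are placeholders, and only the gdc -O3 flag is taken from the post itself.]

```shell
# Compile with the gcc backend's aggressive optimizations, as described above.
# "linecount.d" and "bigfile.csv" are placeholder names.
gdc -O3 -o linecount linecount.d
time ./linecount bigfile.csv
```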
May 30, 2017 Re: howto count lines - fast
Posted in reply to Nitram

On 05/30/2017 01:02 PM, Nitram wrote:
> After reading
> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i
> was wondering how fast one can do a simple "wc -l" in D.
>
> So i made a couple short implementations and found myself confronted
> with slow results compared to "/usr/bin/wc -l".
>
> How would a implementation look like in D, which is fast?
>
I could not make the D program come close to wc's performance when the data was piped from stdin. A typical run with Daniel Kozak's program:

    $ time cat deleteme.txt | wc -l
    5062176

    real    0m0.086s
    user    0m0.068s
    sys     0m0.048s

    $ time cat deleteme.txt | ./deneme
    5062176

    real    0m0.174s
    user    0m0.152s
    sys     0m0.072s

Ali
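[Editor's note: for reference, a stdin-reading line counter along the lines of the program being timed here might look like the following sketch; it is not the actual ./deneme source.]

```d
// Sketch of a stdin-based line counter, comparable to what is timed above.
import std.stdio : stdin, writeln;

void main()
{
    size_t c = 0;
    ubyte[65536] buffer;                  // reusable fixed-size read buffer
    foreach (chunk; stdin.byChunk(buffer[]))
    {
        // Scan each chunk of raw bytes for newline characters.
        foreach (b; chunk)
            if (b == '\n') ++c;
    }
    writeln(c);
}
```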
May 30, 2017 Re: howto count lines - fast
P.S. After I posted the code, I took a closer look at the disassembly and found that gdc wasn't generating the best code for the parallel foreach loop body. I haven't fully traced the cause yet, but I did find a simple optimization (arguably a micro-optimization): updating the subtotal inside the inner loop is a bit inefficient, because the subtotal is outside the loop body and, due to the way std.parallelism is implemented, is passed by reference to the loop body. This caused gdc not to enregister the subtotal, so incrementing it requires a cache roundtrip at best, a full RAM roundtrip at worst.
So I introduced a local count variable for incrementing, and only assign to the subtotal array at the end of the block:
-------snip-------
// real    0m0.240s
// user    0m1.045s
// sys     0m0.059s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;
    enum blockSize = 16384;

    if (data.length == 0) return 0;
    subtotals.length = 1 + (data.length - 1) / blockSize;
    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        size_t c = 0;
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') c++;
        }
        subtotal = c;
    }
    return subtotals.sum;
}
-------snip-------
A look at the disassembly showed that gdc has enregistered the subtotal and is thereby able to eliminate a branch from the inner loop, using a sete instruction and an add for the if statement. The measured performance is only marginally better, though, and doesn't change the overall result in a significant way. But I thought this might be a useful tip for people who want to squeeze out the last drop of juice from their CPUs. ;-)
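[Editor's note: the branchless form described here can also be written by hand; a sketch follows, where countNewlines is a hypothetical helper, not part of the posted code.]

```d
// Hypothetical helper showing the branchless inner loop written by hand:
// in D, the bool result of the comparison implicitly converts to 0 or 1,
// so the count can be accumulated without a conditional branch.
size_t countNewlines(const(ubyte)[] data)
{
    size_t c = 0;
    foreach (b; data)
        c += b == '\n';   // adds 1 for a newline, 0 otherwise
    return c;
}
```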
Next, I'll have to look into why gdc didn't unroll the inner loop, like I thought it would. But I'm out of time now, so that will have to wait for another day.
T
--
The peace of mind---from knowing that viruses which exploit Microsoft system vulnerabilities cannot touch Linux---is priceless. -- Frustrated system administrator.
May 31, 2017 Re: howto count lines - fast
Posted in reply to H. S. Teoh

On Tuesday, 30 May 2017 at 23:41:01 UTC, H. S. Teoh wrote:
> This little challenge piqued my interest. So I decided to take a shot at seeing if I could beat my system's /usr/bin/wc -l.
>
> First order of business: whenever it comes to performance, always choose the right compiler for the job...
>
> Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair margin, too. Eat your heart out, wc!!! (The large user time is because we're using all 6 cores at once. But the actual elapsed time is shorter.)

Hm... I cheated a little bit: I took your program and compiled it with `ldc2 -release -O3 -mcpu=skylake`. For data I took std.datetime concatenated 1000 times (35446000 lines). Results (minimum of 10 runs each):

    wc -l       real 0.50   user 0.40   sys 0.09
    lineCount1  real 0.23   user 0.19   sys 0.04
    lineCount2  real 0.29   user 0.17   sys 0.12
    lineCount3  real 0.23   user 0.18   sys 0.04
    lineCount4  real 0.22   user 1.52   sys 0.04

Seems like all of them beat wc, so machine and compiler matter a great deal in this comparison. Thing is, on small files wc is going to win pretty much always, due to D's clunky startup. But with larger ones, wc falls far behind. Interestingly, both lineCount1 and lineCount3 on this machine are nearly even with lineCount4 :)
May 31, 2017 Re: howto count lines - fast
Posted in reply to Ali Çehreli

On 31.5.2017 at 02:13, Ali Çehreli via Digitalmars-d-learn wrote:
> I could not make the D program come close to wc's performance when the data was piped from stdin. A typical run with Daniel Kozak's program:
>
> $ time cat deleteme.txt | wc -l
> 5062176
>
> real 0m0.086s
> user 0m0.068s
> sys 0m0.048s
>
> $ time cat deleteme.txt | ./deneme
> 5062176
>
> real 0m0.174s
> user 0m0.152s
> sys 0m0.072s
>
> Ali
>
How do you compile it? When I use `ldc2 -O3 -release -mcpu=bdver1 lc.d`, my code is even faster than wc.
May 31, 2017 Re: howto count lines - fast
Posted in reply to Daniel Kozak

On 05/30/2017 11:50 PM, Daniel Kozak via Digitalmars-d-learn wrote:
> How do you compile it? When I use ldc2 -O3 -release -mcpu=bdver1 lc.d my code is even faster than wc

My bad: I'm not familiar with ldc's optimization options. (I used -O3 but not -release.) Now I get the same performance as 'wc -l' when I add -release.

Ali
Copyright © 1999-2021 by the D Language Foundation