howto count lines - fast
May 30, 2017
After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/, I was wondering how fast one can do a simple "wc -l" in D.

So I made a couple of short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".

What would a fast implementation look like in D?

May 30, 2017
On Tuesday, 30 May 2017 at 20:02:38 UTC, Nitram wrote:
> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/, I was wondering how fast one can do a simple "wc -l" in D.
>
> So I made a couple of short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>
> What would a fast implementation look like in D?

Not sure if this is the fastest, but anyway:

void main(){
    import std.file : read;
    import std.conv : to;
    import std.algorithm : count;
    import std.stdio : writeln;

    // Read the whole file into memory and count the newline bytes.
    auto data = cast(ubyte[])read("somefile.txt");
    auto lc = data.count('\n'.to!ubyte);
    writeln(lc);
}

Jordan
May 30, 2017
On Tuesday, 30 May 2017 at 20:37:44 UTC, Jordan Wilson wrote:
> On Tuesday, 30 May 2017 at 20:02:38 UTC, Nitram wrote:
>> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/, I was wondering how fast one can do a simple "wc -l" in D.
>>
>> So I made a couple of short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>>
>> What would a fast implementation look like in D?
>
> Not sure if this is the fastest, but anyway:
>
> void main(){
>     import std.file : read;
>     import std.conv : to;
>     import std.algorithm : count;
>     import std.stdio : writeln;
>
>     // Read the whole file into memory and count the newline bytes.
>     auto data = cast(ubyte[])read("somefile.txt");
>     auto lc = data.count('\n'.to!ubyte);
>     writeln(lc);
> }
>
> Jordan

I should say, if you don't care about storing the data, File("somefile.txt","r").byLine.count is probably more idiomatic, and probably just as fast.
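
As a complete program, that might look something like this minimal sketch ("somefile.txt" is just a placeholder name):

void main(){
    import std.algorithm : count;
    import std.stdio : File, writeln;

    // byLine lazily iterates the file one line at a time;
    // count consumes that range and returns the number of lines.
    writeln(File("somefile.txt", "r").byLine.count);
}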

Jordan
May 30, 2017
I do not know if it is the fastest; this is my first attempt, and it is almost as fast as wc on my PC:

int main(string[] args)
{
    import std.stdio : writeln, File;
    import std.array : uninitializedArray;

    auto f = File("data");
    size_t c = 0;
    // Read the file in 1 KiB chunks and count the newline bytes.
    auto buffer = uninitializedArray!(ubyte[])(1024);
    foreach (chunk; f.byChunk(buffer))
    {
        foreach (ch; chunk)
            if (ch == '\n')
                ++c;
    }
    writeln(c);
    return 0;
}

And I believe it could be faster if iopipe were used.


On 30.5.2017 at 22:02, Nitram via Digitalmars-d-learn wrote:
> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/, I was wondering how fast one can do a simple "wc -l" in D.
>
> So I made a couple of short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>
> What would a fast implementation look like in D?

May 30, 2017
On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via Digitalmars-d-learn wrote:
> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/, I was wondering how fast one can do a simple "wc -l" in D.
>
> So I made a couple of short implementations and found myself confronted with slow results compared to "/usr/bin/wc -l".
>
> What would a fast implementation look like in D?

This little challenge piqued my interest.  So I decided to take a shot at seeing if I could beat my system's /usr/bin/wc -l.

First order of business: whenever it comes to performance, always choose the right compiler for the job.  While dmd is the reference compiler and has the latest and greatest features, it doesn't do too well in the performance department.  So the first thing I did was to use gdc instead. Specifically, gdc -O3, to get the most aggressive optimizations the gcc backend can give me.

Second order of business: the naïve algorithm of reading 1 character at a time from the file is obviously suboptimal, so I didn't even consider it.

Third order of business: I constructed a large text file containing 10634420 lines by concatenating 5 copies of a large CSV file I obtained online. The sample had to be large enough so that overhead like program startup times and background system noise don't bias the test results too much.

With this setup, I measured the performance of my system's wc -l as the baseline for comparing the performance of the D solutions. Here's a typical output of my shell's `time wc -l` command:

	real    0m0.344s
	user    0m0.153s
	sys     0m0.191s

Since we're talking about beating wc, and others have already posted the obvious solutions of loading the file into a buffer and scanning it, the next step up is to use std.mmfile to memory-map the file so that we don't waste effort managing file buffers ourselves: let the OS do it for us.

So here are the algorithms I tried:

1) lineCount1(): use std.mmfile to memory-map the file so that we can scan it as a single, contiguous array without any fussy buffer-management details. Result: pretty fast, but still loses to wc. Typical measurements:

	real    0m0.422s
	user    0m0.366s
	sys     0m0.054s

2) lineCount2(): just as a comparison, I did a read-into-buffer algorithm so that we have a reference as to how it performs. As expected, it's worse than the std.mmfile solution. Typical measurements:

	real    0m0.519s
	user    0m0.320s
	sys     0m0.198s

3) lineCount3(): In lineCount1(), I used std.algorithm.searching.count for counting newlines; so I thought, what if the range abstraction is introducing too much overhead?  I decided to write a simple foreach loop instead, in the hope that the simpler code would allow the gcc optimizer to do a better job. Sadly, the result is pretty much the same as lineCount1:

	real    0m0.421s
	user    0m0.379s
	sys     0m0.042s

(The good news, though, is that this shows that using std.algorithm does not introduce excessive overhead compared to a hand-written loop: the high-level range abstraction does not necessarily equate to slower performance.)

4) lineCount4(): Having failed to beat wc thus far, it's time to bring out the big guns.  Since we're essentially counting newlines in the input, who says we have to process the data sequentially from start to end?  Counting from front to back or back to front gives the same answer, as does counting from middle to end, then front to middle. In particular, counting from middle to end *in parallel* with counting from front to middle, then summing the subtotals, gives the same answer. I.e., time to bust out std.parallelism. :-)  So, in my 4th algorithm, I split the input into 16KB chunks, and then scan them for newlines in parallel, creating file_size/16384 subtotals. Then I sum the subtotals in the end. Here's the result, tested on my AMD Phenom 6-core CPU:

	real    0m0.242s
	user    0m1.151s
	sys     0m0.057s

Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair margin, too.  Eat your heart out, wc!!!  (The large user time is because we're using all 6 cores at once. But the actual elapsed time is shorter.)

Here's the code for all of the above:

---------snip---------
import std.stdio;

// real    0m0.422s
// user    0m0.366s
// sys     0m0.054s
size_t lineCount1(string filename)
{
    import std.algorithm.searching : count;
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    return data.count('\n');
}

// real    0m0.519s
// user    0m0.320s
// sys     0m0.198s
size_t lineCount2(string filename)
{
    import std.algorithm.searching : count;

    auto f = File(filename);
    ubyte[] buf;
    size_t c = 0;

    buf.length = 32768;
    do
    {
        auto d = f.rawRead(buf);
        c += d.count('\n');
    } while (!f.eof());

    return c;
}

// real    0m0.421s
// user    0m0.379s
// sys     0m0.042s
size_t lineCount3(string filename)
{
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t c;

    foreach (i; 0 .. data.length)
    {
        if (data[i] == '\n') c++;
    }
    return c;
}

// real    0m0.242s
// user    0m1.151s
// sys     0m0.057s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;

    enum blockSize = 16384;
    if (data.length == 0) return 0;
    subtotals.length = 1 + (data.length - 1) / blockSize;

    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') subtotal++;
        }
    }
    return subtotals.sum;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("Please specify target file.");
        return 1;
    }

    auto filename = args[1];
    //auto count = lineCount1(filename);
    //auto count = lineCount2(filename);
    //auto count = lineCount3(filename);
    auto count = lineCount4(filename);
    writeln(count);
    return 0;
}

// vim:set sw=4 ts=4 et ai:
---------snip---------


T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL
May 30, 2017
On 05/30/2017 01:02 PM, Nitram wrote:
> After reading
> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/, I
> was wondering how fast one can do a simple "wc -l" in D.
>
> So I made a couple of short implementations and found myself confronted
> with slow results compared to "/usr/bin/wc -l".
>
> What would a fast implementation look like in D?

I could not make the D program come close to wc's performance when the data was piped from stdin. A typical run with Daniel Kozak's program:

$ time cat deleteme.txt | wc -l
5062176

real	0m0.086s
user	0m0.068s
sys	0m0.048s

$ time cat deleteme.txt | ./deneme
5062176

real	0m0.174s
user	0m0.152s
sys	0m0.072s
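
(For reference, a minimal sketch of the kind of stdin-reading line counter being timed here; the 64 KiB chunk size is an arbitrary choice, not necessarily what Daniel used:)

void main()
{
    import std.stdio : stdin, writeln;

    size_t c = 0;
    // Read standard input in fixed-size chunks and count newline bytes.
    foreach (chunk; stdin.byChunk(65536))
    {
        foreach (ch; chunk)
            if (ch == '\n')
                ++c;
    }
    writeln(c);
}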

Ali

May 30, 2017
P.S. After I posted the code, I took a closer look at the disassembly and found that gdc wasn't generating the best code for the parallel foreach loop body.  I haven't fully traced the cause yet, but I did find a simple optimization (arguably a micro-optimization): updating the subtotal inside the inner loop is a bit inefficient, because the subtotal lives outside the loop body and, due to the way std.parallelism is implemented, is passed by reference to the loop body. This caused gdc not to enregister the subtotal, so incrementing it required a cache round trip at best and a full RAM round trip at worst.

So I introduced a local count variable for incrementing, and only assign to the subtotal array at the end of the block:

-------snip-------
// real    0m0.240s
// user    0m1.045s
// sys     0m0.059s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;

    enum blockSize = 16384;
    if (data.length == 0) return 0;
    subtotals.length = 1 + (data.length - 1) / blockSize;

    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        size_t c = 0;
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') c++;
        }
        subtotal = c;
    }
    return subtotals.sum;
}
-------snip-------

A look at the disassembly showed that gdc had enregistered the subtotal and was thereby able to eliminate the branch from the inner loop, using a sete instruction and an add for the if statement.  The measured time is only marginally better, though, and doesn't change the overall picture in a significant way.  But I thought this might be a useful tip for people who want to squeeze out the last drop of juice from their CPUs. ;-)
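
(Expressed back in D, that branchless code is roughly what you'd get by hand-writing the parallel loop from the listing above like this; the boolean comparison converts to 0 or 1, which is exactly what the sete/add pair computes:)

-------snip-------
    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        size_t c = 0;
        foreach (i; blockSize*j .. end)
            c += (data[i] == '\n');  // no branch: bool promotes to 0 or 1
        subtotal = c;
    }
-------snip-------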

Next, I'll have to look into why gdc didn't unroll the inner loop, like I thought it would.  But I'm out of time now, so that will have to wait for another day.


T

-- 
The peace of mind---from knowing that viruses which exploit Microsoft system vulnerabilities cannot touch Linux---is priceless. -- Frustrated system administrator.
May 31, 2017
On Tuesday, 30 May 2017 at 23:41:01 UTC, H. S. Teoh wrote:

> This little challenge piqued my interest.  So I decided to take a shot at seeing if I could beat my system's /usr/bin/wc -l.
>
> First order of business: whenever it comes to performance, always choose the right compiler for the job...


>
> Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair margin, too.  Eat your heart out, wc!!!  (The large user time is because we're using all 6 cores at once. But the actual elapsed time is shorter.)


Hm... I cheated a little bit: took your program and compiled with `ldc2 -release -O3 -mcpu=skylake`. For data I took std.datetime concatenated 1000 times (35446000 lines). Results (minimum of 10 runs each):

wc -l
real 0.50
user 0.40
sys 0.09

lineCount1
real 0.23
user 0.19
sys 0.04

lineCount2
real 0.29
user 0.17
sys 0.12

lineCount3
real 0.23
user 0.18
sys 0.04

lineCount4
real 0.22
user 1.52
sys 0.04

Seems like all of them beat wc, so machine and compiler matter a great deal in this comparison. Thing is, on small files wc is going to win pretty much always, due to D's clunky startup. But with larger ones, wc falls far behind.
Interestingly, both lineCount1 and lineCount3 on this machine are nearly even with lineCount4 :)
May 31, 2017
On 31.5.2017 at 02:13, Ali Çehreli via Digitalmars-d-learn wrote:

> I could not make the D program come close to wc's performance when the data was piped from stdin. A typical run with Daniel Kozak's program:
>
> $ time cat deleteme.txt | wc -l
> 5062176
>
> real    0m0.086s
> user    0m0.068s
> sys    0m0.048s
>
> $ time cat deleteme.txt | ./deneme
> 5062176
>
> real    0m0.174s
> user    0m0.152s
> sys    0m0.072s
>
> Ali
>

How do you compile it? When I use ldc2 -O3 -release -mcpu=bdver1 lc.d, my code is even faster than wc.
May 31, 2017
On 05/30/2017 11:50 PM, Daniel Kozak via Digitalmars-d-learn wrote:

> How do you compile it? When I use ldc2 -O3 -release -mcpu=bdver1 lc.d,
> my code is even faster than wc.

My bad: I'm not familiar with ldc's optimization options. (I used -O3 but not -release.) Now I get the same performance as 'wc -l' when I add -release.

Ali
