May 31, 2017
On Wednesday, 31 May 2017 at 17:23:46 UTC, Ali Çehreli wrote:
> On 05/30/2017 11:50 PM, Daniel Kozak via Digitalmars-d-learn wrote:
>
> > How do you compile it? When I use ldc2 -O3  -release
> -mcpu=bdver1 lc.d
> > my code is even faster than wc
>
> My bad: I'm not familiar with ldc's optimization options. (I used -O3 but not -release) Now I get the same performance as 'wc -l' when I add -release.
>
> Ali

It seems to me that your initial result is more interesting: you managed to get faster than wc *while keeping bounds safety*. At a time when safety is finally getting the importance it should always have had, showing that you can write fast code without sacrificing any of it is important, I think.
May 31, 2017
On Tue, May 30, 2017 at 05:13:46PM -0700, Ali Çehreli via Digitalmars-d-learn wrote: [...]
> I could not make the D program come close to wc's performance when the data was piped from stdin.
[...]

Hmm. This is a particularly interesting case, because I adapted some of my algorithms to handle reading from stdin (i.e., std.mmfile is not an option), and I could not get it to reach wc's performance!  I even installed ldc just to see if that made a difference... it was somewhat faster than gdc, but still, the timings were about twice as slow as wc.

I did some digging around, and it seems that wc is using glibc's memchr, which is highly-optimized, whereas std.algorithm.count just uses a simplistic loop. Which is strange, because I'm pretty sure somebody optimized std.algorithm some time ago to use memchr() instead of a loop when searching for a byte value in an array.  Whatever happened to that??


T

-- 
Sometimes the best solution to morale problems is just to fire all of the unhappy people. -- despair.com
May 31, 2017
On Tuesday, 30 May 2017 at 23:41:01 UTC, H. S. Teoh wrote:
> On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via Digitalmars-d-learn wrote:
>> After reading https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was wondering how fast one can do a simple "wc -l" in D.
>> 
> size_t lineCount3(string filename)
> {
>     import std.mmfile;
>
>     auto f = new MmFile(filename);
>     auto data = cast(ubyte[]) f[];
>     size_t c;
>
>     foreach (i; 0 .. data.length)
>     {
>         if (data[i] == '\n') c++;
>     }
>     return c;
> }
>
> // real    0m0.242s
> // user    0m1.151s
> // sys     0m0.057s

You should try something more like

    auto data = cast(ulong[]) f[];

    foreach (i; 0 .. data.length)

and then use bit fiddling to count the number of \n bytes in each loaded ulong. This divides the number of load instructions by 8, and the counting of \n in the loaded word only uses registers. I don't know if it is really faster than the byte-wise loop, though.

Here is a function to detect whether a given character occurs in a ulong:

    auto detect(alias CHAR)(ulong t)
    {
      // Broadcast the character into every byte of a 64-bit mask.
      enum ulong u = CHAR;
      enum mask1 = u|(u<<8)|(u<<16)|(u<<24UL);
      enum mask = (mask1<<32)|mask1;
      // XOR turns matching bytes into zero; the subtract/and-not sequence
      // then sets the high bit of every byte that was zero.
      return ((t^mask) - 0x0101010101010101UL) & ~(t^mask) & 0x8080808080808080UL;
    }

The returned value is 0 if the character is not in t, and the highest bit of each byte that contained the character is set. If the CPU has a fast popcnt, counting the matches should then be easy.
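
For instance, counting '\n' bytes in a loaded word could then be a one-liner on top of detect (untested sketch; note the carry-propagation caveat discussed further down the thread, which can make the per-byte flags overcount for rare byte patterns):

    import core.bitop : popcnt;

    // Count '\n' bytes in one 64-bit word: each match sets the high bit of
    // its byte, so popcnt of the returned mask gives the number of matches.
    size_t countNewlines(ulong word)
    {
        return popcnt(detect!'\n'(word));
    }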


May 31, 2017
I am glad to see this participation on this issue :)
The hints about trying another compiler and std.mmfile turned out to be very effective.

Even this simple code is faster than my system's "wc -l" now:
>void main() {
>	import std.stdio;
>	writeln(lcs("benchmark.dat"));
>}
>
>size_t lcs(string filename) {
>	import std.mmfile;
>	auto f = new MmFile(filename);
>	auto data = cast(ubyte[]) f[];
>	size_t c = 0;
>	foreach(ref i ; data) {
>		if (i == '\n') c++;
>	}
>	return c;
>}

>time wc -l ./benchmark.dat
>10500000 ./benchmark.dat
>wc -l ./benchmark.dat  0,06s user 0,03s system 99% cpu 0,089 total

>time ./lcs
>10500000
>./lcs  0,05s user 0,01s system 99% cpu 0,061 total
May 31, 2017
On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn wrote:
> I did some digging around, and it seems that wc is using glibc's memchr, which is highly-optimized, whereas std.algorithm.count just uses a simplistic loop. Which is strange, because I'm pretty sure somebody optimized std.algorithm some time ago to use memchr() instead of a loop when searching for a byte value in an array.  Whatever happened to that??

I don't know, but memchr wouldn't work with CTFE, so someone might have removed it to make it work in CTFE (though that could be done with a different branch for CTFE). Or maybe it never made it into std.algorithm for one reason or another.

- Jonathan M Davis

May 31, 2017
On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn wrote:
> > I did some digging around, and it seems that wc is using glibc's memchr, which is highly-optimized, whereas std.algorithm.count just uses a simplistic loop. Which is strange, because I'm pretty sure somebody optimized std.algorithm some time ago to use memchr() instead of a loop when searching for a byte value in an array. Whatever happened to that??
> 
> I don't know, but memchr wouldn't work with CTFE, so someone might have removed it to make it work in CTFE (though that could be done with a different branch for CTFE). Or maybe it never made it into std.algorithm for one reason or another.
[...]

I checked the Phobos code again, and it appears that my memory deceived
me. Somebody *did* add memchr optimization to find() and its friends,
but not to count().

CTFE compatibility is not a problem, since we can just if(__ctfe) the
optimized block away.

I'm currently experimenting with a memchr-optimized version of count(), but I'm getting mixed results: on small arrays or large arrays densely packed with matching elements, the memchr version runs rather slowly, because it involves a function call into the C library per matching element.  On large arrays only sparsely populated with matching elements, though, the memchr-optimized version beats the current code by about an order of magnitude.
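
Roughly, the memchr-based version I'm playing with looks something like this (a simplified sketch, not the exact code, with the CTFE branch being just the plain loop):

    size_t countByte(const(ubyte)[] haystack, ubyte needle)
    {
        // The simple loop doubles as the CTFE path and reference implementation.
        if (__ctfe)
        {
            size_t c;
            foreach (b; haystack)
                if (b == needle) ++c;
            return c;
        }

        import core.stdc.string : memchr;

        size_t n;
        auto p = haystack.ptr;
        size_t remaining = haystack.length;
        while (remaining)
        {
            // One C library call per match: cheap when matches are sparse,
            // expensive when they are dense.
            auto hit = cast(const(ubyte)*) memchr(p, needle, remaining);
            if (hit is null) break;
            ++n;
            remaining -= cast(size_t) (hit - p) + 1;
            p = hit + 1;
        }
        return n;
    }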

Since it wouldn't be a wise idea to assume sparsity of matches in Phobos, I decided to do a little more digging, and looked up the glibc implementation of memchr. The main optimization is that it iterates over the array not by byte, as a naïve loop would do, but by ulong's. (Of course, the first n bytes and last n bytes that are not ulong-aligned are checked with a per-byte loop; so for very short arrays it doesn't lose out to the naïve loop.)  In each iteration over ulong, it performs the bit-twiddling hack alluded to by Nitram to detect the presence of matching bytes, upon which it breaks out to the closing per-byte loop to find the first match. For short arrays, or arrays where a match is quickly found, it's comparable in performance to the naïve loop; for large arrays where the match is not found until later, it easily outperforms the naïve loop.

My current thought is to adopt the same approach: iterate over size_t or some such larger unit, and adapt the bit-twiddling hack to be able to count the number of matches in each size_t.  This is turning out to be trickier than I'd like, though, because there is a case where carry propagation makes it unclear how to derive the number of matches without iterating over the bytes a second time.

But this may not be a big problem, since size_t.sizeof is relatively small, so I can probably loop over individual bytes when one or more matches is detected, and a sufficiently-capable optimizer like ldc or gdc would be able to unroll this into a series of sete + add instructions, no branches that might stall the CPU pipeline. For densely-matching arrays, this should still have comparable performance to the naïve loops; for sparsely-matching arrays this should show significant speedups.
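
To make that concrete, the loop I have in mind is something along these lines (alignment handling and unrolling omitted; an illustration of the approach, not the final code):

    size_t countMatches(const(ubyte)[] data, ubyte needle)
    {
        enum ulong lows  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;
        immutable ulong mask = needle * lows;   // broadcast needle into every byte

        size_t c;
        size_t i;
        for (; i + 8 <= data.length; i += 8)
        {
            // Unaligned load for simplicity; a real version would align first.
            ulong word = *cast(const(ulong)*) (data.ptr + i);
            ulong x = word ^ mask;              // matching bytes become zero
            if (((x - lows) & ~x & highs) != 0) // any zero byte in this word?
            {
                // Short per-byte loop; ldc/gdc can turn this into sete + add.
                foreach (b; data[i .. i + 8])
                    if (b == needle) ++c;
            }
        }
        foreach (b; data[i .. $])               // leftover tail bytes
            if (b == needle) ++c;
        return c;
    }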


T

-- 
My program has no bugs! Only unintentional features...
June 01, 2017
On Tue, 2017-05-30 at 17:22 -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> […]
> performance in a significant way.  But I thought this might be a
> useful
> tip for people who want to squeeze out the last drop of juice from
> their
> CPUs. ;-)
> 
[…]

I have the beginnings of wc written in various languages. I may well follow this thread up to create a full D implementation of wc that people can then optimise a bit.

However, I note here that the Chapel folk are taking a quite interesting view of algorithm implementation in the Benchmarks Game. They are totally eschewing "heroic implementations" such as those in C, C++, etc. in favour of understandable and simple ones. Their take on raw performance is "if you need it to go faster, use a bigger computer". Which is quite easy when you have a number of Cray computers at your disposal. :-) Whilst having some fun, the Benchmarks Game has become all about heroic implementations on a specific computer. Which makes the Chapel line an excellent one.

OK, for wc the case is different because it is about performance on the user's computer. But still I like the "keep the implementation comprehensible, avoid heroic implementations" line.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

May 31, 2017
On Thu, Jun 01, 2017 at 12:20:53AM +0100, Russel Winder via Digitalmars-d-learn wrote: [...]
> However, I note here that the Chapel folk are taking a quite interesting view of algorithm implementation in the Benchmarks Game. They are totally eschewing "heroic implementations" such as those in C, C++, etc. in favour of understandable and simple ones. Their take on raw performance is "if you need it to go faster, use a bigger computer". Which is quite easy when you have a number of Cray computers at your disposal. :-) Whilst having some fun, the Benchmarks Game has become all about heroic implementations on a specific computer. Which makes the Chapel line an excellent one.
> 
> OK, for wc the case is different because it is about performance on the user's computer. But still I like the "keep the implementation comprehensible, avoid heroic implementations" line.
[...]

With D, we can have the cake and eat it too.  The understandable / naïve implementation can be available as a fallback (and reference implementation), with OS-specific optimized implementations guarded under version() or static-if blocks so that one could, in theory, provide implementations specific to each supported platform that would give the best performance.
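
Something along these lines (the platform-specific fast path here is only a hypothetical placeholder name):

    size_t lineCount(const(ubyte)[] data)
    {
        version (linux)
        {
            // Hypothetical optimized path, e.g. built on glibc's memchr.
            return lineCountMemchr(data);
        }
        else
        {
            // Understandable reference implementation, doubling as the fallback.
            size_t c;
            foreach (b; data)
                if (b == '\n') ++c;
            return c;
        }
    }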

I disagree about the philosophy of "if you need to go faster, use a bigger computer".  There are some inherently complex problems (such as NP-complete, PSPACE-complete, or worse, outright exponential class problems) where the difference between a "heroic implementation" of a computational primitive and a naïve one may mean the difference between obtaining a result in this lifetime vs. practically never. Or, more realistically speaking, the difference between being able to solve moderately-complex problem instances vs. being able to solve only trivial toy instances.  When you're dealing with exponential complexity, every small bit counts, and you can never get a big enough computer.

An even more down-to-earth counterargument is that if CPU vendors had been content with understandable, simple CPU implementations, and eschewed "heroic", hard-to-understand things like instruction pipelines and cache hierarchies, we'd still be stuck with 16 MHz CPU's in 2017.


T

-- 
Not all rumours are as misleading as this one.
June 01, 2017
On Wednesday, 31 May 2017 at 23:03:54 UTC, H. S. Teoh wrote:
> On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
>> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn wrote:
>> > I did some digging around, and it seems that wc is using glibc's memchr, which is highly-optimized, whereas std.algorithm.count just uses a simplistic loop. Which is strange, because I'm pretty sure somebody optimized std.algorithm some time ago to use memchr() instead of a loop when searching for a byte value in an array. Whatever happened to that??
>> 
>> I don't know, but memchr wouldn't work with CTFE, so someone might have removed it to make it work in CTFE (though that could be done with a different branch for CTFE). Or maybe it never made it into std.algorithm for one reason or another.
> [...]
>
> I checked the Phobos code again, and it appears that my memory deceived
> me. Somebody *did* add memchr optimization to find() and its friends,
> but not to count().
>
> CTFE compatibility is not a problem, since we can just if(__ctfe) the
> optimized block away.
>
> I'm currently experimenting with a memchr-optimized version of count(), but I'm getting mixed results: on small arrays or large arrays densely packed with matching elements, the memchr version runs rather slowly, because it involves a function call into the C library per matching element.  On large arrays only sparsely populated with matching elements, though, the memchr-optimized version beats the current code by about an order of magnitude.
>
> Since it wouldn't be a wise idea to assume sparsity of matches in Phobos, I decided to do a little more digging, and looked up the glibc implementation of memchr. The main optimization is that it iterates over the array not by byte, as a naïve loop would do, but by ulong's.

That's what I suggested above. It's the first optimisation to do when looping over a buffer (memcpy, memset, memchr etc.).


> (Of course, the first n bytes and
> last n bytes that are not ulong-aligned are checked with a per-byte loop; so for very short arrays it doesn't lose out to the naïve loop.)  In each iteration over ulong, it performs the bit-twiddling hack alluded to by Nitram to detect the presence of matching bytes, upon which it breaks out to the closing per-byte loop to find the first match. For short arrays, or arrays where a match is quickly found, it's comparable in performance to the naïve loop; for large arrays where the match is not found until later, it easily outperforms the naïve loop.

It is also important not to overdo the optimisations, as the generated overhead can manifest as pessimisations that are not visible in a specific benchmark. The code size explosion may induce I-cache misses, and it can also cost I-TLB misses. Worse, using SSE or AVX can hurt thread switch time or even reduce the turboing of the CPU.
It's currently a hot topic on realworldtech[1]: Linus Torvalds rants about this issue with memcpy(), which is over-engineered and does more harm than good in practice but has nice benchmark results.

>
> My current thought is to adopt the same approach: iterate over size_t or some such larger unit, and adapt the bit-twiddling hack to be able to count the number of matches in each size_t.  This is turning out to be trickier than I'd like, though, because there is a case where carry propagation makes it unclear how to derive the number of matches without iterating over the bytes a second time.
>
> But this may not be a big problem, since size_t.sizeof is relatively small, so I can probably loop over individual bytes when one or more matches is detected, and a sufficiently-capable optimizer like ldc or gdc would be able to unroll this into a series of sete + add instructions, no branches that might stall the CPU pipeline. For densely-matching arrays, this should still have comparable performance to the naïve loops; for sparsely-matching arrays this should show significant speedups.
>
That's what I think too: a small and simple loop to count the matching bytes in the ulong would be somewhat faster than the bit-twiddling trick, which requires a population count of the bits.

[1]: http://www.realworldtech.com/forum/?threadid=168200&curpostid=168700

May 31, 2017
On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via Digitalmars-d-learn wrote:
> On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> > On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn wrote:
> > > I did some digging around, and it seems that wc is using glibc's memchr, which is highly-optimized, whereas std.algorithm.count just uses a simplistic loop. Which is strange, because I'm pretty sure somebody optimized std.algorithm some time ago to use memchr() instead of a loop when searching for a byte value in an array. Whatever happened to that??
> >
> > I don't know, but memchr wouldn't work with CTFE, so someone might have removed it to make it work in CTFE (though that could be done with a different branch for CTFE). Or maybe it never made it into std.algorithm for one reason or another.
>
> [...]
>
> I checked the Phobos code again, and it appears that my memory deceived
> me. Somebody *did* add memchr optimization to find() and its friends,
> but not to count().
>
> CTFE compatibility is not a problem, since we can just if(__ctfe) the
> optimized block away.
>
> I'm currently experimenting with a memchr-optimized version of count(), but I'm getting mixed results: on small arrays or large arrays densely packed with matching elements, the memchr version runs rather slowly, because it involves a function call into the C library per matching element.  On large arrays only sparsely populated with matching elements, though, the memchr-optimized version beats the current code by about an order of magnitude.
>
> Since it wouldn't be a wise idea to assume sparsity of matches in Phobos, I decided to do a little more digging, and looked up the glibc implementation of memchr. The main optimization is that it iterates over the array not by byte, as a naïve loop would do, but by ulong's. (Of course, the first n bytes and last n bytes that are not ulong-aligned are checked with a per-byte loop; so for very short arrays it doesn't lose out to the naïve loop.)  In each iteration over ulong, it performs the bit-twiddling hack alluded to by Nitram to detect the presence of matching bytes, upon which it breaks out to the closing per-byte loop to find the first match. For short arrays, or arrays where a match is quickly found, it's comparable in performance to the naïve loop; for large arrays where the match is not found until later, it easily outperforms the naïve loop.
>
> My current thought is to adopt the same approach: iterate over size_t or some such larger unit, and adapt the bit-twiddling hack to be able to count the number of matches in each size_t.  This is turning out to be trickier than I'd like, though, because there is a case where carry propagation makes it unclear how to derive the number of matches without iterating over the bytes a second time.
>
> But this may not be a big problem, since size_t.sizeof is relatively small, so I can probably loop over individual bytes when one or more matches is detected, and a sufficiently-capable optimizer like ldc or gdc would be able to unroll this into a series of sete + add instructions, no branches that might stall the CPU pipeline. For densely-matching arrays, this should still have comparable performance to the naïve loops; for sparsely-matching arrays this should show significant speedups.

If you're really trying to make it fast, there may be something that you can do with SIMD. IIRC, Brian Schott did that with his lexer (or maybe he was just talking about it - I don't remember for sure).

- Jonathan M Davis