Thread overview | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
May 26, 2013 New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
If there is anything that come out of UTF-8 discussion is that I decided to dust off my experimental implementation of UTF-8 stride function. Just for fun. The key difference vs std is in handling non-ASCII case. I'm replacing bsr intrinsic with a what I call an "in-register lookup table" (neat stuff that is a piece of cake, thx to CTFE). See unittest/benchmark here: https://gist.github.com/blackwhale/5653927 I'm running tests against wiki titles dumps. For me the results are mixed but in 2 of 3 _builds_ my version consistently wins (sometimes by as much as 50%). 1. build is win32 dmd with -release -O -inline -noboundscheck my version is consistently faster (results below) 2. On linux x64 with the same config my version is consistently slower 3. LDC on linux x64 with ldc2 -O3 -d-noboundscheck my version is again faster with large margin It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot). Thus I encourage curious folks to measure/analyze it and report back (don't forget to include your processor model). The unbeatable advantage is however that my version doesn't require bsr/lzcount instruction :) BTW do ARM/PowerPC have any analog of it? Test files I used: https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0 Some dumps of my test runs win32 runs (time taken in usec): fast_stride ruwiki-latest-all-titles-in-ns0 stride 313756 myStride 229650 myStride 235091 stride 312563 fast_stride enwiki-latest-all-titles-in-ns0 stride 346577 myStride 279915 myStride 278684 stride 348902 fast_stride enwiki-latest-all-titles-in-ns0 stride 345866 myStride 280902 myStride 279780 stride 345653 fast_stride arwiki-latest-all-titles-in-ns0 stride 46548 myStride 33840 myStride 34959 stride 46342 fast_stride dewiki-latest-all-titles-in-ns0 stride 79715 myStride 64719 myStride 64672 stride 79848 dmd linux 64 runs ./fast_stride enwiki-latest-all-titles-in-ns0 stride 377258 myStride 630367 myStride 633262 stride 378523 ./fast_stride arwiki-latest-all-titles-in-ns0 stride 33924 myStride 38807 myStride 47708 stride 40160 ./fast_stride arwiki-latest-all-titles-in-ns0 stride 35110 myStride 39750 myStride 49942 stride 33597 -- Dmitry Olshansky |
May 26, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | WRT to the worse Linux64 case:
I recommend infinite-cycling it and testing in perf top.
(If you're on Ubuntu/derivative or maybe Debian, just type "perf top",
it will tell you what package to install, and once installed, "perf top" again, while the benchmark is running)
You'll get a precise real-time line-wise (with ability to drill down to ASM) profile (like "top", but for functions).
With some command-line options (google "linux perf"), you can also look
at cache misses, branch mispredictions, and so on. Compare that with the original version and you might find why it's slower.
(Don't have time to test anything right now)
On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:
> If there is anything that come out of UTF-8 discussion is that I decided to dust off my experimental implementation of UTF-8 stride function. Just for fun.
>
> The key difference vs std is in handling non-ASCII case.
> I'm replacing bsr intrinsic with a what I call an "in-register lookup table" (neat stuff that is a piece of cake, thx to CTFE).
>
> See unittest/benchmark here:
> https://gist.github.com/blackwhale/5653927
>
> I'm running tests against wiki titles dumps.
>
> For me the results are mixed but in 2 of 3 _builds_ my version consistently wins (sometimes by as much as 50%).
>
> 1. build is win32 dmd with -release -O -inline -noboundscheck
> my version is consistently faster (results below)
> 2. On linux x64 with the same config
> my version is consistently slower
>
> 3. LDC on linux x64 with ldc2 -O3 -d-noboundscheck
> my version is again faster with large margin
>
> It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot).
>
> Thus I encourage curious folks to measure/analyze it and report back (don't forget to include your processor model).
>
> The unbeatable advantage is however that my version doesn't require bsr/lzcount instruction :) BTW do ARM/PowerPC have any analog of it?
>
> Test files I used:
> https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0
> https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0
> https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0
> https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0
>
> Some dumps of my test runs
>
> win32 runs (time taken in usec):
>
> fast_stride ruwiki-latest-all-titles-in-ns0
> stride 313756
> myStride 229650
> myStride 235091
> stride 312563
>
> fast_stride enwiki-latest-all-titles-in-ns0
> stride 346577
> myStride 279915
> myStride 278684
> stride 348902
>
> fast_stride enwiki-latest-all-titles-in-ns0
> stride 345866
> myStride 280902
> myStride 279780
> stride 345653
>
> fast_stride arwiki-latest-all-titles-in-ns0
> stride 46548
> myStride 33840
> myStride 34959
> stride 46342
>
> fast_stride dewiki-latest-all-titles-in-ns0
> stride 79715
> myStride 64719
> myStride 64672
> stride 79848
>
> dmd linux 64 runs
>
> ./fast_stride enwiki-latest-all-titles-in-ns0
> stride 377258
> myStride 630367
> myStride 633262
> stride 378523
>
> ./fast_stride arwiki-latest-all-titles-in-ns0
> stride 33924
> myStride 38807
> myStride 47708
> stride 40160
>
> ./fast_stride arwiki-latest-all-titles-in-ns0
> stride 35110
> myStride 39750
> myStride 49942
> stride 33597
|
May 26, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote: > It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot). For such cases, I found Agner's benchmarking utilities to be very useful. They print exact CPU statistics, such as numbers of micro-ops, cache misses, mispredicted branches, etc. I've used them very successfully when tuning my appender implementation. To use them with D, I modified his C++ program to load a DLL and call a function, taking the DLL and function names from the command line. Original program: http://www.agner.org/optimize/testp.zip My patch (to load a DLL): http://dump.thecybershadow.net/5f55e8be5f8cd38ad60f218957ef24bb/PMCTestB.diff Usage example (sort of): https://github.com/CyberShadow/DAppenderResearch/blob/master/go-dll.bat Hope this helps :) |
May 26, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Vladimir Panteleev | You don't need any instrumentation for perf and can get similar output (even in real time). (Don't intend to start a profiler war, but recommend looking at perf before messings with DLLs and the like) (although perf is is Linux-only) I think the Windows version of AMD CodeAnalyst might have similar features, again without instrumentation, but never tried it (CodeAnalyst is gratis). Intel VTune definitely does, but it's not gratis. |
May 26, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kiith-Sa | On Sunday, 26 May 2013 at 21:19:12 UTC, Kiith-Sa wrote:
> You don't need any instrumentation for perf and can get similar output (even in real time). (Don't intend to start a profiler war, but recommend looking at perf before messings with DLLs and the like) (although perf is is Linux-only)
> I think the Windows version of AMD CodeAnalyst might have similar features, again without instrumentation, but never tried it (CodeAnalyst is gratis). Intel VTune definitely does, but it's not gratis.
Benchmarking and profiling are slightly different tasks. I've used AMD CodeAnalyst as well (although you need an AMD processor to use all of its features), but no profiler beats running one batch file and instantly getting a number to compare against.
|
May 26, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote: > [..] > Thus I encourage curious folks to measure/analyze it and report back (don't forget to include your processor model). > [..] Ok, I just tested on my old trusty linux 64bit (Ubuntu 12.04). I had to download those *wiki* files from this url: https://github.com/blackwhale/gsoc-bench-2012/archive/master.zip because github gave an error trying to download the ones that are more than 10Mb. I ran the tests multiple times before copying them here. And here are the results: $ uname -a Linux lolita 3.2.0-43-generic #68-Ubuntu SMP Wed May 15 03:33:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux $ grep 'model.name\|cpu.MHz' /proc/cpuinfo model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ cpu MHz : 2109.518 model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ cpu MHz : 2109.518 $ grep MemTotal /proc/meminfo MemTotal: 4049780 kB $ dmd -O -inline -release -noboundscheck fast_stride.d $ for a in *wiki*; do echo; echo $a: ; ./fast_stride $a; done arwiki-latest-all-titles-in-ns0: stride 67681 myStride 55908 myStride 66026 stride 66328 dewiki-latest-all-titles-in-ns0: stride 155071 myStride 195449 myStride 196586 stride 154627 enwiki-latest-all-titles-in-ns0: stride 688482 myStride 879950 myStride 879451 stride 689087 ruwiki-latest-all-titles-in-ns0: stride 449133 myStride 364808 myStride 512485 stride 448841 --jm |
May 26, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Juan Manuel Cabo | And these are the results for the same linux 64bit system but compiling with -m32: $ dmd -m32 -O -inline -release -noboundscheck fast_stride.d $ for a in *wiki*; do echo ; echo $a: ; ./fast_stride $a; done arwiki-latest-all-titles-in-ns0: stride 89362 myStride 49974 myStride 51140 stride 88308 dewiki-latest-all-titles-in-ns0: stride 138381 myStride 116971 myStride 116662 stride 139681 enwiki-latest-all-titles-in-ns0: stride 584787 myStride 490681 myStride 490909 stride 584694 ruwiki-latest-all-titles-in-ns0: stride 585372 myStride 333905 myStride 341274 stride 585050 --jm |
May 27, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kiith-Sa | 27-May-2013 01:04, Kiith-Sa пишет: > WRT to the worse Linux64 case: > I recommend infinite-cycling it and testing in perf top. > > (If you're on Ubuntu/derivative or maybe Debian, just type "perf top", > it will tell you what package to install, and once installed, "perf > top" again, while the benchmark is running) > > You'll get a precise real-time line-wise (with ability to drill down to > ASM) profile (like "top", but for functions). > > With some command-line options (google "linux perf"), you can also look > at cache misses, branch mispredictions, and so on. Compare that with the > original version and you might find why it's slower. > > (Don't have time to test anything right now) Just tried it. Now I at least see that in 32bit my version is faster, whereas on 64bit it isn't (that is on DMD). One curiosity is that the code for ASCII case is the same yet even on English text the difference is about the same. Another one is that both function are not even partially inlined. -- Dmitry Olshansky |
May 27, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Juan Manuel Cabo | 27-May-2013 01:50, Juan Manuel Cabo пишет: > And these are the results for the same linux 64bit system but compiling > with -m32: > This is mostly in agreement with what I have on my 4-core AMD Phenom. About the same on Core i5-3550 at work. Looks like I indeed need 'clock for clock' analysis to draw any conclusion. > $ dmd -m32 -O -inline -release -noboundscheck fast_stride.d > > $ for a in *wiki*; do echo ; echo $a: ; ./fast_stride $a; done > > arwiki-latest-all-titles-in-ns0: > stride 89362 > myStride 49974 > myStride 51140 > stride 88308 > > dewiki-latest-all-titles-in-ns0: > stride 138381 > myStride 116971 > myStride 116662 > stride 139681 > > enwiki-latest-all-titles-in-ns0: > stride 584787 > myStride 490681 > myStride 490909 > stride 584694 > > ruwiki-latest-all-titles-in-ns0: > stride 585372 > myStride 333905 > myStride 341274 > stride 585050 > > --jm -- Dmitry Olshansky |
May 27, 2013 Re: New UTF-8 stride function | ||||
---|---|---|---|---|
| ||||
Posted in reply to Vladimir Panteleev | 27-May-2013 01:13, Vladimir Panteleev пишет: > On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote: >> It's the kind of thing that is tremendously hard to measure accurately >> since it depends on the workload, architecture and the time spent is >> very small. So don't take it by word I'm almost certain that something >> is amiss (compiler switches and whatnot). > > For such cases, I found Agner's benchmarking utilities to be very > useful. They print exact CPU statistics, such as numbers of micro-ops, > cache misses, mispredicted branches, etc. I've used them very > successfully when tuning my appender implementation. > Yes, Agner is da man. Just hoped I could postpone this but... welcome to the micro-optimization world I guess. > To use them with D, I modified his C++ program to load a DLL and call a > function, taking the DLL and function names from the command line. > > Original program: > http://www.agner.org/optimize/testp.zip > > My patch (to load a DLL): > http://dump.thecybershadow.net/5f55e8be5f8cd38ad60f218957ef24bb/PMCTestB.diff > > > Usage example (sort of): > https://github.com/CyberShadow/DAppenderResearch/blob/master/go-dll.bat > > Hope this helps :) Thanks, I'll try it out. But jezz I have win8 so I'd start with linux version :) -- Dmitry Olshansky |
Copyright © 1999-2021 by the D Language Foundation