March 13, 2020
On Thursday, 12 March 2020 at 20:39:59 UTC, p.shkadzko wrote:
> On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
>> [...]
>
> I am actually intrigued by the timings for huge matrices. Why are Mir D and standard D so much better than NumPy? Once we get to 500x600 and 1000x1000 sizes, there is a huge drop in performance for NumPy but not so much for D. You mentioned the L3 cache, but the CPU architecture is the same for all the benchmarks, so what's going on?

The interpreter getting in the way of the hardware prefetcher, maybe.
March 14, 2020
On 2020-03-12 13:59, Pavel Shkadzko wrote:
> I have done several benchmarks against NumPy for various 2D matrix operations. The purpose was mere curiosity and to spread the word about the Mir D library among the office data engineers.
> Since I am not a D expert, I would be happy if someone could take a second look and double check.
> 
> https://github.com/tastyminerals/mir_benchmarks
> 
> Compile and run the project via: dub run --compiler=ldc --build=release

Have you tried to compile with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
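
As a rough sketch, a plain ldc2 invocation with LTO enabled for both the application and the default libraries would look something like the following (the source file name is just a placeholder; the flags are the ones discussed later in this thread):

    ldc2 -O3 -release -flto=full \
         -defaultlib=phobos2-ldc-lto,druntime-ldc-lto \
         source/app.d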

-- 
/Jacob Carlborg
March 14, 2020
On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
> On 2020-03-12 13:59, Pavel Shkadzko wrote:
>> I have done several benchmarks against NumPy for various 2D matrix operations. The purpose was mere curiosity and to spread the word about the Mir D library among the office data engineers.
>> Since I am not a D expert, I would be happy if someone could take a second look and double check.
>> 
>> https://github.com/tastyminerals/mir_benchmarks
>> 
>> Compile and run the project via: dub run --compiler=ldc --build=release
>
> Have you tried to compile with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.

The problem is that NumPy uses its own build of OpenBLAS, which is multithreaded, including for Level 1 BLAS operations like the L2 norm and the dot product, while the D code is single-threaded.
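
Just to illustrate the difference (this is not code from the benchmark repository), a minimal sketch of a multithreaded dot product in D using std.parallelism could look like the following; the chunk size and names are arbitrary:

    import std.algorithm : map;
    import std.numeric : dotProduct;
    import std.parallelism : taskPool;
    import std.range : iota;

    double parallelDot(const(double)[] a, const(double)[] b)
    {
        assert(a.length == b.length);
        enum size_t chunk = 64 * 1024;
        // Each worker computes the dot product of one chunk;
        // taskPool.reduce then sums the partial results across threads.
        auto partials = iota(size_t(0), a.length, chunk).map!((size_t i) {
            immutable end = i + chunk < a.length ? i + chunk : a.length;
            return dotProduct(a[i .. end], b[i .. end]);
        });
        return taskPool.reduce!"a + b"(0.0, partials);
    }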
March 15, 2020
On Saturday, 14 March 2020 at 09:34:55 UTC, 9il wrote:
> On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
>> On 2020-03-12 13:59, Pavel Shkadzko wrote:
>>> [...]
>>
>> Have you tried to compile with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
>
> The problem is that NumPy uses its own build of OpenBLAS, which is multithreaded, including for Level 1 BLAS operations like the L2 norm and the dot product, while the D code is single-threaded.

My version of NumPy is installed with Anaconda, and it looks like the Anaconda NumPy package comes with the MKL libraries.

I have updated the benchmarks to distinguish single-threaded and multithreaded runs.
March 15, 2020
On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
> On 2020-03-12 13:59, Pavel Shkadzko wrote:
>> I have done several benchmarks against NumPy for various 2D matrix operations. The purpose was mere curiosity and to spread the word about the Mir D library among the office data engineers.
>> Since I am not a D expert, I would be happy if someone could take a second look and double check.
>> 
>> https://github.com/tastyminerals/mir_benchmarks
>> 
>> Compile and run the project via: dub run --compiler=ldc --build=release
>
> Have you tried to compile with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.

If adding dflags-ldc: ["-flto=full"] to dub.json is enough for LTO, then it doesn't improve anything.
For PGO, I am a bit confused about how to use it with dub -- is dflags-ldc: ["-O3"] enough? It compiles, but I see no difference. By default, ldc2 should already be using -O2, which gives good optimizations.
March 15, 2020
On Sunday, 15 March 2020 at 12:13:39 UTC, Pavel Shkadzko wrote:
> On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
>> On 2020-03-12 13:59, Pavel Shkadzko wrote:
>>> I have done several benchmarks against NumPy for various 2D matrix operations. The purpose was mere curiosity and to spread the word about the Mir D library among the office data engineers.
>>> Since I am not a D expert, I would be happy if someone could take a second look and double check.
>>> 
>>> https://github.com/tastyminerals/mir_benchmarks
>>> 
>>> Compile and run the project via: dub run --compiler=ldc --build=release
>>
>> Have you tried to compile with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
>
> If adding dflags-ldc: ["-flto=full"] to dub.json is enough for LTO, then it doesn't improve anything.

Try:
    "dflags-ldc" : ["-flto=thin", "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj" ]

The "-defaultlib=..." parameter engages LTO for phobos and druntime. You can also use "-flto=full" rather than "thin". I've had good results with "thin". Not sure if the "-singleobj" parameter helps.

> For PGO, I am a bit confused about how to use it with dub -- is dflags-ldc: ["-O3"] enough? It compiles, but I see no difference. By default, ldc2 should already be using -O2, which gives good optimizations.

PGO (profile-guided optimization) is a multi-step process. The first step is to create an instrumented build (-fprofile-instr-generate). The second step is to run the instrumented binary on a representative workload. The last step is to use the resulting profile data in the final build (-fprofile-instr-use).
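
Roughly, as plain ldc2 commands (file names are placeholders; see the blog post below for the details):

    # 1. Instrumented build.
    ldc2 -O3 -release -fprofile-instr-generate=profile.raw app.d -of=app_instrumented

    # 2. Run on a representative workload, then merge the raw profile.
    ./app_instrumented
    ldc-profdata merge -output=profile.data profile.raw

    # 3. Final build using the collected profile.
    ldc2 -O3 -release -fprofile-instr-use=profile.data app.d -of=app_optimized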

For information on PGO see Johan Engelen's blog page: https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

I have done studies on LTO and PGO and found both beneficial, often significantly. The largest gains came in tight loops that included code pulled from libraries (e.g. Phobos, druntime). It was hard to predict which code was going to benefit from LTO/PGO.

I've found it tricky to use dub for the full PGO process (creating the instrumented build, generating the profile data, and using it in the final build). Mostly I've used make for this. I did get it to work in a simple performance test app: https://github.com/jondegenhardt/dcat-perf. It doesn't document how the PGO steps work, but its dub.json file is relatively short, and the repository README.md contains the build instructions for both LTO and LTO plus PGO.

--Jon
March 16, 2020
On Sunday, 15 March 2020 at 20:15:07 UTC, Jon Degenhardt wrote:
> On Sunday, 15 March 2020 at 12:13:39 UTC, Pavel Shkadzko wrote:
>> [...]
>
> Try:
>     "dflags-ldc" : ["-flto=thin", "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj" ]
>
> [...]

LTO and PGO are useless for this kind of stuff: there is nothing to inline, because the code is too simple and generic. There is nothing to apply this technology to.