May 24, 2020
On Sunday, 24 May 2020 at 05:39:30 UTC, data pulverizer wrote:
> On Friday, 22 May 2020 at 14:13:50 UTC, kinke wrote:
>> ... and secondly use -fsave-optimization-record to inspect LLVM's optimization remarks (e.g., why a loop isn't auto-vectorized etc.).
>
> By this are you saying that SIMD happens automatically with the `-mcpu=native` flag?
>
I've just tried it and the times are faster just by adding the flag; the times for the largest data set are very close to Julia's, sometimes a little faster, sometimes a little slower.

I have extended the number of kernel functions to 9, and the full benchmark in Julia is taking 2 hours (I may have to re-run it since I made some more code changes). I've added the new kernels (locally) to D and am updating the script now. I also need to do the same for Chapel. Once I run them all I'll update the article.

From what I can see now, D wins for all but the largest data set, but with the new flag it's so close to Julia's that it will be a "photo finish". I might have to run the largest data size 100 times; I'll just pick one kernel (probably dot product) for that, but it will take AGES! I'll have to look at maybe running it on a cloud instance rather than locally. Populating the arrays is what takes the longest time, so I'll probably do that using parallel threads (see the sketch below). This is getting interesting!
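Something along these lines should work for the parallel fill, using std.parallelism (a rough sketch with made-up names, not the benchmark's actual code):

```d
import std.parallelism : parallel;
import std.random : Random, uniform, unpredictableSeed;
import std.range : iota;

// Fill a flat row-major (nrows x ncols) buffer in parallel.
// Each row gets its own RNG so threads do not share random state.
float[] fillRandom(size_t nrows, size_t ncols)
{
    auto data = new float[nrows * ncols];
    foreach (i; parallel(iota(nrows)))
    {
        auto rng = Random(unpredictableSeed);
        foreach (j; 0 .. ncols)
            data[i * ncols + j] = uniform(0.0f, 1.0f, rng);
    }
    return data;
}
```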

May 24, 2020
On Sunday, 24 May 2020 at 05:39:30 UTC, data pulverizer wrote:
> By this are you saying that SIMD happens automatically with the `-mcpu=native` flag?
By default the compiler is conservative and only emits instructions that all CPUs can run. For 32-bit executables you are almost never going to get SSE instructions, because the stack is not guaranteed to be 16-byte aligned and it's not easy to prove that memory accesses are aligned and that pointers do not alias.
When you tell the compiler to generate code for a specific CPU architecture (-mcpu) it can apply optimizations specific to that CPU, and that includes SIMD instruction generation.
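As a concrete illustration (a made-up example, not from the benchmark), a plain scalar loop like this needs no source changes to benefit; with optimizations and -mcpu=native enabled, LDC's auto-vectorizer is free to emit SSE/AVX for it:

```d
// Build with e.g.: ldc2 -O2 -mcpu=native -output-s saxpy.d
// then inspect the .s file (or use -fsave-optimization-record)
// to see whether the loop was vectorized.
void saxpy(float[] y, const float[] x, float a)
{
    assert(x.length == y.length);
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}
```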
May 24, 2020
On Sunday, 24 May 2020 at 12:12:09 UTC, welkam wrote:
> On Sunday, 24 May 2020 at 05:39:30 UTC, data pulverizer wrote:
>> By this are you saying that SIMD happens automatically with the `-mcpu=native` flag?
> When you tell the compiler to generate code for a specific CPU architecture (-mcpu) it can apply optimizations specific to that CPU, and that includes SIMD instruction generation.

My CPU is Coffee Lake (Intel i9-8950HK), which is not listed under `--mcpu=help` but has AVX2 instructions. I tried `--mcpu=core-avx2 -mattr=+avx2,+sse4.1,+sse4.2` and got the same improved performance as when using `--mcpu=native`. Am I correct in assuming that `core-avx2` is right for my CPU?

Thanks
May 25, 2020
On Sunday, 24 May 2020 at 16:51:37 UTC, data pulverizer wrote:
> My CPU is Coffee Lake (Intel i9-8950HK), which is not listed under `--mcpu=help`

Just use --mcpu=native. The compiler will detect your CPU and set the correct flags for you. If you want to manually specify the architecture then look here:
https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support

> I tried `--mcpu=core-avx2 -mattr=+avx2,+sse4.1,+sse4.2` and got the same improved performance as when using `--mcpu=native`. Am I correct in assuming that `core-avx2` is right for my CPU?

Those flags are for fine-grained control. If you have to ask about them, that means you should not use them; I would have to Google to answer your question. When you use --mcpu=native all the appropriate flags are set for you, so you don't have to worry about them.

For a data scientist, here is a list of flags that you should be using, in order of importance (a sample invocation is sketched after the list):
--O2 (Turning on optimizations is good)
--mcpu=native (allows the compiler to use newer instructions and enables architecture-specific optimizations. Just don't share the binaries, because they might crash on older CPUs)
--O3 (less important than mcpu and sometimes doesn't provide any speed improvement, so measure, measure, measure)
--flto=thin (link-time optimization. Good when using libraries.)
PGO (not a single flag, but profile-guided optimization can add a few % improvement on top of all the other flags)
http://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

--ffast-math (only useful for floating point (float, double). If you don't do math with those types then this flag does nothing)
--boundscheck=off (a D-specific flag. The majority of array bounds checks are removed by the compiler without this flag, but it's good to throw it in just to make sure. Don't use this flag in development, though, because bounds checking can catch bugs.)
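Putting those together (with app.d standing in for your actual source file), a release build would look something like:

```
ldc2 --O3 --mcpu=native --flto=thin --ffast-math --boundscheck=off app.d
```

For PGO, the linked article builds once with -fprofile-instr-generate to collect a profile, then rebuilds with -fprofile-instr-use pointing at the merged profile data.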


Reading your message, I get the impression that you assumed those newer instructions would improve performance. When it comes to performance, never assume anything; always profile before making judgments. Maybe your CPU is limited by memory bandwidth, if you only have one stick of RAM and you use all 6 cores.

Anyway, I looked at the disassembly of one function and it's mostly SSE instructions with one AVX instruction. That function is:
arrays.Matrix!(float).Matrix kernelmatrix.calculateKernelMatrix!(kernelmatrix.DotProduct!(float).DotProduct, float).calculateKernelMatrix(kernelmatrix.DotProduct!(float).DotProduct, arrays.Matrix!(float).Matrix)

For SIMD work, D has specific vector types. I believe the compiler guarantees that they are properly aligned, but it's not stated in the docs.
https://dlang.org/spec/simd.html
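For instance, something like this (a minimal sketch; `float4` from core.simd is only defined on targets with the matching SIMD support):

```d
import core.simd;

// float4 maps to a 128-bit vector register; arithmetic on it is
// element-wise and compiles directly to SSE/AVX instructions.
float4 madd(float4 acc, float4 a, float4 b)
{
    return acc + a * b;
}
```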

I have zero experience writing SIMD code, but what I have heard over the years is that if you want to get maximum performance from your CPU you have to write your kernels with SIMD intrinsics.
May 29, 2020
On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
> Hi,
>
> this article grew out of a Dlang Learn thread (https://forum.dlang.org/thread/motdqixwsqmabzkdoslp@forum.dlang.org). It looks at Kernel Matrix Calculations in Chapel, D, and Julia and has a more general discussion of all three languages. Comments welcome.
>
> https://github.com/dataPulverizer/KernelMatrixBenchmark
>
> Thanks

An update of the article is available.

Thanks