SIMD on Windows (page 3)

> Well, but judging from the assembly it generates, it could be even faster. What exactly is pfft? Does it use dmd's __simd intrinsics?
> Or does it only do primitive operations (* / - +) on simd types?

It's an FFT implementation. It does most of the work using +, - and *. There's one part of the algorithm that uses mostly shufps, and that part takes about 10% of the time (for sizes around 2 ^^ 10 when using SSE).

On 22.06.2013 02:07, Manu wrote:
> It would certainly be nice in Win32, but I tend to think Win32 COFF
> should be much higher priority.

I have removed the dust from these patches and pushed them successfully through the test suite and unittests:

https://github.com/rainers/dmd/tree/coff32
https://github.com/rainers/druntime/tree/coff32
https://github.com/rainers/phobos/tree/coff32

Compile dmd as usual, but druntime and phobos with something like

druntime:
  make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
phobos:
  make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"

COFF32 files are generated when -m32ms is used on the command line.

If you put the resulting libraries into the lib folder, using a standard installation of VS2010 might work, but I recommend adding a new section to sc.ini and adjust paths there. Mine looks like this:

[Environment32ms]
PATH=c:\l\vs9\Common7\IDE;%PATH%
LIB="%@P%\..\..\lib32";c:\l\vs9\vc\lib;c:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib"
DFLAGS=%DFLAGS% -L/nologo -L/INCREMENTAL:NO
LINKCMD=c:\l\vs9\vc\bin\link.exe

BTW: I also found some bugs in the Win64 along the way, I'll create pull requests for these.

I've said it before, but this man is a genius! :)

On 23 June 2013 23:33, Rainer Schuetze <r.sagitario@gmx.de> wrote:
> On 22.06.2013 02:07, Manu wrote:
>> It would certainly be nice in Win32, but I tend to think Win32 COFF should be much higher priority.
>
> I have removed the dust from these patches and pushed them successfully through the test suite and unittests:
>
> https://github.com/rainers/dmd/tree/coff32
> https://github.com/rainers/druntime/tree/coff32
> https://github.com/rainers/phobos/tree/coff32
>
> Compile dmd as usual, but druntime and phobos with something like
>
> druntime:
>   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
> phobos:
>   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"
>
> COFF32 files are generated when -m32ms is used on the command line.
>
> If you put the resulting libraries into the lib folder, using a standard installation of VS2010 might work, but I recommend adding a new section to sc.ini and adjust paths there. Mine looks like this:
>
> [Environment32ms]
> PATH=c:\l\vs9\Common7\IDE;%PATH%
> LIB="%@P%\..\..\lib32";c:\l\vs9\vc\lib;c:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib"
> DFLAGS=%DFLAGS% -L/nologo -L/INCREMENTAL:NO
> LINKCMD=c:\l\vs9\vc\bin\link.exe
>
> BTW: I also found some bugs in the Win64 along the way, I'll create pull requests for these.

On 2013-06-23 15:33, Rainer Schuetze wrote:
> I have removed the dust from these patches and pushed them successfully
> through the test suite and unittests:
>
> https://github.com/rainers/dmd/tree/coff32
> https://github.com/rainers/druntime/tree/coff32
> https://github.com/rainers/phobos/tree/coff32
>
> Compile dmd as usual, but druntime and phobos with something like
>
> druntime:
>   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
> phobos:
>   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"
>
> COFF32 files are generated when -m32ms is used on the command line.

So, you have implemented support for COFF 32bit? How long have you been hiding this :) Although I'm not a Windows user I consider it great news.

--
/Jacob Carlborg

On 23.06.2013 15:33, Rainer Schuetze wrote:
> BTW: I also found some bugs in the Win64 along the way, I'll create pull
> requests for these.

https://github.com/D-Programming-Language/dmd/pull/2253
https://github.com/D-Programming-Language/dmd/pull/2254

On 23.06.2013 20:24, Jacob Carlborg wrote:
> On 2013-06-23 15:33, Rainer Schuetze wrote:
>
>> COFF32 files are generated when -m32ms is used on the command line.
>
> So, you have implemented support for COFF 32bit? How long have you been
> hiding this :) Although I'm not a Windows user I consider it great news.

I experimented with it a few times, but it didn't work well enough until this weekend.

On 23.06.2013 21:55, Michael wrote:
> Cool)))
>
> Any chances to see it [coff32] in official build?

Let's see if Walter approves. There is one possibly disruptive change: with two different C runtimes available for Win32, versioning on Win32/Win64 no longer works. I added the versions CRuntime_DigitalMars and CRuntime_Microsoft (and CRuntime_GNU for anything else), and adapting to them accounts for most of the changes in druntime and phobos.
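A minimal sketch of what versioning on the C runtime instead of on Win32/Win64 might look like in user code. The version identifiers are the ones named in the post; whether a given compiler predefines them depends on the compiler build, and the `cRuntime` constant is a hypothetical name used only for illustration.

```d
import std.stdio : writeln;

// Select on the C runtime in use rather than on Win32/Win64.
// An identifier the compiler does not define simply tests as false.
version (CRuntime_Microsoft)
    enum cRuntime = "Microsoft";      // hypothetical label
else version (CRuntime_DigitalMars)
    enum cRuntime = "Digital Mars";   // hypothetical label
else
    enum cRuntime = "other";          // hypothetical label

void main()
{
    writeln("C runtime: ", cRuntime);
}
```

Code that previously assumed "Win32 means the Digital Mars runtime" would move its runtime-specific parts into branches like these.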

Alright, I'm now officially building for Windows x64 (amd64). I've created this early benchmark http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As you can see below, on my machine there is almost zero difference. Am I missing something?

//===SIMD===
0 1.#INF 5 1.#INF <-- vector result
hnsecs: 100006 <-- duration time
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 90006
//===SCALAR===
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 100005
0 1.#INF 5 1.#INF
hnsecs: 100006

June 29, 2013

Re: SIMD on Windows

Posted by jerro
in reply to Jonathan Dunlap

On Saturday, 29 June 2013 at 14:39:44 UTC, Jonathan Dunlap wrote:
> Alright, I'm now officially building for Windows x64 (amd64). I've created this early benchmark http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As you can see below, on my machine there is almost zero difference. Am I missing something?
>
> //===SIMD===
> 0 1.#INF 5 1.#INF <-- vector result
> hnsecs: 100006 <-- duration time
> 0 1.#INF 5 1.#INF
> hnsecs: 90005
> 0 1.#INF 5 1.#INF
> hnsecs: 90006
> //===SCALAR===
> 0 1.#INF 5 1.#INF
> hnsecs: 90005
> 0 1.#INF 5 1.#INF
> hnsecs: 100005
> 0 1.#INF 5 1.#INF
> hnsecs: 100006

First of all, calcSIMD and calcScalar are virtual functions so they can't be inlined, which prevents any further optimization. It also seems that the fact that g, s, i and d are class fields and that g is a static array makes DMD load them from memory and store them back on every iteration even when calcSIMD and calcScalar are inlined.
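To illustrate the two points above, here is a rough sketch, not the benchmark itself. The class `Calc`, its field `g` and the method `scale` are hypothetical stand-ins. Marking the class final makes its methods non-virtual, and copying the field into a local is one possible way (an assumption on my part, not something measured here) to keep the loop from touching memory on every iteration.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the benchmark class.
final class Calc
{
    float[4] g = [1.0f, 2.0f, 3.0f, 4.0f];

    // Non-virtual because the class is final, so the compiler is
    // free to inline calls to it.
    void scale(float s)
    {
        // Work on a local copy and write the result back once,
        // instead of updating the class field element by element.
        float[4] local = g;
        foreach (ref x; local)
            x *= s;
        g = local;
    }
}

void main()
{
    auto c = new Calc;
    foreach (i; 0 .. 3)
        c.scale(2.0f);
    writeln(c.g); // every element has been doubled three times
}
```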

But even if I make the class final and build it with gdc -O3 -finline-functions -frelease -march=native (in which case GDC generates assembly that looks optimal to me), the scalar version is still a bit faster than the vector version. The main reason for that is that even with scalar code, the compiler can do multiple operations in parallel. So on Sandy Bridge CPUs, for example, floating point multiplication takes 5 cycles to complete, but the processor can do one multiplication per cycle. So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel.
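The latency versus throughput argument can be put into numbers with a very idealized model. This is a sketch under stated assumptions (one multiply issued per cycle, fixed latency, nothing else competing for the unit), not a description of how any real CPU schedules instructions. The function names are made up for the example.

```d
import std.stdio : writeln;

// Independent multiplies overlap in the pipeline: the last one is
// issued in cycle n and finishes latency - 1 cycles later.
int independentCycles(int n, int latency)
{
    return n + latency - 1;
}

// Dependent multiplies cannot overlap: each one has to wait for the
// result of the previous one.
int dependentCycles(int n, int latency)
{
    return n * latency;
}

void main()
{
    enum latency = 5; // the Sandy Bridge figure quoted above
    writeln("4 independent multiplies: ", independentCycles(4, latency), " cycles");
    writeln("4 dependent multiplies:   ", dependentCycles(4, latency), " cycles");
}
```

Under this model four independent scalar multiplies cost 8 cycles rather than 20, which is why scalar code can get close to a single vector multiply.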

That would explain the scalar code being as fast as the vector code, but not faster. The reason it is actually faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.
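A small sketch of that rewrite, assuming a scalar kernel with constant factors 1 and 2. The two function names are hypothetical and this is not gdc's actual output, only the source-level equivalent of what it appears to do. For IEEE floats both forms give the same result, so the compiler can drop the multiplies in the scalar version.

```d
import std.stdio : writeln;

// Scalar code as written: constant factors 1 and 2.
float[2] asWritten(float a, float b)
{
    return [a * 1.0f, b * 2.0f];
}

// What the optimizer may turn it into: no multiplication left.
float[2] asRewritten(float a, float b)
{
    return [a, b + b];
}

void main()
{
    foreach (v; [0.0f, 1.5f, -3.25f, 1e30f])
    {
        assert(asWritten(v, v) == asRewritten(v, v));
        writeln(asWritten(v, v), " == ", asRewritten(v, v));
    }
}
```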
