FFT in D (using SIMD) and benchmarks
January 25, 2012
Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here:

https://github.com/jerro/pfft

Because dmd currently doesn't have an intrinsic for the SHUFPS instruction, I've included a version block with some GDC-specific code (this gave me a speedup of up to 80%; see the sketch at the end of this post for what such a version block can look like). I've benchmarked the scalar and SSE versions of the code compiled with both DMD and GDC, and also the C++ code using SSE. The results are below. The left column is the base-two logarithm of the array size and the right column is GFLOPS, defined as the number of floating point operations that the most basic FFT algorithm would perform, divided by the time taken (the algorithm I used performs slightly fewer operations):

GFLOPS = 5 n log2(n) / (time for one FFT in nanoseconds)   (I took that definition from http://www.fftw.org/speed/ )
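
For concreteness, here is a tiny D sketch (mine, not the benchmark code from the repository) that turns a measured time into this figure:

import std.stdio : writefln;

// GFLOPS as defined above: 5 n log2(n) notional floating point operations,
// divided by the time for one transform in nanoseconds.
double gflops(uint log2n, double timeNs)
{
    immutable n = cast(double)(1UL << log2n);
    return 5.0 * n * log2n / timeNs;
}

void main()
{
    // Example: a 2^12-point transform taking about 18.2 microseconds comes
    // out at roughly the 13.5 GFLOPS reported for GDC SSE below.
    writefln("%.2f GFLOPS", gflops(12, 18_200.0));
}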

Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

Results:

GDC SSE:

2	0.833648
3	1.23383
4	6.92712
5	8.93348
6	10.9212
7	11.9306
8	12.5338
9	13.4025
10	13.5835
11	13.6992
12	13.493
13	12.7082
14	9.32621
15	9.15256
16	9.31431
17	8.38154
18	8.267
19	7.61852
20	7.14305
21	7.01786
22	6.58934

G++ SSE:

2	1.65933
3	1.96071
4	7.09683
5	9.66308
6	11.1498
7	11.9315
8	12.5712
9	13.4241
10	13.4907
11	13.6524
12	13.4215
13	12.6472
14	9.62755
15	9.24289
16	9.64412
17	8.88006
18	8.66819
19	8.28623
20	7.74581
21	7.6395
22	7.33506

GDC scalar:

2	0.808422
3	1.20835
4	2.66921
5	2.81166
6	2.99551
7	3.26423
8	3.61477
9	3.90741
10	4.04009
11	4.20405
12	4.21491
13	4.30896
14	3.79835
15	3.80497
16	3.94784
17	3.98417
18	3.58506
19	3.33992
20	3.42309
21	3.21923
22	3.25673

DMD SSE:

2	0.497946
3	0.773551
4	3.79912
5	3.78027
6	3.85155
7	4.06491
8	4.30895
9	4.53038
10	4.61006
11	4.82098
12	4.7455
13	4.85332
14	3.37768
15	3.44962
16	3.54049
17	3.40236
18	3.47339
19	3.40212
20	3.15997
21	3.32644
22	3.22767

DMD scalar:

2	0.478998
3	0.772341
4	1.6106
5	1.68516
6	1.7083
7	1.70625
8	1.68684
9	1.66931
10	1.66125
11	1.63756
12	1.61885
13	1.60459
14	1.402
15	1.39665
16	1.37894
17	1.36306
18	1.27189
19	1.21033
20	1.25719
21	1.21315
22	1.21606

SIMD gives between a 2x and 3.5x speedup for GDC-compiled code, and between 2.5x and 3x for DMD. Code compiled with GDC is only a little slower than G++ (and only for some values of n), which is really nice.
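
For readers wondering what the GDC-specific version block mentioned at the top can look like, here is a minimal sketch (mine, not the actual code from the repository). It assumes GDC exposes GCC's __builtin_ia32_shufps through the gcc.builtins module and that it accepts core.simd's float4 directly; the helper name and the scalar fallback are my own.

import core.simd;

version (GNU)
    import gcc.builtins : __builtin_ia32_shufps;

// mask is the usual SHUFPS immediate: the two low result lanes are taken
// from a, the two high ones from b, each selected by a 2-bit index.
float4 shufps(ubyte mask)(float4 a, float4 b)
{
    version (GNU)
    {
        return __builtin_ia32_shufps(a, b, mask);
    }
    else
    {
        // Scalar fallback for compilers without the builtin.
        float4 r;
        r.array[0] = a.array[ mask       & 3];
        r.array[1] = a.array[(mask >> 2) & 3];
        r.array[2] = b.array[(mask >> 4) & 3];
        r.array[3] = b.array[(mask >> 6) & 3];
        return r;
    }
}
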
January 25, 2012
On 25 January 2012 00:04, a <a@a.com> wrote:
> Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here:
>
> https://github.com/jerro/pfft
>
> [benchmark results snipped; see the first post]


That is quite interesting, and really cool at the same time.  :)

Did you run into any issues with GDC's implementation of vectors?

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
January 25, 2012
> Did you run into any issues with GDC's implementation of vectors?

No, GDC vectors worked fine. It was a pretty straightforward port from the C++ version, which used intrinsics from emmintrin.h.
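
To illustrate how direct the mapping is, here is a sketch (mine, not code from the repository): an SSE multiply-add written with the C intrinsics and the same thing with D's vector types, assuming a compiler and target where core.simd's float4 arithmetic operators are available.

// C++ with emmintrin.h / xmmintrin.h:
//     __m128 madd(__m128 a, __m128 b, __m128 c)
//     {
//         return _mm_add_ps(_mm_mul_ps(a, b), c);
//     }

// The same operation in D with core.simd:
import core.simd;

float4 madd(float4 a, float4 b, float4 c)
{
    return a * b + c;   // element-wise, lowered to MULPS and ADDPS
}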
January 25, 2012
a:

> Because dmd currently doesn't have an intrinsic for the SHUFPS instruction I've included a version block with some GDC specific code (this gave me a speedup of up to 80%).

It seems an instruction worth having in dmd too.


> Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

I know your code is relatively simple, so it's not meant to be the fastest around, but in your nice graph _as a reference point_ I'd like to see a line for FFTW too. Such a line would show us how close or how far all this is from industry-standard performance.
(And if possible I'd like to see two lines for the LDC2 compiler too.)

Bye,
bearophile
January 25, 2012
On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
> I know your code is relatively simple, so it's not meant to be the fastest around, but in your nice graph _as a reference point_ I'd like to see a line for FFTW too. Such a line would show us how close or how far all this is from industry-standard performance.
> (And if possible I'd like to see two lines for the LDC2 compiler too.)

"bench" program in the fftw test directory gives this when run in a loop:


2	Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7
3	Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3
4	Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9
5	Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108
6	Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225
7	Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717
8	Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501
9	Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958
10	Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 20690
11	Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 19693
12	Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 18622
13	Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 14784
14	Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 13913
15	Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 12659
16	Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 12409
17	Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 11200
18	Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 10338
19	Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 5243.4
20	Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 5102.8
21	Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 5135.2
22	Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 5209.8

This was with FFTW compiled for single precision and with SSE, but without AVX support. When I compiled FFTW with AVX support, the peak was at about 30 GFLOPS, IIRC. It is possible that it would be even faster if I configured it differently. The C++ version of my FFT also supports AVX and gets to about 24 GFLOPS when using it. If AVX types are added to D, I will port that part too.
January 25, 2012
On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
> [...] in your nice graph _as a reference point_ I'd like to see a line for FFTW too. [...]
> (And if possible I'd like to see two lines for the LDC2 compiler too.)

I have updated the graph now.
January 25, 2012
a:

> I have updated the graph now.

Thank you, it's a very nice graph. The FFTW reference lines help. From how much AVX improves the situation, I can see that Intel engineers know their work; I didn't know AVX could bring such a performance improvement (in carefully written code).


> The C++ version of my FFT also supports AVX and gets to about 24 GFLOPS when using it.

So your FFT code is rather good, despite being so much simpler and shorter than the FFTW code :-) It seems like a good candidate to replace Phobos' std.numeric.fft(), if you ever want to donate it to Phobos.


> If AVX types will be added to D, I will port that part too.

Walter has added support for YMM registers in D/DMD too, so I presume AVX1 instructions are quite an option and that we will get them as well. But ask Walter for more details on this.

And hopefully we'll see good implementations of this algorithm too :-) http://web.mit.edu/newsoffice/2012/faster-fourier-transforms-0118.html?tmpl=component&print=1

Bye,
bearophile
January 25, 2012
If you want to, you can take a look at the link in this forum post: http://encode.ru/threads/1461-New-Fast-Fourier-Transform-Algorithm?highlight=FFT
It looks like MIT has a new FFT algorithm.
January 25, 2012
On Wed, 25 Jan 2012 05:58:52 -0600, Marco Leise <Marco.Leise@gmx.de> wrote:
> If you want to, you can take a look at the link in this forum post:
> http://encode.ru/threads/1461-New-Fast-Fourier-Transform-Algorithm?highlight=FFT
> It looks like MIT has a new FFT algorithm.

For those who want the short version: MIT has developed a faster method of computing a sparse FFT. Since many applications (like compression) work by throwing away the 'little' frequencies, those applications can get a practical speed-up of ~10x, depending on the compression level, by switching from the FFT's O(n log n) algorithm to MIT's new O(k log n) algorithm. The new algorithm is still slower when all n frequencies are needed, so it's not going to replace all FFT usage, just lossy FFT usage.
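
A rough back-of-the-envelope illustration of that tradeoff (my own numbers, ignoring constant factors):

n = 2^20 samples, keeping only the k = 2^16 largest frequencies:
  full FFT:    n log2 n = 2^20 * 20 ~= 21 million operations
  sparse FFT:  k log2 n = 2^16 * 20 ~= 1.3 million operations

That is roughly a 16x reduction on paper, before the new algorithm's larger constant factors eat into it.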
January 25, 2012
Can you paste disassemblies of the GDC code and the G++ code?
I imagine there's something trivial in the scheduler that GDC has missed.
Like the other day I noticed GDC was unnecessarily generating a stack frame
for leaf functions, which Iain already fixed.

I'd also be interested to try out my experimental std.simd (portable) library in the context of your FFT... might give that a shot, I think it'll work well.


On 25 January 2012 02:04, a <a@a.com> wrote:

> Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here:
>
> https://github.com/jerro/pfft
>
> [benchmark results snipped; see the first post]

