Can you paste disassemblies of the GDC code and the G++ code?
I imagine there's something trivial in the scheduler that GDC has missed. For example, the other day I noticed GDC was unnecessarily generating a stack frame for leaf functions, which Iain has already fixed.

I'd also be interested to try out my experimental std.simd (portable) library in the context of your FFT... might give that a shot, I think it'll work well.


On 25 January 2012 02:04, a <a@a.com> wrote:
Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here:

https://github.com/jerro/pfft

Because DMD currently doesn't have an intrinsic for the SHUFPS instruction, I've included a version block with some GDC-specific code (this gave me a speedup of up to 80%). I've benchmarked the scalar and SSE versions of the code compiled with both DMD and GDC, and also the C++ code using SSE. The results are below. The left column is the base-two logarithm of the array size and the right column is GFLOPS, defined as the number of floating point operations that the most basic FFT algorithm would perform divided by the time taken (the algorithm I used performs slightly fewer operations):

GFLOPS = 5 n log2(n) / (time for one FFT in nanoseconds)   (I took that definition from http://www.fftw.org/speed/ )
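
For anyone who wants to reproduce the numbers, here is a minimal C++ sketch of that definition. It is not taken from pfft; the function names nominal_flops and gflops are made up for illustration, and the only assumption is that the array size is n = 2^log2n, matching the left column of the tables below.

```cpp
#include <cassert>
#include <cmath>

// Nominal operation count of the most basic FFT, 5 n log2(n),
// for an array of n = 2^log2n elements.
// (Hypothetical helper, not part of pfft.)
double nominal_flops(int log2n) {
    assert(log2n >= 0);
    const double n = std::ldexp(1.0, log2n);  // exactly 2^log2n
    return 5.0 * n * log2n;
}

// GFLOPS = nominal operations / time for one FFT in nanoseconds.
// Operations per nanosecond is the same as 1e9 operations per second,
// so no extra scaling factor is needed.
double gflops(int log2n, double time_ns) {
    assert(time_ns > 0.0);
    return nominal_flops(log2n) / time_ns;
}
```

The time passed in would normally be the average over many repetitions of the same transform, since a single small FFT is too short to time reliably.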

Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

Results:

GDC SSE:

2       0.833648
3       1.23383
4       6.92712
5       8.93348
6       10.9212
7       11.9306
8       12.5338
9       13.4025
10      13.5835
11      13.6992
12      13.493
13      12.7082
14      9.32621
15      9.15256
16      9.31431
17      8.38154
18      8.267
19      7.61852
20      7.14305
21      7.01786
22      6.58934

G++ SSE:

2       1.65933
3       1.96071
4       7.09683
5       9.66308
6       11.1498
7       11.9315
8       12.5712
9       13.4241
10      13.4907
11      13.6524
12      13.4215
13      12.6472
14      9.62755
15      9.24289
16      9.64412
17      8.88006
18      8.66819
19      8.28623
20      7.74581
21      7.6395
22      7.33506

GDC scalar:

2       0.808422
3       1.20835
4       2.66921
5       2.81166
6       2.99551
7       3.26423
8       3.61477
9       3.90741
10      4.04009
11      4.20405
12      4.21491
13      4.30896
14      3.79835
15      3.80497
16      3.94784
17      3.98417
18      3.58506
19      3.33992
20      3.42309
21      3.21923
22      3.25673

DMD SSE:

2       0.497946
3       0.773551
4       3.79912
5       3.78027
6       3.85155
7       4.06491
8       4.30895
9       4.53038
10      4.61006
11      4.82098
12      4.7455
13      4.85332
14      3.37768
15      3.44962
16      3.54049
17      3.40236
18      3.47339
19      3.40212
20      3.15997
21      3.32644
22      3.22767

DMD scalar:

2       0.478998
3       0.772341
4       1.6106
5       1.68516
6       1.7083
7       1.70625
8       1.68684
9       1.66931
10      1.66125
11      1.63756
12      1.61885
13      1.60459
14      1.402
15      1.39665
16      1.37894
17      1.36306
18      1.27189
19      1.21033
20      1.25719
21      1.21315
22      1.21606

For all but the two smallest sizes, SIMD gives a speedup of roughly 2x to 3.5x for GDC-compiled code and roughly 2.5x to 3x for DMD. Code compiled with GDC is just a little bit slower than G++ (and only for some values of n), which is really nice.
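
The speedup figures are just the row-by-row ratio of the SSE column to the scalar column. A minimal C++ sketch of that comparison follows; speedup_range is a made-up helper name, and it assumes both columns list the same sizes in the same order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns {smallest, largest} value of simd[i] / scalar[i] over all rows.
// (Hypothetical helper for illustration only.)
std::pair<double, double> speedup_range(const std::vector<double>& simd,
                                        const std::vector<double>& scalar) {
    assert(!simd.empty() && simd.size() == scalar.size());
    double lo = simd[0] / scalar[0];
    double hi = lo;
    for (std::size_t i = 1; i < simd.size(); ++i) {
        const double ratio = simd[i] / scalar[i];
        lo = std::min(lo, ratio);
        hi = std::max(hi, ratio);
    }
    return {lo, hi};
}
```

Feeding it the GDC SSE and GDC scalar rows for sizes 4, 9 and 22 from the tables above, for example, gives a range that falls inside the 2x to 3.5x band.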