August 02, 2010 [phobos] FFT
Posted in reply to David Simcha

Can we use what the best FFTs provide as a guide to what the API should cover? I think it's fine if it's something we can grow into, I just don't want to keep doing what we've done too often already - chuck the old one and break everyone's code.
David Simcha wrote:
> Well then we would need to define an API for matrices or something to define how multidimensional transforms are going to work. In dstats, I've just been using ranges of ranges or tuples of ranges because it's available and works reasonably well, but I haven't really thought in detail about whether this is the optimal solution.
>
> Also, an N-D transform is apparently trivially implementable in terms of a 1-D transform, so I don't know whether an API for this would be a must-have.
>
> On 8/2/2010 9:18 PM, Walter Bright wrote:
>>
>>
>> Walter Bright wrote:
>>>
>>> The most important thing for a Phobos implementation of FFT (or any other module) is getting the API right. Having that right means we can always swap it out for a better implementation without breaking user code.
>>>
>>
>> For FFT this would mean looking at the best FFT implementations out
>> there to see what their API is. Phobos' should support a full feature
>> set, even if for now the implementation will throw exceptions for
>> things it doesn't support yet.
August 02, 2010 [phobos] FFT
Posted in reply to Walter Bright

I guess my response to this is that, IMHO, the ridiculously simple API I have now is an absolute must-have at least as a "simple things simple" subset of a richer API anyhow, so I don't see this as a problem. For example, if we were to get fancier and start having plan objects, support N-D transforms, etc., I would still insist that you could get reasonable results in a single line by simply doing fft(myRange); without explicitly creating a plan, specifying a dimensionality, or anything else a more complex API might allow.
Furthermore, I'm thinking about how N-D FFTs would be supported, and I figure the best way would probably be to just introspect the object at compile time anyhow. Right now, anything but a range of numeric types or a range of Complex doesn't compile. This would be expanded to allow matrices and do what the user means, or allow ranges of ranges and do what the user means, or allow ranges of ranges of ranges of ranges (4-D transform) or whatever.
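To make the introspection idea concrete, here is one possible shape for it (`fftRank`, `fft1d`, and `fft2d` are invented names; the real dispatch could differ). The 2-D case also shows the row-column scheme by which an N-D transform reduces to 1-D transforms:

```d
import std.complex;
import std.range;
import std.traits;

// How many levels of "range of ..." deep is T?  0 for a bare number or
// Complex, 1 for a plain range (1-D), 2 for a range of ranges (2-D), ...
template fftRank(T)
{
    static if (isForwardRange!T)
        enum fftRank = 1 + fftRank!(ElementType!T);
    else static if (isNumeric!T || is(Unqual!T == Complex!double))
        enum fftRank = 0;
    else
        enum fftRank = -1;   // not something we know how to transform
}

static assert(fftRank!(double[])   == 1);   // 1-D
static assert(fftRank!(double[][]) == 2);   // 2-D, "range of ranges"

// Stand-in for the real 1-D transform (identity placeholder; the point
// here is the structure, not the butterflies).
Complex!double[] fft1d(R)(R range)
    if (isForwardRange!R)
{
    import std.algorithm : map;
    import std.array : array;
    static if (is(Unqual!(ElementType!R) == Complex!double))
        return range.array;
    else
        return range.map!(x => complex(cast(double) x, 0.0)).array;
}

// "N-D in terms of 1-D": a 2-D transform is a 1-D FFT over every row,
// then a 1-D FFT over every column of the intermediate result.
Complex!double[][] fft2d(double[][] matrix)
{
    import std.algorithm : map;
    import std.array : array;

    auto rows = matrix.map!(row => fft1d(row)).array;    // transform rows

    auto result = new Complex!double[][](rows.length, rows[0].length);
    foreach (j; 0 .. rows[0].length)                      // then columns
    {
        auto transformed = fft1d(rows.map!(r => r[j]).array);
        foreach (i, v; transformed)
            result[i][j] = v;
    }
    return result;
}
```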
On 8/2/2010 10:20 PM, Walter Bright wrote:
> Can we use what the best FFTs provide as a guide to what the API should cover? I think it's fine if it's something we can grow into, I just don't want to keep doing what we've done too often already - chuck the old one and break everyone's code.
>
> David Simcha wrote:
>> Well then we would need to define an API for matrices or something to define how multidimensional transforms are going to work. In dstats, I've just been using ranges of ranges or tuples of ranges because it's available and works reasonably well, but I haven't really thought in detail about whether this is the optimal solution.
>>
>> Also, an N-D transform is apparently trivially implementable in terms of a 1-D transform, so I don't know whether an API for this would be a must-have.
>>
>> On 8/2/2010 9:18 PM, Walter Bright wrote:
>>>
>>>
>>> Walter Bright wrote:
>>>>
>>>> The most important thing for a Phobos implementation of FFT (or any other module) is getting the API right. Having that right means we can always swap it out for a better implementation without breaking user code.
>>>>
>>>
>>> For FFT this would mean looking at the best FFT implementations out
>>> there to see what their API is. Phobos' should support a full
>>> feature set, even if for now the implementation will throw
>>> exceptions for things it doesn't support yet.
August 02, 2010 [phobos] FFT
Posted in reply to Don Clugston

I've figured out that a major bottleneck is actually the computation of sine and cosine values. I had realized this to some degree and pre-computed the values for FFTs up to size 4096 at compile time. However, with huge lookup tables (apparently bigger than a CPU's L2 cache), accessing the lookup table causes too many cache misses and is slower than computing the values. On my machine, just computing 2^^20 sine and cosine values once takes longer than FFTW does to produce a full FFT. I therefore cannot even begin to imagine what tricks FFTW uses under the hood to get around this, unless it's doing something very clever to store them all in its plan object, and then somehow accessing them in an extremely cache-efficient manner.
It seems like this is a place where cache awareness could sneak into the design. The values could be pre-computed via a static this at runtime and stored in an immutable array, and the number of values to compute could be decided on the fly based on whatever will fit in L2 cache.
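A minimal sketch of that idea, assuming `core.cpuid.dataCaches` reports cache sizes in kilobytes; the table name, layout, and the "use half the largest data cache" policy are arbitrary choices made up for illustration:

```d
import core.cpuid : dataCaches;
import std.exception : assumeUnique;
import std.math : cos, sin, PI;

// Twiddle-factor table (cos/sin interleaved), built once at startup.
immutable float[] twiddleTable;

shared static this()
{
    // Size the table to the last-level data cache core.cpuid reports
    // (CacheInfo.size is in kilobytes); fall back to 256 KB otherwise.
    size_t cacheBytes = 256 * 1024;
    auto caches = dataCaches();
    if (caches.length)
        cacheBytes = caches[$ - 1].size * 1024;

    // Two floats (cos, sin) per entry; leave half the cache for data.
    immutable maxEntries = cacheBytes / (2 * float.sizeof) / 2;
    size_t n = 1;
    while (n * 2 <= maxEntries)
        n *= 2;

    auto table = new float[2 * n];
    foreach (i; 0 .. n)
    {
        immutable theta = -2.0 * PI * i / n;
        table[2 * i]     = cast(float) cos(theta);
        table[2 * i + 1] = cast(float) sin(theta);
    }
    twiddleTable = assumeUnique(table);
}
```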
I tried this optimization and it gets me down to ~660 milliseconds on my crappy 512K L2 cache machine, or 615 if I use single-precision floats for the lookup table (which lets the table hold twice as many entries). This is comparable to R, which is widely respected numerics software but isn't resorting to any black magic. The only issue I see is that, with L2 caches being as huge as they've become lately, people might get upset if importing std.fft or std.numerics or whatever silently consumes 8 megs of RAM via static this, but doing lazy initialization raises the standard ugly Singleton threading issue in an extremely performance-critical situation.
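For comparison, a sketch of what the lazy alternative tends to look like (again with invented names, and keeping only one table size): a thread-local cached reference keeps the hot path cheap, but the first call per thread goes through a lock - the Singleton wrinkle mentioned above.

```d
import std.exception : assumeUnique;
import std.math : cos, sin, PI;

private __gshared immutable(float)[] sharedTable;  // guarded by the lock below
private immutable(float)[] localTable;             // thread-local cached view

immutable(float)[] lazyTwiddles(size_t n)
{
    // Fast path: this thread has already seen a table of the right size.
    if (localTable.length == 2 * n)
        return localTable;

    // Slow path: whoever gets here first builds the table.
    synchronized
    {
        if (sharedTable.length != 2 * n)
        {
            auto t = new float[2 * n];
            foreach (i; 0 .. n)
            {
                immutable theta = -2.0 * PI * i / n;
                t[2 * i]     = cast(float) cos(theta);
                t[2 * i + 1] = cast(float) sin(theta);
            }
            sharedTable = assumeUnique(t);
        }
        localTable = sharedTable;
    }
    return localTable;
}
```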
On 8/2/2010 10:23 AM, Don Clugston wrote:
> On 2 August 2010 15:41, David Simcha<dsimcha at gmail.com> wrote:
>
>> Unfortunately I just downloaded the benchmark program for FFTW and my FFT is a ton slower, depending on how you look at it. Using size 2^20 as my benchmark, FFTW takes about 131 seconds to create its plan, even using -oestimate, the fastest planner. However, the plan can be reused if performing multiple FFTs of the same size, and once the plan is created, it can do an FFT of size 2^20 in about 53 milliseconds (which I find almost unbelievable because even sorting an array of size 2^20 using a well-optimized quick sort takes almost that long, and FFT seems like it should have a much larger constant than quick sort), compared to my latest improvements to around 730 milliseconds on the hardware I'm benchmarking on.
>>
>> On the other hand, for one-off use cases, the lack of needing to create a
>> plan is a big win, both from a speed and a simplicity of API point of view.
>> Benchmarking against R, which doesn't appear to use plans, making the
>> comparison somewhat more relevant, things look better for my FFT: R takes
>> about 610 milliseconds for a size 2^20 pure real FFT.
>>
> All you're seeing is the L2 cache. Did you see the link I posted to
> the NG about the internals of FFTW? The graph at the top shows FFTW is
> 40 times faster than the 'numerical recipes' code for 2^^20. So your
> factor of 13 isn't so bad. Based on that graph, if you reduce it to
> say 2^^15, the factor should drop significantly. Adding a little bit
> of cache awareness (using core.cpuid) should be able to avoid the
> performance cliff.
> Also, DMD's floating point optimiser is so primitive, you lose up to a
> factor of two immediately.
August 02, 2010 [phobos] FFT
Posted in reply to David Simcha

Sounds ok.
David Simcha wrote:
> I guess my response to this is that, IMHO, the ridiculously simple API I have now is an absolute must-have at least as a "simple things simple" subset of a richer API anyhow, so I don't see this as a problem. For example, if we were to get fancier and start having plan objects, support N-D transforms, etc., I would still insist that you could get reasonable results in a single line by simply doing fft(myRange); without explicitly creating a plan, specifying a dimensionality, or anything else a more complex API might allow.
>
> Furthermore, I'm thinking about how N-D FFTs would be supported, and I figure the best way would probably be to just introspect the object at compile time anyhow. Right now, anything but a range of numeric types or a range of Complex doesn't compile. This would be expanded to allow matrices and do what the user means, or allow ranges of ranges and do what the user means, or allow ranges of ranges of ranges of ranges (4-D transform) or whatever.
>
> On 8/2/2010 10:20 PM, Walter Bright wrote:
>> Can we use what the best FFTs provide as a guide to what the API should cover? I think it's fine if it's something we can grow into, I just don't want to keep doing what we've done too often already - chuck the old one and break everyone's code.
August 04, 2010 [phobos] FFT
Posted in reply to Don Clugston

Ok, I've got some more optimizations. I figured out that it's worth trading off some space to make the sin/cos lookup tables be in contiguous memory blocks for all sub-FFTs. I'm now down to ~340 ms on my benchmark, but fresh out of optimization ideas. Here are the performance characteristics of the code in case anyone has any clever ideas:

1. Skipping the slowFourier2 base case and simply returning (yes, this produces incorrect results, but it's just to see the cost) shaves off a surprising ~110 milliseconds, or roughly 30%. This is probably because slowFourier2 always accesses two memory positions that are virtually guaranteed to be cache misses.

2. Switching from doubles to floats for both input and output gains ~60 milliseconds, probably due to better cache performance.

3. Skipping the entire twiddle-factor loop by sticking a return statement before it saves about 190 milliseconds, or only a little over half of the total execution time. This is surprising because a naive reader of the code would probably think that almost all execution time is spent in this loop.

4. I didn't bother (at least, not yet) with a breadth-first base case because recursion overhead seems fairly negligible. The code with the return statement before the twiddle-factor loop and with the slowFourier2 call commented out executes in about 40 ms - only ~12% of the total. Also, the recursive implementation is more cache-efficient up to a point, and a breadth-first implementation wouldn't be entirely overhead-free.

I've attached the latest code for review. If everyone's happy with it and no one has any more clever optimization ideas, I'd like to include it in the next release. If I do, I guess std.numerics would be the place for it?

On 8/2/2010 10:23 AM, Don Clugston wrote:
> On 2 August 2010 15:41, David Simcha<dsimcha at gmail.com> wrote:
>
>> Unfortunately I just downloaded the benchmark program for FFTW and my FFT is a ton slower, depending on how you look at it. Using size 2^20 as my benchmark, FFTW takes about 131 seconds to create its plan, even using -oestimate, the fastest planner. However, the plan can be reused if performing multiple FFTs of the same size, and once the plan is created, it can do an FFT of size 2^20 in about 53 milliseconds (which I find almost unbelievable because even sorting an array of size 2^20 using a well-optimized quick sort takes almost that long, and FFT seems like it should have a much larger constant than quick sort), compared to my latest improvements to around 730 milliseconds on the hardware I'm benchmarking on.
>>
>> On the other hand, for one-off use cases, the lack of needing to create a plan is a big win, both from a speed and a simplicity of API point of view. Benchmarking against R, which doesn't appear to use plans, making the comparison somewhat more relevant, things look better for my FFT: R takes about 610 milliseconds for a size 2^20 pure real FFT.
>>
> All you're seeing is the L2 cache. Did you see the link I posted to the NG about the internals of FFTW? The graph at the top shows FFTW is 40 times faster than the 'numerical recipes' code for 2^^20. So your factor of 13 isn't so bad. Based on that graph, if you reduce it to say 2^^15, the factor should drop significantly. Adding a little bit of cache awareness (using core.cpuid) should be able to avoid the performance cliff.
> Also, DMD's floating point optimiser is so primitive, you lose up to a factor of two immediately.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fft3.d
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20100804/20487348/attachment.ksh>
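For readers without the attachment handy, here is a schematic of the recursive radix-2 structure the numbered observations refer to - a tiny slowFourier2-style base case, recursion on the even/odd halves, then the twiddle-factor combine loop. It is only a sketch, not the attached fft3.d:

```d
import std.complex;
import std.math : cos, sin, PI;

// Schematic radix-2 decimation-in-time FFT.  n must be a power of two.
Complex!double[] fftSchematic(Complex!double[] x)
{
    immutable n = x.length;
    if (n == 1)
        return x.dup;                    // trivial base case
    if (n == 2)                          // "slowFourier2"-style base case
        return [x[0] + x[1], x[0] - x[1]];

    // Split into even- and odd-indexed halves.  These strided reads are
    // the near-guaranteed cache misses mentioned in observation 1.
    auto even = new Complex!double[n / 2];
    auto odd  = new Complex!double[n / 2];
    foreach (i; 0 .. n / 2)
    {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    auto e = fftSchematic(even);
    auto o = fftSchematic(odd);

    // Twiddle-factor loop (observation 3): combine the two half-size
    // transforms.  A real implementation reads cos/sin from a table.
    auto result = new Complex!double[n];
    foreach (k; 0 .. n / 2)
    {
        immutable theta = -2.0 * PI * k / n;
        auto w = Complex!double(cos(theta), sin(theta));
        result[k]         = e[k] + w * o[k];
        result[k + n / 2] = e[k] - w * o[k];
    }
    return result;
}
```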
August 03, 2010 [phobos] FFT
Posted in reply to David Simcha

On 08/03/2010 11:01 PM, David Simcha wrote:
> Ok, I've got some more optimizations. I figured out that it's worth trading off some space to make the sin/cos lookup tables be in contiguous memory blocks for all sub-FFTs. I'm now down to ~340 ms on my benchmark, but fresh out of optimization ideas.
I meant to say this after your previous post: tabulating the function is not the whole story. What people do is approximate the function with a few terms of its Taylor expansion, often in conjunction with tables.
For my thesis I needed a fast tanh. I dug up a paper that had some weird formulas depending on the magnitude of the argument. I first built a table and then I interpolated using the formulas in between the table elements. It was a smashing success.
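The generic shape of that trick for sine might look like the sketch below - an arbitrary table size, an "argument in turns" convention, and linear interpolation between entries; purely illustrative, and David's reply further down explains why it didn't pay off for his FFT:

```d
import std.math : sin, PI;

enum tableSize = 1024;

// Built at compile time via CTFE, like the small precomputed tables
// David already uses.
immutable double[tableSize + 1] sinTable = ()
{
    double[tableSize + 1] t;
    foreach (i; 0 .. tableSize + 1)
        t[i] = sin(2.0 * PI * i / tableSize);
    return t;
}();

/// Approximates sin(2 * PI * x) for x in [0, 1).
double fastSinTurns(double x)
{
    immutable scaled = x * tableSize;
    immutable idx  = cast(size_t) scaled;   // floor, since x >= 0
    immutable frac = scaled - idx;
    return sinTable[idx] * (1.0 - frac) + sinTable[idx + 1] * frac;
}
```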
Google for <<fast sine approximation>> or so; there seems to be plenty of good stuff out there.
Andrei
August 03, 2010 [phobos] FFT

David Simcha wrote:
> Yea, I actually thought of stuff like this and did a version with a simple linear interpolation to get numbers between a relatively small table of precomputed values. It was neither as fast nor as simple as the current solution, though it was more space efficient. sin() and cos() are pretty fast (<100 clock cycles, apparently) so a few mathematical operations to round indices to the nearest integer, perform 2 table lookups and linearly interpolate can easily destroy most of the speed advantage.

Then I think a reasonable step is to do what researchers always do - take a look at what others have done.

> The key thing about the current solution is that, at every sub-FFT size, the elements are accessed in sequential order without a stride or anything

Mmm, that sounds like a forward range to me :o). (I'm sure there are other reasons for which you need random access, so no need to reply unless you got illuminated like those people in the "Windows 7 was my idea" commercials. Man, what crappy commercials.)

Andrei
August 03, 2010 [phobos] FFT
Posted in reply to Andrei Alexandrescu

Andrei Alexandrescu wrote:
> ([...] so no need to reply unless you got illuminated like those people in the "Windows 7 was my idea" commercials. Man what crappy commercials.)
But you remembered them, so they worked. (And if I ever express interest in going into ads, stage an intervention!)
August 04, 2010 [phobos] FFT
Posted in reply to Benjamin Shropshire

Now that I have an install of Win7, I have to say that the interface is the worst Windows interface so far. I've nearly uninstalled it out of frustration a number of times already.
Regarding the FFT code--nice work! std.numerics works for me.
Sent from my iPhone
On Aug 3, 2010, at 9:49 PM, Benjamin Shropshire <benjamin at precisionsoftware.us> wrote:
> Andrei Alexandrescu wrote:
>> ([...] so no need to reply unless you got illuminated like those people in the "Windows 7 was my idea" commercials. Man what crappy commercials.)
>
> But you remembered them, so they worked. (And if I ever express interest in going into ads, stage an intervention!)