August 02, 2010
Yeah, I saw the article and browsed it.  I didn't read it in detail, though, because it seemed like it was all about really low-level stuff that I wasn't sure how to apply.

On Mon, Aug 2, 2010 at 10:23 AM, Don Clugston <dclugston at googlemail.com> wrote:

> On 2 August 2010 15:41, David Simcha <dsimcha at gmail.com> wrote:
> > Unfortunately I just downloaded the benchmark program for FFTW, and my
> > FFT is a ton slower, depending on how you look at it.  Using size 2^20
> > as my benchmark, FFTW takes about 131 seconds to create its plan, even
> > using -oestimate, the fastest planner.  However, the plan can be reused
> > when performing multiple FFTs of the same size, and once the plan is
> > created, it can do an FFT of size 2^20 in about 53 milliseconds (which
> > I find almost unbelievable, because even sorting an array of size 2^20
> > with a well-optimized quicksort takes almost that long, and FFT should
> > have a much larger constant factor than quicksort).  That compares to
> > around 730 milliseconds for my latest improvements on the hardware I'm
> > benchmarking on.
> >
> > On the other hand, for one-off use cases, not needing to create a plan
> > is a big win, both for speed and for simplicity of the API.
> > Benchmarking against R, which doesn't appear to use plans (making the
> > comparison somewhat more relevant), things look better for my FFT: R
> > takes about 610 milliseconds for a size 2^20 pure real FFT.
>
> All you're seeing is the L2 cache. Did you see the link I posted to
> the NG about the internals of FFTW? The graph at the top shows FFTW is
> 40 times faster than the 'numerical recipes' code for 2^^20. So your
> factor of 13 isn't so bad. Based on that graph, if you reduce it to,
> say, 2^^15, the factor should drop significantly. Adding a little bit
> of cache awareness (using core.cpuid) should be able to avoid the
> performance cliff.
> Also, DMD's floating-point optimiser is so primitive that you lose up
> to a factor of two immediately.
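
For concreteness, the core.cpuid-based cache awareness Don suggests could look something like the sketch below. The dataCaches accessor and its size-in-kilobytes convention are assumptions about the druntime API, and the dispatch heuristic is purely illustrative:

import core.cpuid : dataCaches;
import std.complex : Complex;

// Query the largest reported data cache once; fall back to a small,
// conservative guess if CPUID reports nothing useful.
size_t largestDataCacheBytes()
{
    size_t best = 64 * 1024;
    foreach (c; dataCaches)           // assumed: c.size is in kilobytes
        if (c.size * 1024 > best)
            best = c.size * 1024;
    return best;
}

// Illustrative heuristic: a size-n complex-double FFT whose working set
// exceeds the cache should switch to a blocked, cache-friendly strategy.
bool fitsInCache(size_t n)
{
    return n * Complex!double.sizeof <= largestDataCacheBytes();
}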
August 02, 2010
BTW, I've started thinking a little more about the big-picture issues here, and I'm debating whether it's a higher priority to improve performance on power-of-two sizes or to try to support other radix values.

There are two use cases for an FFT that I'm familiar with.  The power-of-two limitation isn't severe in either of them.

1.  Getting an idea of what the spectrum of a signal looks like.  Here, it's common to pad with zeros because the plots look cleaner, even if your FFT library doesn't require it.

2.  Computing a convolution.  Here, padding with zeros is necessary anyway, to prevent the signal from being "interpreted" as periodic; see the sketch below.
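
To make the padding concrete, here is a minimal sketch of FFT-based linear convolution that zero-pads both inputs and rounds the length up to the next power of two. The fft/inverseFft names are assumptions, an API shaped like the one under discussion (chosen to match what std.numeric later settled on):

import std.complex, std.numeric;

// Linear convolution via FFT.  The product of two DFTs is a *circular*
// convolution, so both signals must be padded with zeros out to at least
// a.length + b.length - 1 samples; rounding that up to a power of two is
// then nearly free and satisfies a power-of-two-only FFT.
// Assumes both inputs are non-empty.
double[] fftConvolve(const double[] a, const double[] b)
{
    immutable needed = a.length + b.length - 1;

    size_t n = 1;
    while (n < needed) n *= 2;            // next power of two >= needed

    auto pa = new double[n];  pa[] = 0.0;  pa[0 .. a.length] = a[];
    auto pb = new double[n];  pb[] = 0.0;  pb[0 .. b.length] = b[];

    auto fa = fft(pa);                    // assumed std.numeric-style API
    auto fb = fft(pb);
    foreach (i; 0 .. n)
        fa[i] *= fb[i];                   // pointwise product in frequency domain

    auto full = inverseFft(fa);           // imaginary parts ~0 for real input
    auto result = new double[needed];
    foreach (i; 0 .. needed)
        result[i] = full[i].re;
    return result;
}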

Are there any major use cases where the power of two limitation is severe, or should I just focus on optimizing powers of 2 and call it a day?

August 02, 2010
Oh, also, I don't think cache effects are the main bottleneck, because switching to single-precision floats for both input and output has a negligible effect on performance, even though it cuts the size of the working set in half.

August 02, 2010
I'll defer the answer to this one to others, as I haven't used FFTs in a long time.

I do remember, however, that the discrete cosine transform was actually more popular in the circles I frequented. Would it be difficult to adapt your implementation to offer a DCT?


Andrei

August 02, 2010
I know basically nothing about discrete cosine transforms except what I've learned in the past few minutes from Wikipedia, but apparently an FFT can be turned into a DCT with O(N) extra work, and it's not terribly uncommon to compute a DCT as an FFT plus some O(N) pre- and post-processing.
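
For instance, here is a minimal sketch of the classic even-odd-permutation trick (due to Makhoul) that computes an unnormalized DCT-II from a single same-size complex FFT. The fft name again assumes a std.numeric-style API, and the input length is assumed to be a power of two:

import std.complex, std.math, std.numeric;

// DCT-II via one same-size FFT: permute the input so the even-index
// samples come first and the odd-index samples follow in reverse, take
// the FFT, then recover X[k] = Re(exp(-i*PI*k/(2N)) * V[k]).
double[] dctViaFft(const double[] x)
{
    immutable n = x.length;

    auto v = new double[n];
    foreach (i; 0 .. (n + 1) / 2)
        v[i] = x[2 * i];                  // even-index samples, in order
    foreach (i; 0 .. n / 2)
        v[n - 1 - i] = x[2 * i + 1];      // odd-index samples, reversed

    auto V = fft(v);                      // assumed std.numeric-style API

    auto result = new double[n];
    foreach (k; 0 .. n)
    {
        immutable theta = PI * k / (2.0 * n);
        // Re(exp(-i*theta) * V[k]) expanded into real arithmetic:
        result[k] = V[k].re * cos(theta) + V[k].im * sin(theta);
    }
    return result;
}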

August 02, 2010
On 2 August 2010 19:49, David Simcha <dsimcha at gmail.com> wrote:
> Oh, also, I don't think that cache effects are the main bottleneck because switching to single-precision floats for both input and output has a negligible effect on performance even though it cuts the size of the working set in half.

Interesting. Still, I think that because of the way FFT works, once you're bigger than the cache, nearly every memory access will be a cache miss. It could be that although the memory footprint halves, the number of cache misses remains constant.

Anyway, the reason I posted the link was not so much to help with implementation, but more because it gives a great feel for what's involved in a "state of the art" FFT library. I suspect there's a sweet spot with high convenience, small code size, and good-enough performance.
August 02, 2010
Yeah, I also just ran it on a machine with a 4MB L2 cache instead of the 512KB one I had been using.  These machines are usually about equally fast for purely CPU-bound code in my experience, and that holds here too: they're within 5% of each other.

August 02, 2010

Don Clugston wrote:
> Anyway, the reason I posted the link was not so much to help with
> implementation, but more because it gives a great feel for what's
> involved in a "state of the art" FFT library. I suspect there's a
> sweet spot with high convenience, small code size, and good-enough
> performance.

The most important thing for a Phobos implementation of FFT (or any other module) is getting the API right. Having that right means we can always swap it out for a better implementation without breaking user code.
August 02, 2010

For FFT, this would mean looking at the best FFT implementations out there to see what their APIs look like. Phobos' API should support a full feature set, even if for now the implementation throws exceptions for things it doesn't support yet; a sketch of what that could look like follows.
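
One hypothetical shape for that policy, with fftGeneral and the power-of-two check as illustrative assumptions rather than a proposed signature:

import std.complex, std.exception, std.numeric;

// Expose the general entry point now; reject what the current code
// can't handle yet, so the signature needn't change when support lands.
Complex!double[] fftGeneral(const double[] x)
{
    enforce((x.length & (x.length - 1)) == 0,
        "FFT for non-power-of-two sizes is not implemented yet.");
    return fft(x);   // power-of-two path (std.numeric-style fft assumed)
}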
August 02, 2010
Well, then we would need an API for matrices, or something similar, to specify how multidimensional transforms work.  In dstats, I've just been using ranges of ranges or tuples of ranges, because they're available and work reasonably well, but I haven't really thought in detail about whether that's the optimal solution.

Also, an N-D transform is apparently trivially implementable in terms of a 1-D transform (see the row-column sketch below), so I don't know whether an API for this would be a must-have.
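
A minimal sketch of that row-column reduction for the 2-D case, again assuming a std.numeric-style fft that accepts complex input of power-of-two length:

import std.complex, std.numeric;

// Row-column 2-D FFT: a 1-D transform over every row, then over every
// column.  Assumes a rectangular matrix with power-of-two dimensions.
Complex!double[][] fft2d(Complex!double[][] m)
{
    foreach (ref row; m)
        row = fft(row);                   // transform each row in turn

    foreach (j; 0 .. m[0].length)         // then each column
    {
        auto col = new Complex!double[m.length];
        foreach (i; 0 .. m.length)
            col[i] = m[i][j];
        col = fft(col);
        foreach (i; 0 .. m.length)
            m[i][j] = col[i];
    }
    return m;
}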
