May 24, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> 	for(int i = 0; i < res; i++)
> 		QuarterSinTab[i] = sin(PI*(i/cast(double)res));	

Btw, I think this creates a half sine, not a quarter, so you want (?):

     QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));

> 	QuarterSinTab[$-1] = QuarterSinTab[0];

This creates a discontinuity if you create a quarter sine; in that case you probably wanted:

       QuarterSinTab[$-1] = sin(PI*0.5);

Otherwise you will never get 1 or -1.

But none of these affect performance.

May 24, 2019
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
>> Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster… so it is not being optimized out.
>
> Well, when I run this modified version:
>
> https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c
>
> on https://run.dlang.io/
>
> then I get:
>
> LUT:    709
> sin(x): 2761
>
> So the LUT is 3-4 times faster even with your quarter-LUT overhead.

FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<63)*2.0) );
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin…

You just have to make sure that the generated instructions fill the entire CPU pipeline.


May 24, 2019
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
> On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
>> On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
>>> Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster… so it is not being optimized out.
>>
>> Well, when I run this modified version:
>>
>> https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c
>>
>> on https://run.dlang.io/
>>
>> then I get:
>>
>> LUT:    709
>> sin(x): 2761
>>
>> So the LUT is 3-4 times faster even with your quarter-LUT overhead.
>
> FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:
>
> auto fastQuarterLookup(double x){
>     const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<63)*2.0) );
>     const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
>     … etc
>
> So it seems like a quarter-wave LUT is 27 times faster than sin…
>
> You just have to make sure that the generated instructions fill the entire CPU pipeline.


Well, the QuarterWave was supposed to generate just a quarter, since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working and then figured out the sign flipping.

Essentially one just has to tabulate a quarter of sin, that is, from 0 to 90°, and then get the sign right. This allows one to have 4 times the resolution or 1/4 the size at the same cost.

Or, to put it another way, sin has 4-fold redundancy.
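
In straightforward, branchy form the symmetry trick would look something like this (untested sketch; x here is the phase in cycles, and quarterSin/quarterTab are made-up names for a table of sin over 0 to 90°):

import std.math : floor;

// Reconstruct sin(2*PI*x) from a quarter-wave table.
// quarterTab[0..N] holds sin(PI/2 * i/N), so it needs N+1 entries.
double quarterSin(double x, const double[] quarterTab)
{
    const size_t N = quarterTab.length - 1;
    const double p = (x - floor(x)) * 4.0;   // position in quarters, 0..4
    const int quadrant = cast(int)p;         // 0..3
    double t = p - quadrant;                 // position within the quadrant
    if (quadrant & 1) t = 1.0 - t;           // quadrants 1 and 3: reverse
    const double s = quarterTab[cast(size_t)(t * N)]; // no interpolation here
    return (quadrant & 2) ? -s : s;          // quadrants 2 and 3: negate
}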

I'll check out your code, thanks for looking in to it.
May 24, 2019
On Friday, 24 May 2019 at 12:01:55 UTC, Alex wrote:
> Well, the QuarterWave was supposed to generate just a quarter, since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working and then figured out the sign flipping.

Sure, it is a tradeoff. You pollute the cache less this way, but you have to figure out the sign and the lookup direction.

The trick is then to turn the phase into an unsigned integer; then you get:

1. the highest bit will tell you that you need to use the inverse sign for the result.
2. the next highest bit will tell you that you need to look up in the reverse direction.

What is key to performance here is that x86 can do many simple integer/bit operations in parallel, but only a few floating point operations.

Also avoid all conditionals. Use bitmasking instead, something along the lines of:

const ulong phase = mantissa^((1UL<<63)-((mantissa>>62)&1));
const uint quarterphase = (phase>>53)&511;

(Haven't checked the correctness of this, but this shows the general principle.)
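
To make it concrete, here is an untested toy version using a 32-bit phase instead (one full cycle = 2^32); the constants are mine and may be off by ±1 as well:

// Bit 31 selects the half (sign), bit 30 selects the quarter (direction).
uint quarterIndex(uint phase, out int sign)
{
    sign = 1 - cast(int)((phase >> 30) & 2);  // +1 or -1, branchless
    const uint mask = -((phase >> 30) & 1);   // 0 or 0xFFFFFFFF
    return (phase ^ mask) & 0x3FFF_FFFF;      // 30-bit index, reversed in
                                              // odd quarters
}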

Ola.

May 24, 2019
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
> On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
>> On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
>>> Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster… so it is not being optimized out.
>>
>> Well, when I run this modified version:
>>
>> https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c
>>
>> on https://run.dlang.io/
>>
>> then I get:
>>
>> LUT:    709
>> sin(x): 2761
>>
>> So the LUT is 3-4 times faster even with your quarter-LUT overhead.
>
> FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:
>
> auto fastQuarterLookup(double x){
>     const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<63)*2.0) );
>     const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
>     … etc
>
> So it seems like a quarter-wave LUT is 27 times faster than sin…
>

If so then that is great and what I'd expected to achieve originally.

I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out.

> You just have to make sure that the generated instructions fill the entire CPU pipeline.

What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure", that presupposes there is a method or algorithm to know.

Are you saying that I did not have enough instructions that the pipeline could take advantage of?


In any case, I'll check your code out and try to figure out the details and see exactly what is going on.

If it truly is 27x faster, then that is very relevant, and knowing why is important.

Of course, a lot of that might simply be due to LDC and I wasn't able to determine this.

Can you do some stats for dmd and ldc?

You seem to be interested in this, are you up for a challenge?


The idea is to use tables to optimize these functions.

Half sin was done above, but quarter sine can be used (there are 4 quadrants but only one has to be tabulated, because all the others differ by sign and reversal (1 - x); it's a matter of figuring out the sign).

Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic.

Then there is exp:

exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x})

and so one can optimize this by tabulating exp(x) for 0 <= x < 1, which covers the fractional part {x}.

Then tabulate it for a wide range of integers (probably in powers of 2).


e.g.,

exp(3.5) = exp(3)*exp(.5)

both come from a lookup table.

or one could do

exp(3) = exp(1 + 1 + 1) = exp(1)*exp(1)*exp(1)

(this requires iteration if we do not tabulate exp(3).)

Hence one would limit the iteration count by tabulating things like exp(10^k) and exp(k) for -10 < k < 10.

The idea is basically one can get really dense LUT's for a small range that then are used to build the function for arbitrary inputs.

With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation (and ideally quadratic or cubic if the costs are not too high).
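
Something along these lines (untested sketch; all the names and table sizes here are made up, and it only handles -10 <= x < 10):

import std.math : exp, floor;

enum N = 1024;                      // resolution of the fractional table
__gshared double[N + 1] expFracTab; // exp(x) for 0 <= x <= 1
__gshared double[21] expIntTab;     // exp(k) for -10 <= k <= 10

void initExpTabs()
{
    foreach (i; 0 .. N + 1)
        expFracTab[i] = exp(cast(double)i / N);
    foreach (k; -10 .. 11)
        expIntTab[k + 10] = exp(cast(double)k);
}

double fastExp(double x)            // only valid for -10 <= x < 10
{
    const double f = floor(x);
    const double t = (x - f) * N;   // scaled fractional part of x
    const int i = cast(int)t;
    const double frac = t - i;
    // linear interpolation between neighbouring fractional entries
    const double e = (1.0 - frac) * expFracTab[i] + frac * expFracTab[i + 1];
    return expIntTab[cast(int)f + 10] * e;
}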

That was what I was starting to work on before I got thrown off by it being much slower.

It seems you already have the half-sin done.

One could do things like sin, cos (obviously easy), exp, exp()^2, erf(x), sinh, cosh, etc. Things like sec could also be done, as it would save a division, since it seems divisions take about 30 cycles. But it would depend on the memory used.

[I can't mess with this now because I've gotta work other things at the moment]

Thanks.

May 24, 2019
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
>> So it seems like a quarter-wave LUT is 27 times faster than sin…
>>
>
> If so then that is great and what I'd expected to achieve originally.
>
> I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out.

Yes, the gist linked above is just your code with minor changes; that was 4-5 times faster. To get to 27 times faster you need to use the integer bit-manipulation scheme that I suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such.

Anyway, it is more fun for you to code up your own version than to try to figure out mine. Just follow the principles and you should get close to that performance, I think. (I'll refine the code later, but I don't really have time now.)


>> You just have to make sure that the generated instructions fills the entire CPU pipeline.
>
> What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure", that presupposes there is a method or algorithm to know.

Yes, you have to look up information about the CPU in your computer. Each core has a set of "lanes" that execute simultaneously. Some instructions can go into many lanes, but not all. Then there might be bubbles in the pipeline (the lane) that can be filled up with integer/bit-manipulation instructions. It is tedious to look that stuff up, so as a last resort: just try to mix simple integer with simple double computations (and avoid division).


> Are you saying that I did not have enough instructions that the pipeline could take advantage of?

Yes, you most likely got bubbles: empty slots where the core has nothing to send down a lane because it is waiting for some computation to finish so that it can figure out what to do next.

Basic optimization:

Step 1: reduce dependencies between computations

Step 2: make sure you generate a mix of simple integer/double instructions that can fill up all the computation lanes at the same time

Step 3: make sure loops only contain a few instructions; the CPU can unroll loops in hardware if they are short. (Not applicable here, though.)


> Of course, a lot of that might simply be due to LDC and I wasn't able to determine this.

I think I got better performance because I filled more lanes in the pipeline.


> Half sin was done above, but quarter sine can be used (there are 4 quadrants but only one has to be tabulated, because all the others differ by sign and reversal (1 - x); it's a matter of figuring out the sign).

Yes, as I mentioned, the first bit of the phase is the sign and the second bit of the phase is the reversion of the indexing.


> Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic.

It adds perhaps 2-5 cycles or so, at a guess.


> exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x})
[...]
> With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation

I think you need to do something with x before you look up, so that you have some kind of fast nonlinear mapping to the indexes.

But for exp() you might prefer an approximation instead, perhaps a polynomial Taylor series.

Searching the web should give some ideas.


> It seems you already have the half-sin done.

I did the quarter sin though, not the half-sin (but that is almost the same, just drop the reversion of the indexing).

(Let's talk about this later, since we both have other things on our plate. Fun topic! :-)

May 24, 2019
On Friday, 24 May 2019 at 12:56:55 UTC, Ola Fosheim Grøstad wrote:
> suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such.

So before using such code for anything important, make sure that it is tested for the edge cases: denormal numbers (values close to zero), roundoff errors where computations "accidentally" cross over 1.0, and so on.
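
For instance, something like this (untested sketch; it assumes the fastQuarterLookup/init pair from later in this thread, where x is the phase in cycles, so the reference is sin(2*PI*x)):

unittest
{
    import std.math : sin, abs, PI;

    init();  // fill QuarterSinTab before testing
    // plain values, quadrant boundaries, a near-wraparound value and an
    // input at the edge of the denormal range
    foreach (x; [0.0, 0.125, 0.25, 0.5, 0.75, 0.9999999, double.min_normal])
        assert(abs(fastQuarterLookup(x) - sin(2*PI*x)) < 1e-3);
}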

May 24, 2019
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
> If it truly is 27x faster, then that is very relevant, and knowing why is important.
>
> Of course, a lot of that might simply be due to LDC and I wasn't able to determine this.

Just one more thing you really ought to consider:

It isn't obvious that a LUT using double will be more precise than computing sin using single precision float.

So when I use single-precision float with ldc and "-O3 -ffast-math", I get roughly 300ms. So in that case the LUT is only 3 times faster.

Perhaps not worth it then. You might as well just use float instead of double.

The downside is that -ffast-math makes all your computations more inaccurate.

It is also possible that recent processors can do multiple sin/cos as SIMD.

Too many options… I know.

May 24, 2019
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
>     const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));

Yep, this was wrong (0 or -1). Should be something like (1 or -1):

    const double sign = cast(double)(1-cast(int)((mantissa>>62)&2));

You'll have to code it up more carefully than I did, just following the same principles. (These ±1 errors do not affect the performance.)

Also, for comparison, just running the 2 lookups in the loop comes in at 32ms.

So 3 times that sounds reasonable for extracting the phase, determining the sign, reversing the phase and doing the linear interpolation.

May 24, 2019
This appears to get roughly the same results as sin(x):


import std.math : sin, floor, PI;

__gshared double[512+1] QuarterSinTab;

void init(){
    const res = QuarterSinTab.length-1;
    // Tabulate the first quarter of the sine, 0 .. PI/2.
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));
    // One extra entry so the interpolation below may read index 512.
    QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(double x){
    // Phase as 54-bit fixed point: bit 53 = half (sign), bit 52 = quarter (direction).
    const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<54)) );
    // +1 in the first half of the cycle, -1 in the second.
    const double sign = cast(double)(1 - cast(int)((mantissa>>52)&2));
    // XOR with all ones reverses the direction in odd quarters; the final
    // mask drops bit 53 so only the 53-bit quarter phase remains.
    const ulong phase = (mantissa^((1UL<<53)-((mantissa>>52)&1)))&((1UL<<53)-1);
    // The top 9 bits select one of the 512 table entries...
    const uint quarterphase = (phase>>43)&511;
    // ...and the remaining 43 bits give the interpolation weight.
    const double frac = cast(double)(phase&((1UL<<43)-1))*cast(double)(1.0/(1UL<<43));
    return sign*((1.0-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}
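
For reference, this is how I would exercise it (x is the phase in cycles, so it should approximate sin(2*PI*x); untested):

void main()
{
    import std.stdio : writefln;

    init();
    foreach (x; [0.0, 0.125, 0.25, 0.5, 0.75])
        writefln("%f vs %f", fastQuarterLookup(x), sin(2*PI*x));
}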


Ola.