Thread overview
Performance of tables slower than built in?
May 22, 2019  JS
May 22, 2019  Adam D. Ruppe
May 22, 2019  matheus
May 23, 2019  Alex
May 23, 2019  Timon Gehr
May 23, 2019  Alex
May 22, 2019  rikki cattermole
May 22, 2019  Basile B.
May 22, 2019  Basile B.
May 23, 2019  Alex
May 24, 2019  Basile B.
May 23, 2019  Alex
May 23, 2019  Alex
May 24, 2019  Alex
May 24, 2019  Alex
May 24, 2019  Alex
May 25, 2019  NaN
May 25, 2019  NaN
May 25, 2019  NaN
May 25, 2019  NaN
May 25, 2019  Danny Arends
May 22, 2019
I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?


The code below uses matplotlibd for plotting, but that part isn't necessary; the timing code is at the bottom.
Ideally I'd like to have quadratic interpolation and all that, but if the code is going to be slower and less accurate, then what is the point?

Surely I'm doing something wrong?

I wasn't able to test this with LDC because it's giving me linker errors for some reason.

I'm also curious as to what D is using internally in the first place.

The generation uses a quarter sine, since the waveform is symmetric. This does slow the code down a little, but I can't imagine the few extra ops it uses cost more than whatever is used for computing sin.





import std.math;
import std.range;
import std.random;
import std.algorithm;
import plt = matplotlibd.pyplot;
import std.stdio, core.stdc.stdlib;


__gshared double[] QuarterSinTab;

auto GenQuarterSinTab(int res = 4*100)
{
	res++;	// Add extra endpoint to make linear interpolation easier/faster
	alias T = typeof(QuarterSinTab[0]);
	if (QuarterSinTab !is null) free(QuarterSinTab.ptr);
	QuarterSinTab = (cast(T*)malloc(T.sizeof*res))[0..res];

	res--;
	for(int i = 0; i < res; i++)
		QuarterSinTab[i] = sin(PI*(i/cast(double)res));	

	QuarterSinTab[$-1] = QuarterSinTab[0];
}

/*
	Uses the tabularized quarter sine and extends it to the full range
*/
auto sinTab(string Method = "Linear")(typeof(QuarterSinTab[0]) x)
{
	alias T = typeof(QuarterSinTab[0]);
	auto len = QuarterSinTab.length - 1;
	auto s = -1;
	auto m = x - cast(int)x;
	
	if (x < 0) { m = m + 1; s = 1; }
	if ((abs(x) % 2) < 1) s = -s;			
	auto a = m*len;
	auto ai = cast(int)(m*len);

	switch(Method)
	{	
		default:
		case "Linear":
			auto f = a - ai;
			return s*((1 - f)*QuarterSinTab[ai] + f*QuarterSinTab[ai+1]);
		case "Constant":
			return s*QuarterSinTab[ai];
		
	}
}

void main() {
	GenQuarterSinTab;

	auto x = iota(-10, 10, 0.01).map!(x => x * PI);
	auto y = x.map!("sin(PI*a)");
	auto z = x.map!((x){ return sin(PI*x) - sinTab(x); });
	//plt.plot(x, y, "r-", ["label": "$y=sin(x)$"]);
	plt.plot(x, z, "b-", ["label": "$y$"]);
	plt.xlim(-5, 5);
	plt.ylim(-0.001, 0.001);
	plt.legend();
	plt.show();

	import std.datetime.stopwatch : StopWatch, AutoStart;
	double xxx = 0;
	auto sw = StopWatch(AutoStart.yes);
	for(double i = 0; i < 10000000; i++) xxx += sinTab(i);
	sw.stop();
	writeln(sw.peek.total!"msecs");
	writeln(xxx);

	xxx = 0;
	sw.reset();
	sw.start();
	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
	sw.stop();
	writeln(sw.peek.total!"msecs");
	writeln(xxx);
}
May 22, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?

There are intrinsic CPU instructions for some of those that can do the math faster than waiting on memory access.

It is quite likely that calculating it is actually faster. Even carefully written and optimized tables tend to give only a very small win over the CPU nowadays.
May 22, 2019
On 22/05/2019 12:22 PM, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
> 
> 
> The code below uses matplotlibd but is not necessary. The timing code is lower.
> Ideally I'd like to have quadratic interpolation and all that but if the code is going to be slower and less accurate then what is the point?
> 
> Surely I'm doing something wrong?
> 
> I wasn't able to test this with LDC because it's giving me linker errors for some reason.
> 
> I'm also curious as to what D is using internally in the first place.

It should be the libc's.

Regarding your code, it looks like your computation is wrong.
I gave it a go (albeit refactored) and it just wouldn't output the right set of values, so performance isn't all that important right now.
May 22, 2019
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>
> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.

Exactly, and by the way, this is an old result at this point.

I remember, a long time ago when I was studying/writing some games, I decided to test pre-computed arrays (LUTs) for SIN/COS against the functions, and the latter beat the arrays pretty easily.

And by the way, when porting old games, you usually get rid of the LUT (if it doesn't change the game logic too much) and use the functions directly.

Matheus.
May 22, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> [...]

Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody has noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ }

Also take care with the types used. With DMD, the implicit coercion of float and double can lead to extra conversions.

You'll directly see a 15% gain after refactoring the switch.
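To make the point concrete, here is a minimal sketch of the `static if` approach (a standalone toy lookup with hypothetical names, not the original QuarterSinTab code):

```d
import std.stdio : writeln;

double[] tab = [0.0, 1.0, 2.0];  // stand-in for a real lookup table

double lookup(string Method = "Linear")(double a)
{
    auto ai = cast(int)a;
    static if (Method == "Linear")   // resolved at compile time: no string
    {                                // comparison survives in the generated code
        auto f = a - ai;
        return (1 - f)*tab[ai] + f*tab[ai + 1];
    }
    else                             // "Constant": nearest-below sample
        return tab[ai];
}

void main()
{
    writeln(lookup!"Linear"(0.5));    // 0.5
    writeln(lookup!"Constant"(0.5));  // 0
}
```

Since Method is a template value parameter, each instantiation contains only the branch it needs; the string never exists in the compiled code path.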
May 22, 2019
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>
>> [...]
>
> Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/}

Oh no... I meant "==" obviously

> Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions.
>
> You'll directly see a 15% gain after refactoring the switch.


May 23, 2019
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>
>> [...]
>
> Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/}
>
> Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions.
>
> You'll directly see a 15% gain after refactoring the switch.

Surely not?!?! Surely the compiler can optimize that switch away, since the value passed is known at compile time? I thought the whole point of not having a static switch (analogous to static if) was that the compiler would optimize these cases for us... and it's just a switch, just a jump table.


May 23, 2019
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>
> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.

Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but surely memory access is faster than multiplying doubles?

And most of the time these functions are computed by some series that requires many terms. I'd expect, say, computing sin to require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the difference).

Do you have any proof of your claims? Like a paper that discusses such things, so I can see what's really going on and how they achieve such performance (and how accurate it is)?
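(For context, by "series" I mean something like the following sketch: a truncated Taylor polynomial for sin, evaluated in Horner form. Real libm kernels use minimax coefficients and range reduction instead, but the multiply count is in the same ballpark.)

```d
import std.math : abs, sin;
import std.stdio : writefln;

// Sketch: sin(x) on roughly [-PI/2, PI/2] via a truncated Taylor series,
// x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated in Horner form:
// about five multiplies and four adds after computing x*x.
double sinSeries(double x)
{
    auto x2 = x * x;
    return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120
             + x2 * (-1.0/5040 + x2 * (1.0/362880)))));
}

void main()
{
    foreach (x; [0.1, 0.5, 1.0])
        writefln("%.12f  %.12f", sinSeries(x), sin(x));
}
```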
May 23, 2019
On 23.05.19 12:21, Alex wrote:
> On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
>> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>
>> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>>
>> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
> 
> Surely not? I'm not sure what method is used to calculate them and maybe a table method is used internally for the common functions(maybe the periodic ones) but memory access surely is faster than multiplying doubles?
> ...

Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.)

I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance.

> And most of the time these functions are computed by some series that requires many terms. I'd expect, say, to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table(it's true that my code is more complex due to the modulos and maybe that is eating up the diff).
> 
> Do you have any proof of your claims? Like a paper that discusses such things so I can see what's really going on and how they achieve such performance(and how accurate)?

Not exactly what you asked, but this might help:
https://www.agner.org/optimize

Also, look up the CORDIC algorithm.
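A minimal sketch of the CORDIC idea (floating point for readability; real implementations use fixed-point shifts and a small table of atan(2^-i) constants):

```d
import std.math : atan, abs, sin, cos;
import std.stdio : writefln;

// Sketch of CORDIC: rotate the vector (x, y) toward angle theta by a fixed
// schedule of micro-rotations atan(2^-i). Each step needs only shifts and
// adds in a fixed-point implementation; doubles are used here for clarity.
// Converges for |theta| <~ 1.74 rad (the sum of all atan(2^-i)).
void cordicSinCos(double theta, out double s, out double c, int n = 40)
{
    double x = 0.6072529350088813;    // 1/gain, K = prod cos(atan(2^-i))
    double y = 0.0, z = theta, pow2 = 1.0;
    foreach (i; 0 .. n)
    {
        auto d = (z >= 0) ? 1.0 : -1.0;   // rotate toward z == 0
        auto nx = x - d * y * pow2;       // the "shift-and-add" rotation
        y = y + d * x * pow2;
        x = nx;
        z -= d * atan(pow2);              // in hardware: a tiny atan LUT
        pow2 *= 0.5;
    }
    c = x;   // ~= cos(theta)
    s = y;   // ~= sin(theta)
}

void main()
{
    double s, c;
    cordicSinCos(1.0, s, c);
    writefln("%.8f %.8f", s, c);   // close to sin(1) and cos(1)
}
```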
May 23, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?

Not when I tried it with one of the online compilers: the LUT is 3-4 times faster on your tight inner loop, though I guess it depends on the CPU.

A LUT should be very fast in a long-running tight inner loop like that once the cache is warm, with such a small LUT that fits in cache.

