Thread overview
Performance of tables slower than built in?
May 22, 2019  JS
May 22, 2019  Adam D. Ruppe
May 22, 2019  matheus
May 23, 2019  Alex
May 23, 2019  Timon Gehr
May 23, 2019  Alex
May 22, 2019  rikki cattermole
May 22, 2019  Basile B.
May 22, 2019  Basile B.
May 23, 2019  Alex
May 24, 2019  Basile B.
May 23, 2019  Alex
May 23, 2019  Alex
May 24, 2019  Alex
May 24, 2019  Alex
May 24, 2019  Alex
May 25, 2019  NaN
May 25, 2019  NaN
May 25, 2019  NaN
May 25, 2019  NaN
May 25, 2019  Danny Arends
May 22, 2019
I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?


The code below uses matplotlibd for plotting, but that part isn't necessary; the timing code is at the bottom.
Ideally I'd like to have quadratic interpolation and all that, but if the code is going to be slower and less accurate, then what is the point?

Surely I'm doing something wrong?

I wasn't able to test this with LDC because it's giving me linker errors for some reason.

I'm also curious as to what D is using internally in the first place.

The generation uses a quarter sine, since the waveform is symmetric. This does slow the code down a little, but I can't imagine the few extra ops it uses cost more than whatever is used for computing sin.





import std.math;
import std.range;
import std.random;
import std.algorithm;
import plt = matplotlibd.pyplot;
import std.stdio, core.stdc.stdlib;


__gshared double[] QuarterSinTab;

auto GenQuarterSinTab(int res = 4*100)
{
	res++;	// Add extra endpoint to make linear interpolation easier/faster
	alias T = typeof(QuarterSinTab[0]);
	if (QuarterSinTab !is null) free(QuarterSinTab.ptr);
	QuarterSinTab = (cast(T*)malloc(T.sizeof*res))[0..res];

	res--;
	for(int i = 0; i < res; i++)
		QuarterSinTab[i] = sin(PI*(i/cast(double)res));	

	QuarterSinTab[$-1] = QuarterSinTab[0];
}

/*
	Uses the tabularized quarter sine and extends it to the full range
*/
auto sinTab(string Method = "Linear")(typeof(QuarterSinTab[0]) x)
{
	alias T = typeof(QuarterSinTab[0]);
	auto len = QuarterSinTab.length - 1;
	auto s = -1;
	auto m = x - cast(int)x;
	
	if (x < 0) { m = m + 1; s = 1; }
	if ((abs(x) % 2) < 1) s = -s;			
	auto a = m*len;
	auto ai = cast(int)(m*len);

	switch(Method)
	{	
		default:
		case "Linear":
			auto f = a - ai;
			return s*((1 - f)*QuarterSinTab[ai] + f*QuarterSinTab[ai+1]);
		case "Constant":
			return s*QuarterSinTab[ai];
		
	}
}

void main() {
	GenQuarterSinTab;

	auto x = iota(-10, 10, 0.01).map!(x => x * PI);
	auto y = x.map!("sin(PI*a)");
	auto z = x.map!((x){ return sin(PI*x) - sinTab(x); });
	//plt.plot(x, y, "r-", ["label": "$y=sin(x)$"]);
	plt.plot(x, z, "b-", ["label": "$y$"]);
	plt.xlim(-5, 5);
	plt.ylim(-0.001, 0.001);
	plt.legend();
	plt.show();

	import std.datetime.stopwatch : StopWatch, AutoStart;
	double xxx = 0;
	auto sw = StopWatch(AutoStart.yes);
	for(double i = 0; i < 10000000; i++) xxx += sinTab(i);
	sw.stop();
	writeln(sw.peek.total!"msecs");
	writeln(xxx);

	xxx = 0;
	sw.reset();
	sw.start();
	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
	sw.stop();
	writeln(sw.peek.total!"msecs");
	writeln(xxx);
}
May 22, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?

There are intrinsic CPU instructions for some of those that can do the math faster than waiting on memory access.

It is quite likely that calculating it is actually faster. Even carefully written and optimized tables tend to give only a very small win over the CPU nowadays.
May 22, 2019
On 22/05/2019 12:22 PM, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
> 
> 
> The code below uses matplotlibd but is not necessary. The timing code is lower.
> Ideally I'd like to have quadratic interpolation and all that but if the code is going to be slower and less accurate then what is the point?
> 
> Surely I'm doing something wrong?
> 
> I wasn't able to test this with LDC because it's giving me linker errors for some reason.
> 
> I'm also curious as to what D is using internally in the first place.

It should be the libc's.

Regarding your code, it looks like your computation is wrong.
I gave it a go (albeit refactored) and it just wouldn't output the right set of values, so performance isn't all that important right now.
May 22, 2019
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>
> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.

Exactly, and by the way, this is an old result at this point.

I remember, a long time ago when I was studying/writing some games, I decided to test pre-computed arrays (LUTs) for SIN/COS against the functions, and the latter beat the arrays pretty easily.

And by the way, when porting old games, you usually get rid of the LUT (if it doesn't change the game logic too much) and use the functions directly.

Matheus.
May 22, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> [...]

Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody has noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ }

Also take care with the types used. With DMD, the implicit coercion of float and double can lead to extra conversions.

You'll directly see a 15% gain after refactoring the switch.
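To make the point concrete, here is a minimal sketch of the `static if` approach (a standalone toy lookup with hypothetical names, not the original QuarterSinTab code):

```d
import std.stdio : writeln;

double[] tab = [0.0, 1.0, 2.0];  // stand-in for a real lookup table

double lookup(string Method = "Linear")(double a)
{
    auto ai = cast(int)a;
    static if (Method == "Linear")   // resolved at compile time: no string
    {                                // comparison survives in the generated code
        auto f = a - ai;
        return (1 - f)*tab[ai] + f*tab[ai + 1];
    }
    else                             // "Constant": nearest-below sample
        return tab[ai];
}

void main()
{
    writeln(lookup!"Linear"(0.5));    // 0.5
    writeln(lookup!"Constant"(0.5));  // 0
}
```

Since Method is a template value parameter, each instantiation contains only the branch it needs; the string never exists in the compiled code path.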
May 22, 2019
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>
>> [...]
>
> Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/}

Oh no... I meant "==" obviously

> Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions.
>
> You'll directly see a 15% gain after refactoring the switch.


May 23, 2019
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>
>> [...]
>
> Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/}
>
> Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions.
>
> You'll directly see a 15% gain after refactoring the switch.

Surely not?!?! Surely the compiler can optimize that switch away, since the value passed is known at compile time? I thought the whole point of not having a static switch (analogous to static if) was that the compiler would optimize these cases for us... and it's just a switch, just a jump table.


May 23, 2019
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>
> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.

Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but surely memory access is faster than multiplying doubles?

And most of the time these functions are computed by some series that requires many terms. I'd expect, say, computing sin to require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the difference).

Do you have any proof of your claims? Like a paper that discusses such things, so I can see what's really going on and how they achieve such performance (and how accurate it is)?
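(For context, by "series" I mean something like the following sketch: a truncated Taylor polynomial for sin, evaluated in Horner form. Real libm kernels use minimax coefficients and range reduction instead, but the multiply count is in the same ballpark.)

```d
import std.math : abs, sin;
import std.stdio : writefln;

// Sketch: sin(x) on roughly [-PI/2, PI/2] via a truncated Taylor series,
// x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated in Horner form:
// about five multiplies and four adds after computing x*x.
double sinSeries(double x)
{
    auto x2 = x * x;
    return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120
             + x2 * (-1.0/5040 + x2 * (1.0/362880)))));
}

void main()
{
    foreach (x; [0.1, 0.5, 1.0])
        writefln("%.12f  %.12f", sinSeries(x), sin(x));
}
```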
May 23, 2019
On 23.05.19 12:21, Alex wrote:
> On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
>> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>
>> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>>
>> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
> 
> Surely not? I'm not sure what method is used to calculate them and maybe a table method is used internally for the common functions(maybe the periodic ones) but memory access surely is faster than multiplying doubles?
> ...

Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.)

I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance.

> And most of the time these functions are computed by some series that requires many terms. I'd expect, say, to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table(it's true that my code is more complex due to the modulos and maybe that is eating up the diff).
> 
> Do you have any proof of your claims? Like a paper that discusses such things so I can see what's really going on and how they achieve such performance(and how accurate)?

Not exactly what you asked, but this might help:
https://www.agner.org/optimize

Also, look up the CORDIC algorithm.
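A minimal sketch of the CORDIC idea (floating point for readability; real implementations use fixed-point shifts and a small table of atan(2^-i) constants):

```d
import std.math : atan, abs, sin, cos;
import std.stdio : writefln;

// Sketch of CORDIC: rotate the vector (x, y) toward angle theta by a fixed
// schedule of micro-rotations atan(2^-i). Each step needs only shifts and
// adds in a fixed-point implementation; doubles are used here for clarity.
// Converges for |theta| <~ 1.74 rad (the sum of all atan(2^-i)).
void cordicSinCos(double theta, out double s, out double c, int n = 40)
{
    double x = 0.6072529350088813;    // 1/gain, K = prod cos(atan(2^-i))
    double y = 0.0, z = theta, pow2 = 1.0;
    foreach (i; 0 .. n)
    {
        auto d = (z >= 0) ? 1.0 : -1.0;   // rotate toward z == 0
        auto nx = x - d * y * pow2;       // the "shift-and-add" rotation
        y = y + d * x * pow2;
        x = nx;
        z -= d * atan(pow2);              // in hardware: a tiny atan LUT
        pow2 *= 0.5;
    }
    c = x;   // ~= cos(theta)
    s = y;   // ~= sin(theta)
}

void main()
{
    double s, c;
    cordicSinCos(1.0, s, c);
    writefln("%.8f %.8f", s, c);   // close to sin(1) and cos(1)
}
```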
May 23, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?

Not when I tried it with one of the online compilers: the LUT is 3-4 times faster on your tight inner loop, though I guess it depends on the CPU.

A LUT should be very fast in a long-running tight inner loop like that once the cache is warm, with such a small LUT that fits in cache.

