May 23, 2019
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
> 	xxx = 0;
> 	sw.reset();
> 	sw.start();
> 	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
> 	t = sw.peek().msecs;
> 	writeln(t);
> 	sw.stop();
> }

What you are doing wrong is that xxx is never used... So DMD removes it altogether?

If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense?
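
Something like this (just a sketch) is usually enough to keep the optimizer honest:

    import std.stdio : writeln;
    import std.math : sin, PI;

    double xxx = 0;
    for (double i = 0; i < 10000000; i++)
        xxx += sin(PI * i);
    writeln(xxx); // the result is observed, so the loop is not dead code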

Ola.



May 23, 2019
On Thursday, 23 May 2019 at 15:20:22 UTC, Timon Gehr wrote:
> On 23.05.19 12:21, Alex wrote:
>> On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
>>> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>>>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>>
>>> There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access.
>>>
>>> It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
>> 
>> Surely not? I'm not sure what method is used to calculate them and maybe a table method is used internally for the common functions (maybe the periodic ones) but memory access surely is faster than multiplying doubles?
>> ...
>
> Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.)
>
> I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance.
>

I realize a lot of optimizations could be made, but I still find it hard to believe that it could be done faster, even with special hardware, unless the hardware already uses a LUT internally or some highly optimized algorithm.

The whole point of general-purpose computing was to get away from specialized hardware: it was cheaper, and the resources could be spent on improving raw cycle performance, which would benefit the entire system.


But I guess, with the complexity of hardware nowadays, it's hard to really figure out what is going on.

http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

Maybe when it is an "intrinsic" it can avoid some of the costs that a LUT would otherwise incur (such as a context switch).
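
(Rough numbers behind the 25x figure quoted above, assuming a typical desktop core rather than anything measured here: a main-memory access is commonly quoted at roughly 100-300 cycles, while a dependent double multiply has about 4-5 cycles of latency, and an L1 load hit costs about the same. That puts the multiply-versus-DRAM gap anywhere from ~20x to ~75x depending on the machine, which is why a LUT is only competitive while it stays hot in cache.)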




>> And most of the time these functions are computed by some series that requires many terms. I'd expect, say, to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos and maybe that is eating up the diff).
>> 
>> Do you have any proof of your claims? Like a paper that discusses such things so I can see what's really going on and how they achieve such performance(and how accurate)?
>
> Not exactly what you asked, but this might help:
> https://www.agner.org/optimize
>
> Also, look up the CORDIC algorithm.

I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation, while CORDIC needs iterations, at least 10 to be of any use. LUTs are precision-independent, assuming the creation cost is not included.
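
(For a concrete sense of the multiply count: a truncated series evaluated with Horner's rule is cheaper than the "at least 10 multiplies" guessed above. A sketch with plain Taylor coefficients; a real libm uses minimax coefficients, more terms, and proper range reduction:)

    // valid only for |x| <= PI/2; roughly 1e-4 absolute error at the edges
    double sinPoly(double x)
    {
        double x2 = x * x;
        // Horner's rule: 5 multiplies in total
        return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120 + x2 * (-1.0/5040))));
    }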

I couldn't find anything in that link that specifically addressed speeding up typical math functions.

It would be nice to know exactly what is going on, which functions are optimized, and the typical speedup over a LUT.

I have some code that uses sin, exp and a few other primitive algebraic functions, and in one case the code is extremely slow (it uses exp). I haven't done much testing of all this, but something just seems off somewhere. Either sin is heavily optimized and exp is not (exp too could have a CORDIC-like algorithm, since exp(a+b) = exp(a)*exp(b)), or something else is happening that I'm not aware of... which is why I went to tabularizing these functions in the first place.
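
(That identity is essentially how exp is range-reduced in practice: write x = k*ln2 + r with |r| <= ln2/2, so exp(x) = 2^k * exp(r) and only the small interval needs a series. A sketch, deliberately short, so the accuracy is only around 1e-5:)

    import std.math : LN2, ldexp, round;

    double expSketch(double x)
    {
        double k = round(x / LN2);   // exp(x) = 2^k * exp(r), |r| <= ln2/2
        double r = x - k * LN2;
        // short Taylor series for exp(r) on the reduced interval
        double p = 1 + r * (1 + r * (0.5 + r * (1.0/6 + r * (1.0/24))));
        return ldexp(p, cast(int) k);
    }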

Also, I don't know what kind of accuracy they have, which doesn't help much, but since a table is pre-computed I could use a more accurate algorithm even if it is slower.

That site I linked does talk about some of the issues... I guess I just need to update my thinking on modern CPUs a little.


I guess there is no tool that can tell one exactly what is happening to a piece of code in the CPU... basically an instruction-level profiler?
May 23, 2019
On Thursday, 23 May 2019 at 19:17:40 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> 	xxx = 0;
>> 	sw.reset();
>> 	sw.start();
>> 	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
>> 	t = sw.peek().msecs;
>> 	writeln(t);
>> 	sw.stop();
>> }
>
> What you are doing wrong is that xxx is never used... So DMD removes it altogether?
>
> If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense?
>
> Ola.

Maybe, though I thought I put the writeln(xxx) in there to do just that... yeah, looking at my code it's there, so that is not the problem.

This is the code I had:

    import std.datetime;        // old API: StopWatch lived here in 2019
    import std.stdio : writeln;
    import std.math : sin, PI;

    double xxx = 0;
    StopWatch sw;
    sw.start();
    for (double i = 0; i < 10000000; i++) xxx += sinTab(i); // sinTab defined elsewhere
    auto t = sw.peek().msecs;
    writeln(t);
    sw.stop();
    writeln(xxx);

    xxx = 0;
    sw.reset();
    sw.start();
    for (double i = 0; i < 10000000; i++) xxx += sin(PI*i);
    t = sw.peek().msecs;
    writeln(t);
    sw.stop();
    writeln(xxx);
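
(Aside: StopWatch with .peek().msecs is the old std.datetime API, deprecated for a while now; with the current std.datetime.stopwatch module the peek lines would read

    auto t = sw.peek.total!"msecs";

but that doesn't change the measurement.)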

Either I modified it later or deleted one of those lines or the editor screwed up when I pasted it... I don't recall touching the code afterwards.

Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster... so it is not being optimized out.

May 23, 2019
On Thursday, 23 May 2019 at 18:57:03 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>
> Not when I tried it with one of the online compilers: the LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU.
>
> LUT should be very fast in long-running tight inner loops like that once the cache is warm, with such a small LUT that fits in working memory (cache).

I've used very small LUTs, like a length of 5, and it didn't significantly change anything.

Unless it is being hit hard at the start and that is skewing the results, it doesn't seem to be the issue.

I haven't tested this well, but I was just thrown off by the results, as they should easily have been inverted: I expected quite a significant speedup (several factors), not the reverse.
May 23, 2019
On Thursday, 23 May 2019 at 21:30:47 UTC, Alex wrote:
> I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation. The CORDIC needs iteration, at least 10 to be of any use. LUT's are precision independent assuming the creation cost is not included.

It isn't; IEEE sin is typically not fast. ARM CPUs let you run fewer iterations for nonstandard floats, though.

> I have some code that uses sin, exp and a few other primitive algebraic functions and in one case the code is extremely slow(it uses exp).  I haven't done much testing of all this but something just seems off somewhere.

Don't know about exp, but some operations are slow when you get too close to zero, so-called denormal numbers.
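
(Quick illustration of what those are; a sketch:)

    import std.stdio : writeln;
    import std.math : isSubnormal;

    void main()
    {
        double tiny = double.min_normal / 4; // below the smallest normal double
        writeln(isSubnormal(tiny));          // true: a denormal (subnormal)
        // on many x86 CPUs, arithmetic on subnormals triggers a microcode
        // assist and can be an order of magnitude slower than normal math
        writeln(tiny * 0.5);
    }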

> I guess there is no tool that can tell one exactly what is happening to a piece of code in the cpu... basically an instruction level profiler?

VTune from Intel, not free... afaik.


May 23, 2019
On Thursday, 23 May 2019 at 21:50:38 UTC, Alex wrote:
> I've used very small LUT's like a length of 5 and it didn't significantly change anything.

Use a size that is 2^n, then mask the index and hopefully that will turn off bounds checks.

E.g. if the LUT size is 16, index it with "i & 15".
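
Something like this (sketch):

    double lookup(ref const double[16] lut, size_t i)
    {
        // 16 is a power of two, so i & 15 is provably in [0, 16):
        // the compiler can elide the bounds check entirely
        return lut[i & 15];
    }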

> I haven't tested this well but was just thrown off by the results as it should easily have been inverted and I expected quite a significant speed up(several factors) and not the reverse.

Well, you could take the time, multiply it by the clock frequency, divide by the number of iterations, and get the number of cycles per iteration. If it is more than a dozen, then something is wrong.
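
(Worked through with the numbers that show up later in this thread: 709 ms for 10^7 iterations is 70.9 ns per iteration, which at an assumed ~3 GHz clock is about 210 cycles per iteration. By this rule of thumb that is far too many, hinting at per-iteration overhead left somewhere, though a shared virtualized machine like run.dlang.io makes the clock assumption rough.)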


May 24, 2019
On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote:
> On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
>> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>>> I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!?
>>>
>>> [...]
>>
>> Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method == "Linear") { /*...*/ } else { /*...*/ }
>>
>> Also take care with the types used. With DMD the implicit coercion of float and double can lead to extra conversions.
>>
>> You'll directly see a 15% gain after refactoring the switch.
>
> Surely not?!?! Surely the compiler can optimize that switch since the value passed is known at compile time? I thought the whole point of not having a static switch (analogous to static if) was that it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table.

Try it yourself, but to be clear, note that I don't like your attitude, which I find disrespectful. I'm just here to help: I explained the big problem in your code, and you start discussing something that's not to be discussed AT ALL. Look at https://d.godbolt.org/z/vtzVdp; in sinTab you'll be able to see

  call    pure nothrow @nogc @safe int object.__switch!(immutable(char), "Linear", "Constant").__switch(scope const(immutable(char)[]))@PLT

and this even with LDC2 -O3. That's why your LUT is so slow. Refactor the switch with "static if"; the branch will be eliminated and you'll see the perf improvement.
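
A sketch of the shape of the fix (the table size, the phase mapping, and the available Method values here are guesses reconstructed from the thread, not the original sinTab):

    import std.math : sin, PI;

    enum N = 1024; // power-of-two table size covering one full period
    static immutable double[N] table = () {
        double[N] t;
        foreach (i; 0 .. N)
            t[i] = sin(2 * PI * i / N);
        return t;
    }();

    double sinTab(string Method)(double x) // assumes x >= 0
    {
        double pos = x * (N / (2 * PI)); // map one period onto [0, N)
        size_t ai = cast(size_t) pos & (N - 1);
        static if (Method == "Linear")
        {
            double f = pos - cast(long) pos; // fractional part for interpolation
            return (1 - f) * table[ai] + f * table[(ai + 1) & (N - 1)];
        }
        else static if (Method == "Constant")
            return table[ai];
        else
            static assert(0, "unknown Method: " ~ Method);
    }

With the static if, the string comparison happens while the template is being compiled, so no object.__switch call is left in the generated code.
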
May 24, 2019
On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
> Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster... so it is not being optimized out.

Well, when I run this modified version:

https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c

on https://run.dlang.io/

then I get:

LUT:    709
sin(x): 2761

So the LUT is 3-4 times faster even with your quarter-LUT overhead.



May 24, 2019
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
> So the LUT is 3-4 times faster even with your quarter-LUT overhead.

4-5 times faster actually, since I made the LUT size known at compile time.
May 24, 2019
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
>
> https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c

I made an error here:

"return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai+1)&511]);"

Should of course be:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai&511)+1]);

However, that does not impact the performance.
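
(For anyone reading along: the corrected index implies a quarter-wave table with one extra guard entry, so that the +1 access stays in bounds. A sketch of that scheme, reconstructed from this line rather than copied from the gist:)

    import std.math : sin, PI;

    enum Q = 512; // entries per quarter period
    // Q + 1 entries: the guard entry makes the [ai + 1] access safe
    static immutable double[Q + 1] QuarterSinTab = () {
        double[Q + 1] t;
        foreach (i; 0 .. Q + 1)
            t[i] = sin(PI * i / (2.0 * Q));
        return t;
    }();

    double sinQuarter(double x)
    {
        double u = (x * (4 * Q / (2 * PI))) % (4 * Q); // phase in table steps
        if (u < 0) u += 4 * Q;                   // fold negatives into [0, 4Q)
        double s = 1;
        if (u >= 2 * Q) { s = -1; u -= 2 * Q; }  // sin(x + pi) == -sin(x)
        if (u > Q) u = 2 * Q - u;                // sin(pi - x) == sin(x)
        size_t ai = cast(size_t) u;
        double f = u - ai;
        if (ai == Q) { ai = Q - 1; f = 1; }      // keep ai + 1 in bounds at the peak
        return s * ((1 - f) * QuarterSinTab[ai] + f * QuarterSinTab[ai + 1]);
    }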