Efficient idiom for fastest code

May 23, 2018
Many times in expensive loops one must make decisions. Each decision must be evaluated, and the evaluation costs cycles.

for(int i = 0; i < N; i++)
{
    if(decision(i)) A; else B;
}

The if statement pays its cycle cost N times.

In some cases the decision holds over contiguous ranges: for some 0 <= n <= N the decision is constant, but n is arbitrary (determined by factors unknown at compile time).

One can speed up the routine with something akin to a simplified strategy pattern, using functions/delegates/lambdas to code a faster version without the test:


for(int i = 0; i < N; i++)
{
    d = (){ if(decision(i)) A; else d = () { B; }; };
    d();
}

This code basically reduces to

for(int i = 0; i < N; i++)
{
    B;
}

In this particular case we assume that once the decision fails, it always fails.

Therefore, once the decision fails, the "faster" code kicks in. Suppose decision is very slow.

The decision cost is then paid n times rather than N times.
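
For concreteness, here is a minimal compilable sketch of the idea; decision, A, and B are just placeholders (counters stand in for the real work):

import std.stdio;

bool decision(int i) { return i < 10; } // holds for a contiguous prefix

void main()
{
    enum N = 100;
    int a, b; // stand-ins for the work done by A and B
    void delegate(int) d;
    d = (int i) {
        if (decision(i))
            ++a;                  // A
        else
        {
            d = (int i) { ++b; }; // swap in the test-free fast path
            ++b;                  // B, for this iteration too
        }
    };
    for (int i = 0; i < N; i++)
        d(i);
    writeln(a, " ", b); // 10 90: decision was evaluated only n+1 times
}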

In general, such techniques can be used to speed up code, sometimes significantly, but they are difficult to implement with a standard pattern, and harder still for more complex decision functions.


Does anyone see any way to use some of D's power to help simplify such things, without using something like a strategy pattern or complicating the overall design? (Using advanced techniques such as templates and mixins does not count as making the code more complex.)

May 23, 2018
On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions wrote:
> [...]
>
> for(int i = 0; i < N; i++)
> {
>     d = (){ if(decision(i)) A; else d = () { B; }; };
>     d();
> }
>

assuming you meant
> auto d = (int i){ if(decision(i)) A; else d = (int i) { B; };};
> for(int i = 0; i < N; i++)
> {
>     d(i);
> }
>

> [...]
>
> Does anyone see any way to use some of D's power to help simplify such things, without using something like a strategy pattern or complicating the overall design? (Using advanced techniques such as templates and mixins does not count as making the code more complex.)

If decision is pure, consider using static foreach (iota(N).map!decision) { ... } to unroll it at compile time. Even if it isn't pure, the compiler may still be able to factor out parts of repeated calls to decision; look at the assembly/IR to confirm this.
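
For example, one concrete way to spell that unrolling (assuming N is a compile-time constant and decision is CTFE-able):

>import std.range : iota;
>import std.stdio : writeln;
>
>enum N = 8;
>bool decision(int i) pure { return i < 3; }
>
>void main()
>{
>    // The body is stamped out N times at compile time; each branch is
>    // resolved by static if, so no test remains at run time.
>    static foreach (i; iota(N))
>    {
>        static if (decision(i))
>            writeln("A: ", i);
>        else
>            writeln("B: ", i);
>    }
>}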

Otherwise PROFILE!!!!! (and use profile-guided optimisation! This implies using a compiler other than DMD, which you should be doing anyway if you care about performance.)

Blindly trying to optimise is just as likely to hurt performance.
In particular, don't underestimate the branch predictor: even if decision is complex, if there is a pattern in evaluating iota(N).map!decision, the branch predictor will pick up on it.

In terms of exploiting a priori knowledge of decision that the compiler somehow lacks, you really don't have much option but to do it yourself.


May 23, 2018
On Wednesday, 23 May 2018 at 03:00:17 UTC, Nicholas Wilson wrote:
> On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions wrote:
>> [...]
>
> If decision is pure, consider using static foreach (iota(N).map!decision) { ... } to unroll it at compile time. Even if it isn't pure, the compiler may still be able to factor out parts of repeated calls to decision; look at the assembly/IR to confirm this.
>
> Otherwise PROFILE!!!!! (and use profile-guided optimisation! This implies using a compiler other than DMD, which you should be doing anyway if you care about performance.)
>
> Blindly trying to optimise is just as likely to hurt performance.
> In particular, don't underestimate the branch predictor: even if decision is complex, if there is a pattern in evaluating iota(N).map!decision, the branch predictor will pick up on it.
>
> In terms of exploiting a priori knowledge of decision that the compiler somehow lacks, you really don't have much option but to do it yourself.

I knew someone was going to say that and I forgot to say DON'T!

Saying to profile when I clearly said these ARE cases where they are slow is just moronic. Please don't use default answers to arguments.

This was a general question about how to attack a problem WHEN profiling says I need to optimize. Your SO 101 answer sucks! Sorry!

To prove to you that your answer is invalid:

I profile my code; it says that it is very slow and shows that it is due to the decision checking... I then have to come here and write up a post trying to explain how to solve the problem. I then get a post telling me I should profile. I then respond that I did profile and that this is my problem. A lot of wasted energy, when it is better to know a general attack strategy. Yes, some of us can judge whether code needs to be optimized before profiling. It is not difficult. Giving a generic answer that does not always apply and is obvious to anyone trying to do optimization is not helpful. Everyone today pretty much does not even optimize code anymore... this isn't 1979. It's not ok to keep repeating the same mantra. I guess we should turn this into a meme?

The reason I'm getting on to you is that "profile before optimization" sounds a bit grade school, especially since I wasn't talking about profiling at all but about a general programming pattern to speed up code, which is always valid but not always useful (and hence that is when profiling comes in).

May 23, 2018
On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions wrote:
> In some cases the decision holds over contiguous ranges: for some 0 <= n <= N the decision is constant, but n is arbitrary (determined by factors unknown at compile time).
>
> One can speed up the routine with something akin to a simplified strategy pattern, using functions/delegates/lambdas to code a faster version without the test:
>
>
> for(int i = 0; i < N; i++)
> {
>     d = (){ if(decision(i)) A; else d = () { B; }; };
>     d();
> }
>
> This code basically reduces to
>
> for(int i = 0; i < N; i++)
> {
>     B;
> }
>
> In this particular case we assume that once the decision fails, it always fails.
>
> Therefore, once the decision fails, the "faster" code kicks in. Suppose decision is very slow.

I would just do
>int i=0;
>for(; i < N && decision(i); i++)
>{
>    A;
>}
>for(;i < N; i++)
>{
>    B;
>}

This could be turned into a mixin template with something like this:
>mixin template forSplit(alias condition, alias A, alias B)
>{
>    void execute()
>    {
>        int i = 0;
>        for (; i < N && condition(i); i++)
>        {
>            A();
>        }
>        for (; i < N; i++)
>        {
>            B();
>        }
>    }
>}

and to use it in code (assuming N is defined in the scope)
>mixin forSplit!((int i)=>(decision(i)), {A;}, {B;}) loop;
>loop.execute();

I haven't measured anything, but I would assume that delegates come with an overhead that you just don't need here. In fact, when trying to use
> auto d = (int i) {};
> d = (int i){ if(decision(i)) A; else d = (int i) { B; };};
> for(int i = 0; i < N; i++)
> {
>     d(i);
> }
All I got was "cannot access frame of function D main", which sums up my experience with lambdas in D so far.

While PGO and the branch predictor are good, they don't help much here. Not executing an expression is out of scope for them; all they do is prevent pipeline flushes.
Also, I think you overestimate what the compiler can do. The decision function I used for testing was this:
>bool decision(int a) pure
>out (result) {
>    assert(result == (a < 10));
>} do {
>    import std.algorithm, std.range;
> 
>    // stolen from https://dlang.org/library/std/parallelism.html
>    enum n = 1_000_000;
>    enum delta = 1.0 / n;
>    alias getTerm = (int i)
>    {
>        immutable x = ( i - 0.5 ) * delta;
>        return delta / ( 1.0 + x * x ) ;
>    };
>    immutable pi = 4.0 * reduce!"a + b"(n.iota.map!getTerm);
> 
>    return a < 3*pi;
>}
With N=100 I got a speedup of ~10 (ldc -O3 -release), even though this function is pure and could be optimized a lot: it calculated pi for every single call. And optimizing the decision function isn't even the point of the question.
May 23, 2018
On Wednesday, 23 May 2018 at 10:55:02 UTC, Malte wrote:
> On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions wrote:
>> [...]
>
> I would just do
>>[...]
>
> [...]


Thanks, I didn't think about using a for loop like that. While it is not the most general solution, it does solve the specific case of a simple step/toggle decision.


May 24, 2018
On Wednesday, 23 May 2018 at 03:12:52 UTC, IntegratedDimensions wrote:
> I knew someone was going to say that and I forgot to say DON'T!
>
> [...]

I'm going to ignore the tone of your response, but I am going to say that responding like that isn't going to get you very far. Don't expect others to do likewise.

Assuming that your decision function is indeed the bottleneck, you'll see I did actually provide some hints as to how to optimise the case where decision is pure. Even if you can't convince the compiler to inline and combine expressions, as in the case of the other answer, you can memoize it (look in std.functional).
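
For example, a sketch with memoize (this pays off when decision is called repeatedly with the same arguments, e.g. when the loop itself runs many times; decision here is a placeholder):

>import std.functional : memoize;
>
>bool decision(int i) pure { /* expensive computation elided */ return i < 10; }
>alias cachedDecision = memoize!decision; // caches the result per argument
>
>void main()
>{
>    enum N = 100;
>    foreach (i; 0 .. N)
>    {
>        if (cachedDecision(i)) { /* A */ } else { /* B */ }
>    }
>}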

One of the great things about D is that you can force lots of computation to happen at compile time. So in the case where decision is impure, factoring it into pure and impure parts and `enum x = pureFunc(args);`-ing the part that can be computed at compile time can make a large difference if you can't convince the optimiser to do it for you.
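
For instance, in the decision function from the other answer, the pi computation is the pure, constant part and can be hoisted into an enum. A sketch (n is reduced here because CTFE of a million iterations is slow):

>import std.algorithm : map, reduce;
>import std.range : iota;
>
>double computePi() pure
>{
>    enum n = 10_000; // reduced from 1_000_000: CTFE is slow on big loops
>    enum delta = 1.0 / n;
>    alias getTerm = (int i)
>    {
>        immutable x = (i - 0.5) * delta;
>        return delta / (1.0 + x * x);
>    };
>    return 4.0 * reduce!"a + b"(n.iota.map!getTerm);
>}
>
>bool decision(int a)
>{
>    enum pi = computePi(); // forced through CTFE: computed at compile time
>    return a < 3 * pi;     // only this trivial comparison remains at run time
>}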

If you still can't do that, consider JITing it (https://github.com/ldc-developers/druntime/blob/e3bfc5fb780967f1b6807039ff00b2ccaf4b03d9/src/ldc/attributes.d#L78 ) with `-enable-dynamic-compile`, or running the loop in parallel with std.parallelism.
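
And a sketch of the parallel variant (valid only if the loop iterations, including A and B, are independent of each other; decision is a placeholder):

>import std.parallelism : parallel;
>import std.range : iota;
>
>bool decision(int i) { return i < 10; } // placeholder
>
>void main()
>{
>    enum N = 100;
>    // Iterations are distributed over the task pool; each one still
>    // evaluates decision, but the calls run concurrently.
>    foreach (i; parallel(iota(N)))
>    {
>        if (decision(i)) { /* A */ } else { /* B */ }
>    }
>}
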
May 24, 2018
On Wednesday, 23 May 2018 at 03:12:52 UTC, IntegratedDimensions wrote:
> On Wednesday, 23 May 2018 at 03:00:17 UTC, Nicholas Wilson wrote:
>> [...]
>
> [...]

Very challenging. I wish I could help you out with the tough work. People don't share the same context, especially online, so it is necessary to clarify the problem so others can understand and help. I've been beaten on Stack Overflow many times for not providing sufficient information in my questions. It seems like one can do the reverse here at forum.dlang.org.

With that said, I think you know what you are doing, and you can do it. Just relax and give it more time and experimentation.