By ref and by pointer kills performance.

February 13

Posted by claptrap

I was refactoring some code, changed a parameter from by value to by pointer, and saw performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or by pointer seems to make the compilers (it affects both DMD and LDC) treat it as if it's volatile and must be reloaded from memory on every use. This also inhibits LDC's auto-vectorization of the code.

https://d.godbolt.org/z/oonq1drd9

void fillBP(uint* value, uint* dest)
{
    dest[0] = *value;
    dest[1] = *value;
    dest[2] = *value;
    dest[3] = *value;
}

codegen DMD -->

            push    RBP
            mov     RBP,RSP
            mov     ECX,[RSI]
            mov     [RDI],ECX
            mov     EDX,[RSI]
            mov     4[RDI],EDX
            mov     R8D,[RSI]
            mov     8[RDI],R8D
            mov     R9D,[RSI]
            mov     0Ch[RDI],R9D
            pop     RBP
            ret

codegen LDC -->

    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 4], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 8], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 12], eax
    ret

void fillBV(uint value, uint* dest)
{
    dest[0] = value;
    dest[1] = value;
    dest[2] = value;
    dest[3] = value;
}

codegen DMD -->

            push    RBP
            mov     RBP,RSP
            mov     [RDI],ESI
            mov     4[RDI],ESI
            mov     8[RDI],ESI
            mov     0Ch[RDI],ESI
            pop     RBP
            ret

codegen LDC -->

    movd    xmm0, edi
    pshufd  xmm0, xmm0, 0
    movdqu  xmmword ptr [rsi], xmm0
    ret

Interestingly if you do this...

void fillBP(uint* value, uint* dest)
{
    uint tmp = *value;
    dest[0] = tmp;
    dest[1] = tmp;
    dest[2] = tmp;
    dest[3] = tmp;
}

You get code identical to the by-value versions (except for the single load from memory).

I'm not a compiler guy so maybe there's some rationale for this that I don't know but it seems like the compiler should be able to read "*value" once and cache it.
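
The post mentions passing by reference as well, but only shows the pointer form. A minimal sketch of the `ref` equivalent and the same local-copy workaround follows; `fillRef` and `fillRefCached` are hypothetical names used only for illustration, and whether a given compiler actually reloads in the first function would need to be checked in the generated code.

```d
// Sketch: the by-ref counterpart of fillBP, plus the local-copy workaround.
// fillRef and fillRefCached are illustrative names, not from the original post.
void fillRef(ref const uint value, uint* dest)
{
    // `value` might refer to memory inside dest[0 .. 4],
    // so the compiler may reload it before every store.
    dest[0] = value;
    dest[1] = value;
    dest[2] = value;
    dest[3] = value;
}

void fillRefCached(ref const uint value, uint* dest)
{
    const uint tmp = value; // read once; later stores cannot change tmp
    dest[0] = tmp;
    dest[1] = tmp;
    dest[2] = tmp;
    dest[3] = tmp;
}

void main()
{
    uint v = 7;
    uint[4] a;
    uint[4] b;
    uint[4] expected = [7, 7, 7, 7];

    fillRef(v, a.ptr);
    fillRefCached(v, b.ptr);

    // With non-overlapping arguments both versions produce the same result.
    assert(a == expected);
    assert(b == expected);
}
```

Copying into a local is a promise made by the programmer: it is only equivalent to the original when the source and destination do not overlap.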

February 13

Re: By ref and by pointer kills performance.

Posted by Richard (Rikki) Andrew Cattermole
in reply to claptrap

dmd having bad codegen here isn't a surprise, that is to be expected.

Now for ldc:

```d
void fillBP(immutable(uint*) value, uint* dest) {
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}
```

I expected that to not do the extra loads, but it did.

```d
void fillBP(immutable(uint*) value, uint* dest) {
	dest[0 .. 4][] = *value;
}
```

And that certainly should not be doing it either.
Even if it wasn't immutable.

For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.

February 13

Re: By ref and by pointer kills performance.

Posted by Bruce Carneal
in reply to claptrap

On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:
> https://d.godbolt.org/z/oonq1drd9
>
> ...
>
> I'm not a compiler guy so maybe there's some rationale for this that I don't know but it seems like the compiler should be able to read "*value" once and cache it.

To reuse the value, the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, and neither does ldc once you take responsibility for non-overlap with the @restrict attribute, as seen here:

https://d.godbolt.org/z/z9vYndWqP

When loops over potentially overlapping indexed arrays are involved, I've seen ldc go through the proof and emit two versions of the code, selected by a branch.
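
For readers who cannot open the link, the @restrict approach can be sketched roughly as below. This assumes LDC's `ldc.attributes` module provides `restrict`; the `else` branch is a hypothetical no-op placeholder so the file still compiles with other compilers, and `fillRestricted` is an illustrative name. It is not a copy of the linked code.

```d
// Sketch only: @restrict is LDC-specific. The fallback declaration below is a
// hypothetical placeholder attribute with no effect, for non-LDC compilers.
version (LDC)
    import ldc.attributes : restrict;
else
    enum restrict;

// The caller promises that `value` does not overlap the memory written via `dest`.
void fillRestricted(@restrict uint* value, uint* dest)
{
    dest[0] = *value;
    dest[1] = *value;
    dest[2] = *value;
    dest[3] = *value;
}

void main()
{
    uint v = 42;
    uint[4] buf;
    uint[4] expected = [42, 42, 42, 42];

    // v and buf are distinct objects, so the non-overlap promise holds here.
    fillRestricted(&v, buf.ptr);
    assert(buf == expected);
}
```

The attribute only shifts responsibility: if the arguments do overlap, the compiler is allowed to assume otherwise and the result is no longer reliable.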

February 13

Re: By ref and by pointer kills performance.

Posted by Johan
in reply to Richard (Rikki) Andrew Cattermole

On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:
> dmd having bad codegen here isn't a surprise, that is to be expected.
>
> Now for ldc:
>
> void fillBP(immutable(uint*) value, uint* dest) {
>      dest[0] = *value;
>      dest[1] = *value;
>      dest[2] = *value;
>      dest[3] = *value;
> }
>
> I expected that to not do the extra loads, but it did.

I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that immutable(uint*) points to mutable (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (__monitor), combined with casting an immutable class to an array of immutable bytes.

Side effects between the immutable(uint*) lookups could run into a synchronization event on the immutable data (i.e. mutating it).
In the case of fillBP there are no side effects possible between the reads, so it appears that the optimization could indeed be done. But a different thread might write to the data, and I don't know how that data race is then defined...
For the general case, side effects are possible (e.g. a function call), so it is not possible to simply assume that immutable reference arguments never alias other reference arguments; this complicates implementing the desired optimization.
I'm not saying it is impossible, it's just extra effort (and proof is needed).

-Johan

February 13

Re: By ref and by pointer kills performance.

Posted by Johan
in reply to Johan

On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
> On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:
>> dmd having bad codegen here isn't a surprise, that is to be expected.
>>
>> Now for ldc:
>>
>> void fillBP(immutable(uint*) value, uint* dest) {
>>      dest[0] = *value;
>>      dest[1] = *value;
>>      dest[2] = *value;
>>      dest[3] = *value;
>> }
>>
>> I expected that to not do the extra loads, but it did.
>
> I hope someone can find the link to some DConf talk (me or Andrei)

Found it: https://youtu.be/-0jcE9B5kjs?list=PL9a7lgBtSQb-YCVj96v5vn1tEXPjKOPuB&t=403

February 13

Re: By ref and by pointer kills performance.

Posted by Patrick Schluter
in reply to claptrap

On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:
> https://d.godbolt.org/z/oonq1drd9
>
> void fillBP(uint* value, uint* dest)
> {
>     dest[0] = *value;
>     dest[1] = *value;
>     dest[2] = *value;
>     dest[3] = *value;
> }
>
> codegen DMD -->
>
>     push    RBP
>     mov     RBP,RSP
>     mov     ECX,[RSI]
>     mov     [RDI],ECX
>     mov     EDX,[RSI]
>     mov     4[RDI],EDX
>     mov     R8D,[RSI]
>     mov     8[RDI],R8D
>     mov     R9D,[RSI]
>     mov     0Ch[RDI],R9D
>     pop     RBP
>     ret
>
> codegen LDC -->
>
>     mov     eax, dword ptr [rdi]
>     mov     dword ptr [rsi], eax
>     mov     eax, dword ptr [rdi]
>     mov     dword ptr [rsi + 4], eax
>     mov     eax, dword ptr [rdi]
>     mov     dword ptr [rsi + 8], eax
>     mov     eax, dword ptr [rdi]
>     mov     dword ptr [rsi + 12], eax
>     ret

Yes, that's normal. The compiler cannot know from the declaration alone whether your pointers overlap. In C you can declare the pointers with restrict, which tells the compiler that the pointers don't overlap. I don't know why D doesn't support restrict.

February 13

Re: By ref and by pointer kills performance.

Posted by Patrick Schluter
in reply to Richard (Rikki) Andrew Cattermole

On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:
> dmd having bad codegen here isn't a surprise, that is to be expected.
>
> Now for ldc:
>
> ```d
> void fillBP(immutable(uint*) value, uint* dest) {
>      dest[0] = *value;
>      dest[1] = *value;
>      dest[2] = *value;
>      dest[3] = *value;
> }
> ```
>
> I expected that to not do the extra loads, but it did.
>
> ```d
> void fillBP(immutable(uint*) value, uint* dest) {
> 	dest[0 .. 4][] = *value;
> }
> ```
>
> And that certainly should not be doing it either.
> Even if it wasn't immutable.
>
> For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.

It's not a thread issue. The memory the pointers point to only needs to overlap, and then the loads are required to get the "right" result.
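
A single-threaded sketch of that overlap effect is below. In fillBP itself, an exactly aliased store just writes back the value that was read, so the sketch uses a hypothetical variant that stores `*value + 1`; `bumpReload` and `bumpCached` are illustrative names, not code from the thread.

```d
// Sketch: reloading versus caching gives different results once the
// pointers overlap. No threads are involved.
void bumpReload(uint* value, uint* dest)
{
    dest[0] = *value + 1; // may overwrite *value if dest[0] aliases it
    dest[1] = *value + 1; // reads the possibly updated value
}

void bumpCached(uint* value, uint* dest)
{
    const uint tmp = *value; // read once, before any store
    dest[0] = tmp + 1;
    dest[1] = tmp + 1;
}

void main()
{
    uint[2] a = [10, 0];
    uint[2] b = [10, 0];

    // In both calls `value` points at element 0 of the destination.
    bumpReload(&a[0], a.ptr);
    bumpCached(&b[0], b.ptr);

    uint[2] expectedReload = [11, 12]; // second store saw the updated 11
    uint[2] expectedCached = [11, 11]; // both stores used the original 10

    assert(a == expectedReload);
    assert(b == expectedCached);
}
```

Because the two functions disagree when the arguments overlap, a compiler cannot turn the first into the second unless it can rule the overlap out.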

February 13

Re: By ref and by pointer kills performance.

Posted by ryuukk_
in reply to Johan

Wow, OOP people ruining performance for everybody, who would have thought..

That makes me worried now.. -betterC doesn't seem to change anything about optimization, or rather, lack of optimization..

February 13

Re: By ref and by pointer kills performance.

Posted by jmh530
in reply to Bruce Carneal

On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:
> [snip]
>
> https://d.godbolt.org/z/z9vYndWqP
>
> When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.

As a heads up, the LDC wiki page doesn't have restrict on it.

https://wiki.dlang.org/LDC-specific_language_changes

Does LDC's @restrict only work with pointers directly and not slices? fillRestricted2 doesn't compile (it only fails because value is a slice, not because dest is one), but fillRestricted3 compiles just fine.

void fillRestricted2(@restrict uint[] value, uint[] dest)
{
    dest[0] = value[0];
    dest[1] = value[1];
    dest[2] = value[2];
    dest[3] = value[3];
}

void fillRestricted3(@restrict uint* value, uint[] dest)
{
    dest[0] = value[0];
    dest[1] = value[1];
    dest[2] = value[2];
    dest[3] = value[3];
}

February 13

Re: By ref and by pointer kills performance.

Posted by jmh530
in reply to jmh530

On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:
> On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:
>> [...]
>
> As a heads up, the LDC wiki page doesn't have restrict on it.
>
> https://wiki.dlang.org/LDC-specific_language_changes
>
> void fillRestricted2(@restrict uint[] value, uint[] dest)
> {
>     dest[0] = value[0];
>     dest[1] = value[1];
>     dest[2] = value[2];
>     dest[3] = value[3];
> }
>
> void fillRestricted3(@restrict uint* value, uint[] dest)
> {
>     dest[0] = value[0];
>     dest[1] = value[1];
>     dest[2] = value[2];
>     dest[3] = value[3];
> }

Sorry, I should note that while fillRestricted3 compiles, it has codegen similar to fillBP.
