February 13

I was refactoring some code and changed a parameter from by-value to by-pointer, and saw performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or pointer seems to make the compilers (it affects both DMD and LDC) treat it as if it's volatile and must be loaded from memory on every use. This also inhibits LDC's auto-vectorization of the code.

https://d.godbolt.org/z/oonq1drd9

void fillBP(uint* value, uint* dest)
{
    dest[0] = *value;
    dest[1] = *value;
    dest[2] = *value;
    dest[3] = *value;
}

codegen DMD -->

            push    RBP
            mov     RBP,RSP
            mov     ECX,[RSI]
            mov     [RDI],ECX
            mov     EDX,[RSI]
            mov     4[RDI],EDX
            mov     R8D,[RSI]
            mov     8[RDI],R8D
            mov     R9D,[RSI]
            mov     0Ch[RDI],R9D
            pop     RBP
            ret

codegen LDC -->

    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 4], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 8], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 12], eax
    ret

void fillBV(uint value, uint* dest)
{
    dest[0] = value;
    dest[1] = value;
    dest[2] = value;
    dest[3] = value;
}

codegen DMD -->

            push    RBP
            mov     RBP,RSP
            mov     [RDI],ESI
            mov     4[RDI],ESI
            mov     8[RDI],ESI
            mov     0Ch[RDI],ESI
            pop     RBP
            ret

codegen LDC -->

    movd    xmm0, edi
    pshufd  xmm0, xmm0, 0
    movdqu  xmmword ptr [rsi], xmm0
    ret

Interestingly if you do this...

void fillBP(uint* value, uint* dest)
{
    uint tmp = *value;
    dest[0] = tmp;
    dest[1] = tmp;
    dest[2] = tmp;
    dest[3] = tmp;
}

You get code identical to the by-value versions (except for the load from memory).

I'm not a compiler guy, so maybe there's some rationale for this that I don't know, but it seems like the compiler should be able to read "*value" once and cache it.

February 13

dmd having bad codegen here isn't a surprise, that is to be expected.

Now for ldc:

```d
void fillBP(immutable(uint*) value, uint* dest) {
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}
```

I expected that to not do the extra loads, but it did.

```d
void fillBP(immutable(uint*) value, uint* dest) {
	dest[0 .. 4][] = *value;
}
```

And that certainly should not be doing it either.
Even if it wasn't immutable.

For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.

February 13

On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:

>

I was refactoring some code and changed a parameter from by-value to by-pointer, and saw performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or pointer seems to make the compilers (it affects both DMD and LDC) treat it as if it's volatile and must be loaded from memory on every use. This also inhibits LDC's auto-vectorization of the code.

https://d.godbolt.org/z/oonq1drd9

...

I'm not a compiler guy, so maybe there's some rationale for this that I don't know, but it seems like the compiler should be able to read "*value" once and cache it.

To reuse the value, the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, and neither does ldc once you take responsibility for non-overlap with the @restrict attribute, as seen here:

https://d.godbolt.org/z/z9vYndWqP
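For readers who cannot open the link, here is a minimal sketch of what such a declaration might look like. It assumes that `restrict` is importable from `ldc.attributes` and that it applies to pointer parameters. The function name `fillRestricted` and the stand-in for non-LDC compilers are made up for illustration, and the resulting assembly has not been checked here.

```d
// Assumption: LDC provides `restrict` in ldc.attributes.
// Other compilers get an inert stand-in so the file still compiles.
version (LDC)
    import ldc.attributes : restrict;
else
    struct restrict {}

// Hypothetical variant of fillBP. The caller promises that `value`
// and `dest` never point at overlapping memory.
void fillRestricted(@restrict uint* value, @restrict uint* dest)
{
    dest[0] = *value;
    dest[1] = *value;
    dest[2] = *value;
    dest[3] = *value;
}

unittest
{
    uint v = 7;
    uint[4] d;
    fillRestricted(&v, d.ptr); // v and d are distinct objects
    assert(d == [7, 7, 7, 7]);
}
```

The promise is entirely on the caller: passing overlapping pointers to a function declared this way would break the assumption the optimizer is allowed to make.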

When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.
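The "two versions with a branch" idea can also be written by hand. The following is only a sketch of the technique, not what ldc actually emits. The names `overlaps` and `fillChecked` are hypothetical.

```d
// Hypothetical helper: true if the byte ranges [a, a + aLen) and
// [b, b + bLen) share at least one byte.
bool overlaps(const(void)* a, size_t aLen, const(void)* b, size_t bLen)
{
    auto pa = cast(size_t) a;
    auto pb = cast(size_t) b;
    return pa < pb + bLen && pb < pa + aLen;
}

void fillChecked(uint* value, uint[] dest)
{
    if (!overlaps(value, uint.sizeof, dest.ptr, dest.length * uint.sizeof))
    {
        // Fast path: no overlap, so a single load is enough.
        immutable tmp = *value;
        dest[] = tmp;
    }
    else
    {
        // Slow path: reload on every iteration, like the original fillBP.
        foreach (ref d; dest)
            d = *value;
    }
}

unittest
{
    uint v = 5;
    uint[4] d;
    fillChecked(&v, d[]);      // takes the fast path
    assert(d == [5, 5, 5, 5]);

    uint[4] e = [9, 1, 2, 3];
    fillChecked(&e[0], e[]);   // takes the slow path
    assert(e == [9, 9, 9, 9]);
}
```

The runtime check costs a couple of comparisons per call, so it only pays off when the fill is long enough for the fast path to matter.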

February 13

On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:

>

dmd having bad codegen here isn't a surprise, that is to be expected.

Now for ldc:

void fillBP(immutable(uint*) value, uint* dest) {
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}

I expected that to not do the extra loads, but it did.

I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that immutable(uint*) points to mutable (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (__monitor), combined with casting an immutable class to an array of immutable bytes.

Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it).
In the case of fillBP there are no side-effects possible between the reads, so it appears that indeed the optimization could be done. But a different thread might write to the data. I don't know how that data-race is then defined...
For the general case, side-effects are possible (e.g. a function call) so it is not possible to simply assume that immutable reference arguments never alias to other reference arguments; this complicates implementing the desired optimization.
I'm not saying it is impossible, it's just extra effort (and proof is needed).

-Johan

February 13

On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:

>

On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:

>

dmd having bad codegen here isn't a surprise, that is to be expected.

Now for ldc:

void fillBP(immutable(uint*) value, uint* dest) {
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}

I expected that to not do the extra loads, but it did.

I hope someone can find the link to some DConf talk (me or Andrei)

Found it: https://youtu.be/-0jcE9B5kjs?list=PL9a7lgBtSQb-YCVj96v5vn1tEXPjKOPuB&t=403

February 13

On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:

>

I was refactoring some code and changed a parameter from by-value to by-pointer, and saw performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or pointer seems to make the compilers (it affects both DMD and LDC) treat it as if it's volatile and must be loaded from memory on every use. This also inhibits LDC's auto-vectorization of the code.

https://d.godbolt.org/z/oonq1drd9

void fillBP(uint* value, uint* dest)
{
    dest[0] = *value;
    dest[1] = *value;
    dest[2] = *value;
    dest[3] = *value;
}

codegen DMD -->

            push    RBP
            mov     RBP,RSP
            mov     ECX,[RSI]
            mov     [RDI],ECX
            mov     EDX,[RSI]
            mov     4[RDI],EDX
            mov     R8D,[RSI]
            mov     8[RDI],R8D
            mov     R9D,[RSI]
            mov     0Ch[RDI],R9D
            pop     RBP
            ret

codegen LDC -->

    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 4], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 8], eax
    mov     eax, dword ptr [rdi]
    mov     dword ptr [rsi + 12], eax
    ret

Yes, that's normal. The compiler cannot know from the declaration alone whether your pointers overlap. In C you can declare the pointers with restrict, which tells the compiler that they don't overlap. I don't know why D doesn't support restrict.
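A small sketch of why overlap matters. In the original fillBP, a `value` that points exactly at one element of `dest` just stores the same number back, so this example uses a variation with `+=` to make the difference visible. The function names are hypothetical and not from the thread.

```d
// Re-reads *value for every statement, as the compilers do for fillBP.
void accumulateReload(uint* value, uint* dest)
{
    dest[0] += *value;
    dest[1] += *value;
}

// Reads *value once, as in the `tmp` workaround.
void accumulateCached(uint* value, uint* dest)
{
    immutable tmp = *value;
    dest[0] += tmp;
    dest[1] += tmp;
}

unittest
{
    uint[2] a = [1, 1];
    uint[2] b = [1, 1];

    // value points at dest[0], so the first store changes *value.
    accumulateReload(&a[0], a.ptr);
    accumulateCached(&b[0], b.ptr);

    assert(a == [2, 3]); // second add saw the updated dest[0]
    assert(b == [2, 2]); // second add used the cached value
}
```

Because the two functions give different answers when the pointers overlap, a compiler cannot turn the first into the second unless it can rule the overlap out.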

February 13
On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:
> dmd having bad codegen here isn't a surprise, that is to be expected.
>
> Now for ldc:
>
> ```d
> void fillBP(immutable(uint*) value, uint* dest) {
>      dest[0] = *value;
>      dest[1] = *value;
>      dest[2] = *value;
>      dest[3] = *value;
> }
> ```
>
> I expected that to not do the extra loads, but it did.
>
> ```d
> void fillBP(immutable(uint*) value, uint* dest) {
> 	dest[0 .. 4][] = *value;
> }
> ```
>
> And that certainly should not be doing it either.
> Even if it wasn't immutable.
>
> For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.

It's not a thread issue. The memory the pointers point to only needs to overlap, and then the loads are required to get the "right" result.

February 13

Wow, OOP people ruining performance for everybody, who would have thought..

That makes me worried now.. -betterC doesn't seem to change anything about optimization, or rather, lack of optimization..

February 13

On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:

>

[snip]

To reuse the value the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, neither does ldc once you take responsibility for non-overlap with the @restrict attribute as seen here:

https://d.godbolt.org/z/z9vYndWqP

When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.

As a heads up, the LDC wiki page doesn't have restrict on it.

https://wiki.dlang.org/LDC-specific_language_changes

Does LDC's @restrict only work with pointers directly and not slices? fillRestricted2 doesn't compile (it fails only because value is a slice, not because dest is one), but fillRestricted3 compiles just fine.

void fillRestricted2(@restrict uint[] value, uint[] dest)
{
    dest[0] = value[0];
    dest[1] = value[1];
    dest[2] = value[2];
    dest[3] = value[3];
}

void fillRestricted3(@restrict uint* value, uint[] dest)
{
    dest[0] = value[0];
    dest[1] = value[1];
    dest[2] = value[2];
    dest[3] = value[3];
}

February 13

On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:

>

On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:

>

[...]

As a heads up, the LDC wiki page doesn't have restrict on it.

https://wiki.dlang.org/LDC-specific_language_changes

Does LDC's @restrict only work with pointers directly and not slices? fillRestricted2 doesn't compile (it fails only because value is a slice, not because dest is one), but fillRestricted3 compiles just fine.

void fillRestricted2(@restrict uint[] value, uint[] dest)
{
    dest[0] = value[0];
    dest[1] = value[1];
    dest[2] = value[2];
    dest[3] = value[3];
}

void fillRestricted3(@restrict uint* value, uint[] dest)
{
    dest[0] = value[0];
    dest[1] = value[1];
    dest[2] = value[2];
    dest[3] = value[3];
}

Sorry, I should note that while fillRestricted3 compiles, it has similar codegen to fillBP.
