Thread overview
Force LDC/LLVM to split a struct local (into registers)?
Sep 19
kinke
September 19
Hi,

This change increases the function's execution time by 50%:

https://github.com/CyberShadow/chunker/commit/inline-hash

To reproduce, get the code and run:

ldmd2 -Isrc -i -g -version=benchmarkChunker -O -inline -release -run src/chunker/package -N=100 Chunker

It looks like LLVM is keeping the fields on the stack, and isn't realizing that splitting the struct to move some fields into registers will result in faster code.

Is there something in the code preventing it from doing so? Or, is there a way to force it to split the struct?

September 19
On Thursday, 19 September 2019 at 21:40:11 UTC, Vladimir Panteleev wrote:
> Hi,
>
> This change increases the function's execution time by 50%:
>
> https://github.com/CyberShadow/chunker/commit/inline-hash
>
> To reproduce, get the code and run:
> [...]

Can reproduce on Win64 (after fixing up your temp file path changes ;)). As the function is pretty long, I didn't want to look at the changes in optimized IR (`-output-ll`), so I experimented a bit. The following patch restores previous performance:

@@ -374,8 +374,8 @@ struct Chunker(R)
                        foreach (_, b; buf[state.bpos .. state.bmax])
                        {
                                // slide(b)
-                               auto out_ = hash.window[hash.wpos];
-                               hash.window[hash.wpos] = b;
+                               auto out_ = state.hash.window[hash.wpos];
+                               state.hash.window[hash.wpos] = b;
                                hash.digest ^= ulong(tabout[out_].value);
                                hash.wpos++;
                                if (hash.wpos >= windowSize)
@@ -415,7 +415,8 @@ struct Chunker(R)
                                        return chunk;
                                }
                        }
-                       state.hash = hash;
+                       state.hash.wpos = hash.wpos;
+                       state.hash.digest = hash.digest;

                        auto steps = state.bmax - state.bpos;
                        if (steps > 0)

I.e., not reading from and touching the 64-bytes hash.window buffer copy in the loop, but the original buffer directly instead.
September 19
On Thursday, 19 September 2019 at 23:14:49 UTC, kinke wrote:
> Can reproduce on Win64 (after fixing up your temp file path changes ;)).

Ah, sorry about that, I had already fixed it on master.

> The following patch restores previous performance:

Thanks! That's interesting, though not really what I was trying to do. My goal was to move the "manually inlined" statements into a Hash method. I figured that if I could get the compiler to break up the Hash struct and put some fields into registers, it would have no trouble inlining a Hash method which worked with them.

For now I settled on passing all the locals to a free function:

https://github.com/CyberShadow/chunker/blob/12b5b2d543867a67db2a28b7eb70b25efef53e70/src/chunker/package.d#L399

This does inline nicely but still precludes encapsulating the logic into an "opaque" Hash struct while maintaining performance.

September 20
On Thursday, 19 September 2019 at 23:48:35 UTC, Vladimir Panteleev wrote:
> This does inline nicely but still precludes encapsulating the logic into an "opaque" Hash struct while maintaining performance.

Another thing that works is to copy a struct with only scalar types to the stack:

https://github.com/cybershadow/chunker/commit/inline-hash-2

Then it seems to be nicely get split up, and it does allow encapsulation, though with a setup and commit step.