May 31, 2023
On Wednesday, 31 May 2023 at 17:44:21 UTC, Richard (Rikki) Andrew Cattermole wrote:
> A concern here is that inline assembly is unlikely (if at all) to inline.
>
> So you're going to have to be pretty careful that what you do is actually worth the function call, because if it isn't simd, it just might not be doing enough work to justify using inline assembly.
>
> If you are able to get a backend to generate the instruction you want using regular D code, then you're good to go. As that'll inline.
>
> My general recommendation here is to not worry about specific instructions unless you really _really_ need to (very tiny percentage of code fits this, almost to the point of not being worth considering).
>
> Instead focus on making your D code communicate to the backend what you intend. Even if it doesn't do the job today, in 2 years time it could generate significantly better assembly.

Understood and agreed. I’m able to get functions to inline with no problems in GDC even when they contain inline-asm code. As you say, without inlining the overhead of a call can wipe out all of the benefit and it’s pointless. I’ve written test functions that use the instruction, and it all inlines perfectly, interfacing with register usage in a very flexible manner thanks to GCC/GDC’s superb design. LDC would perhaps be even better were it not for the inline-asm syntax wishlist item mentioned earlier: the current LDC would require me to rewrite all the asm to avoid using _named_ parameters within the asm body itself. That’s something I’d love to fix myself within LDC, but I don’t remotely have the knowledge of compiler internals nor the general expertise.
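For reference, the named-operand extended-asm style I mean looks like this. This is a C sketch, since GDC deliberately uses the same GCC syntax; the function name, operand names, and the rotate example itself are mine, and the asm path is x86-64 only, with a portable fallback elsewhere:

```c
#include <stdint.h>

/* Rotate left using a GCC-style asm template with symbolic
   operand names ([val], [cnt]) instead of positional %0/%1. */
static inline uint64_t rotl64(uint64_t value, uint64_t count)
{
#if defined(__x86_64__)
    __asm__("rolq %b[cnt], %[val]"   /* %b[cnt] selects the CL byte */
            : [val] "+r"(value)      /* read-write register operand */
            : [cnt] "c"(count)       /* "c" pins the count to RCX   */
            : "cc");                 /* flags are clobbered         */
    return value;
#else
    count &= 63;
    return count ? (value << count) | (value >> (64 - count)) : value;
#endif
}
```

The point of the symbolic names is exactly the maintainability issue above: the asm body stays readable and reorderable without renumbering positional operands.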

As for worrying about individual instructions, that isn’t my goal. This is partly a learning exercise and partly a way to make the instructions available to anyone who decides they want them; such users are assumed to have enough experience to make that decision based on performance, and I aim to give them a zero-overhead solution (unless D prevents me from doing so).
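To make the fallback side of that concrete: the pure-software equivalent of pext (sketched here in C for brevity; the D version is analogous and the helper name is mine) is just a loop over the set bits of the mask:

```c
#include <stdint.h>

/* Software fallback for BMI2's pext: gather the bits of x selected
   by mask and pack them contiguously at the low end of the result. */
static uint64_t pext_fallback(uint64_t x, uint64_t mask)
{
    uint64_t result = 0;
    unsigned  out   = 0;
    for (; mask != 0; mask &= mask - 1) {    /* clear lowest set bit each pass */
        uint64_t lowest = mask & (0 - mask); /* isolate lowest set bit */
        if (x & lowest)
            result |= 1ULL << out;
        out++;
    }
    return result;
}
```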
May 31, 2023
On Wednesday, 31 May 2023 at 17:09:38 UTC, Cecil Ward wrote:
> On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
>> On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
>>> On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
>>
>
> Ah, just followed that link. No that’s (solely?) SIMD, something I was aware of and so I’m not duplicating that as I haven’t gone near SIMD. The pext instruction would be one instruction that I attacked some time ago, and that would already be fine with ARM as there’s a pure D fallback, but maybe I can find some native ARM equivalent if I study AArch64.
>
> So no, this would be something new. Non-SIMD insns for general use. The smallest instructions might be something like andn if I can keep to zero-overhead obviously, seeing as the benefit in the instruction is so tiny anyway. But mind you I could have done with it for graphics bit twiddling manipulation code.

If you tell LDC the right CPU target, and to use optimization, e.g.

"-O -mcpu=haswell"

It will use the andn instruction...

uint foo(uint a, uint b)
{
    return a & (b ^ 0xFFFFFFFF);
}

compiles to ---->

uint example.foo(uint, uint):
        andn    eax, edi, esi
        ret

So you will probably find the compiler is already doing what you want if you let it know it can target the right CPU architecture.

I've been writing asm for over 30 years; the opportunities for beating modern compilers have become vanishingly small for pretty much everything except SIMD code. And tbh the differences between CPUs, i.e. different instruction latencies on different architectures, mean it's pretty much pointless to chase a few percent here or there, since there's a good chance it'll be a few percent the other way on a different CPU.



June 01, 2023
On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:
> On Wednesday, 31 May 2023 at 17:09:38 UTC, Cecil Ward wrote:
>> On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
>>> On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
>>>> On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
>>>
>>
>> Ah, just followed that link. No that’s (solely?) SIMD, something I was aware of and so I’m not duplicating that as I haven’t gone near SIMD. The pext instruction would be one instruction that I attacked some time ago, and that would already be fine with ARM as there’s a pure D fallback, but maybe I can find some native ARM equivalent if I study AArch64.
>>
>> So no, this would be something new. Non-SIMD insns for general use. The smallest instructions might be something like andn if I can keep to zero-overhead obviously, seeing as the benefit in the instruction is so tiny anyway. But mind you I could have done with it for graphics bit twiddling manipulation code.
>
> If you tell LDC the right CPU target, and to use optimization, e.g.
>
> "-O -mcpu=haswell"
>
> It will use the andn instruction...
>
> uint foo(uint a, uint b)
> {
>     return a & (b ^ 0xFFFFFFFF);
> }
>
> compiles to ---->
>
> uint example.foo(uint, uint):
>         andn    eax, edi, esi
>         ret
>
> So you will probably find the compiler is already doing what you want if you let it know it can target the right CPU architecture.
>
> I've been writing asm for over 30 years; the opportunities for beating modern compilers have become vanishingly small for pretty much everything except SIMD code. And tbh the differences between CPUs, i.e. different instruction latencies on different architectures, mean it's pretty much pointless to chase a few percent here or there, since there's a good chance it'll be a few percent the other way on a different CPU.

I couldn’t agree more. I wrote asm full time for about five years at an operating-systems outfit. But my aim was just to make these instructions available with zero overhead, and then, if I can somehow work out how to do it, make them switch over to fallbacks in pure D _still with zero overhead for the test_, which I think is damn near impossible. And when I originally thought about andn, that would be the ultimate challenge, because the benefit to be had is so very small that I would absolutely have to have zero overhead or it’s hopeless. So I wanted to see if I could get it to inline, checking the GDC and LDC compilers’ behaviour, but I haven’t been able to test for inlining in calls into an imported module from outside, from another .d file. I don’t have the tools right now, long story, but I will do something about that when I feel better; I’m quite unwell right now.
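To sketch the kind of zero-overhead compile-time dispatch I mean, here it is in C for illustration (the D equivalent would use version blocks or static if; the function name is mine):

```c
#include <stdint.h>

#if defined(__BMI__)
#include <immintrin.h>
/* Native path: a single andn instruction. Note the intrinsic's
   operand order: _andn_u32(x, y) computes ~x & y. */
static inline uint32_t andn32(uint32_t a, uint32_t b)
{
    return _andn_u32(b, a);       /* a & ~b */
}
#else
/* Portable fallback: compiles to not+and, or to a single andn
   anyway if the backend is allowed to target BMI. */
static inline uint32_t andn32(uint32_t a, uint32_t b)
{
    return a & ~b;
}
#endif
```

Because the selection happens at compile time and both paths are `static inline`, there is no dispatch cost at the call site, which is the whole game here.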

As for your insight into LDC and andn. Damn, I missed that. Many thanks for your help there. It’s not the first time I’ve seen this kind of excellent performance. I haven’t been using LDC enough because I am stuffed by the lack of support for
June 01, 2023
On Wednesday, 31 May 2023 at 17:44:21 UTC, Richard (Rikki) Andrew Cattermole wrote:
>
> Instead focus on making your D code communicate to the backend what you intend. Even if it doesn't do the job today, in 2 years time it could generate significantly better assembly.

For LDC, the least performance regression usually comes from some form of LDC's __ir_pure; however, it becomes slower to compile on large projects (up to 50 ms, which is the cost of decoding a 1500x1500 JPEG ;) ).
https://github.com/ldc-developers/ldc/issues/4388

As a reminder of what intel-intrinsics does:
  - implement the semantics of the Intel intrinsics, up to AVX (AVX2 is WIP)
  - on DMD x86/x86_64 + GDC x86_64 + LDC x86/x86_64/arm64/arm32
  - supporting a fallback for everything, even the SSE4.1 string instructions and rounding modes

Interestingly, if you use AVX intrinsics even without AVX code generation enabled, you can sometimes still get a speedup, thanks to the implicit loop unrolling.
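The intrinsics-with-fallback pattern, shown in C for illustration (intel-intrinsics does the equivalent in pure D; the helper name here is mine):

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Add two float arrays: 4 lanes at a time with SSE where available,
   plain scalar code as the tail and as the full fallback. */
static void addf(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i),
                                 _mm_loadu_ps(b + i)));
#endif
    for (; i < n; i++)            /* scalar tail / full fallback */
        out[i] = a[i] + b[i];
}
```

Either path computes the same results, so callers never need to know which one was compiled in.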

June 01, 2023
On Thursday, 1 June 2023 at 05:26:56 UTC, Cecil Ward wrote:
>>
>> I've been writing asm for over 30 years, the opportunities for beating modern compilers have gotten vanishingly small for pretty much everything except for SIMD code. And tbh the differences between CPUs, ie different instruction latency on different architectures, means it's pretty much pointless to chance few percent here or there, since there's a good chance it'll be a few percent the other way on a different CPU.
>
> I couldn’t agree more. I wrote asm full time for about five years at an operating systems outfit.

I'll join the party as an assembly lover :). There are vanishingly few places where it can make a big difference versus communicating better with the backend. I spent two years of my life working only on codec optimization with the Intel C++ compiler, and in the end we had one bit of x86 assembly left; that one used EFLAGS for multiple jumps from the same op. You can also sometimes win if your algorithm fits the exact register count, but with register renaming I'm not even sure. Often the assembly was better than the codegen, but the spilling code the compiler inserted before and after would make it worse, in addition to the lack of optimization. A big positive with asm is the build time, though!
June 01, 2023
On Thursday, 1 June 2023 at 05:26:56 UTC, Cecil Ward wrote:
> On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:
>
> As for your insight into LDC and andn. Damn, I missed that. Many thanks for your help there. It’s not the first time I’ve seen this kind of excellent performance. I haven’t been using LDC enough because I am stuffed by the lack of support for

You probably already know about it, but in case you don't: an easy way to see what the various D compilers are doing is to use

https://d.godbolt.org/

It recompiles and updates the disassembly as you type.