March 31, 2021
On Wednesday, 31 March 2021 at 04:59:08 UTC, Walter Bright wrote:
> * should be:
>
> 	lea    eax,1[eax + edi]

The lea is the exact same length as the sequence of moves, and may be harder to decode.  I fail to see how that's a win.
March 31, 2021
On 3/31/21 12:28 AM, Elronnd wrote:
> On Saturday, 27 March 2021 at 03:25:04 UTC, Walter Bright wrote:
>> 4. fast integer arithmetic is fundamental to fast code, not a mere micro-optimization. Who wants an overflow check on every pointer increment?
> 
> Dan Luu measures overflow checks as having an overall 1% performance impact for numeric-heavy C code. (https://danluu.com/integer-overflow/).  The code size impact is also very small, ~3%.
> 
> This isn't 'speculation', it's actual measurement.

I'm a bit surprised at how you put it. Are you sure that's an accurate representation of what the article says? I skimmed it just now and... the 1% figure is literally not found in the text, and the 3% is a "guesstimate" per the author. The actual measurements show much larger margins, closer to tsbockman's.
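
For concreteness, here is the kind of per-operation check being discussed, as a minimal C sketch (mine, not from the article) using the GCC/Clang __builtin_add_overflow intrinsic. On x86 the check typically compiles to the add followed by a jo (jump-on-overflow), and that extra, almost-never-taken branch is the cost the measurements are about:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Checked addition: one extra conditional branch per operation.
    static int32_t checked_add(int32_t a, int32_t b)
    {
        int32_t sum;
        if (__builtin_add_overflow(a, b, &sum)) {
            fprintf(stderr, "integer overflow: %d + %d\n", a, b);
            abort();
        }
        return sum;
    }

    int main(void)
    {
        printf("%d\n", checked_add(2000000000, 147483647)); // 2147483647: fits
        printf("%d\n", checked_add(2000000000, 147483648)); // aborts
        return 0;
    }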
March 31, 2021
On Wednesday, 31 March 2021 at 04:08:02 UTC, Andrei Alexandrescu wrote:
> Not much to write home about. The jumps scale linearly with the number of primitive operations:
>
> https://godbolt.org/z/r3sj1T4hc

Right, but as we both know, speed hasn't necessarily scaled with the number of instructions for many decades now.

Curiosity got the better of me and I played with this for a bit.

Here is my program:

https://dump.cy.md/d7b7ae5c2d15c8c0127fd96dd74909a1/main.zig

Two interesting observations:

1. The compiler (whether it's the Zig frontend or the LLVM backend) is smart about adding the checks. If it can prove that the values will never overflow, then the overflow checks aren't emitted. I had to trick it into thinking that they may overflow, when in practice they never will (see the C sketch at the end of this post).

1b. The compiler is actually so aware of the checks that, in one of my attempts to get it to always emit them, it generated both a checked and an unchecked version of the function, and called the unchecked version in the case where it knew overflow was impossible! Amazing!

2. After finally getting it to always generate the checks, and benchmarking the results, the difference in run time I'm seeing between ReleaseFast and ReleaseSafe is a measly 2.7%. The disassembly looks all right too: https://godbolt.org/z/3nY7Ee4ff

Personally, 2.7% is a price I'm willing to pay any day, if it helps save me from embarrassments like https://github.com/CyberShadow/btdu/issues/1 :)
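
(For those who don't read Zig, here is roughly the trick from observation 1 as a C sketch. It is not the actual program; opaque() is just a stand-in for whatever launders the value past the optimizer, and a checked build, e.g. Clang's -fsanitize=signed-integer-overflow, stands in for ReleaseSafe.)

    #include <stdint.h>

    // Stand-in for "a value whose range the compiler can't see": the
    // volatile read launders the constant, so the optimizer can no
    // longer prove the additions below stay in range and must keep
    // the checks.
    static int32_t opaque(int32_t x)
    {
        volatile int32_t v = x;
        return v;
    }

    int32_t sum_to(int32_t n)
    {
        int32_t sum = 0;
        // With a literal bound here the checks get elided
        // (observation 1); with the laundered bound they have to stay.
        for (int32_t i = 0; i < opaque(n); i++)
            sum += i;
        return sum;
    }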

March 30, 2021
On 3/30/2021 10:06 PM, Elronnd wrote:
> On Wednesday, 31 March 2021 at 04:59:08 UTC, Walter Bright wrote:
>> * should be:
>>
>>     lea    eax,1[eax + edi]
> 
> The lea is the exact same length as the sequence of moves, and may be harder to decode.  I fail to see how that's a win.

It's a win because it uses the address decoder logic, which is separate from the arithmetic logic unit. This enables it to execute in parallel with the ALU.

Although not relevant for this particular example, it also doesn't need another register for the intermediate value.
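
For instance, something like the following typically compiles to a single lea with GCC or Clang at -O2 (my example, not the code above; the exact registers are the compiler's choice):

    // Typically compiles to one instruction on x86-64:
    //     lea eax, [rdi + rsi + 1]
    // No scratch register needed, and (unlike add/inc) FLAGS is untouched.
    int add3(int a, int b)
    {
        return a + b + 1;
    }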
March 31, 2021
On Wednesday, 31 March 2021 at 05:25:48 UTC, Walter Bright wrote:
> It's a win because it uses the address decoder logic, which is separate from the arithmetic logic unit. This enables it to execute in parallel with the ALU.

Is this still true for modern CPUs?

> Although not relevant for this particular example, it also doesn't need another register for the intermediate value.

Haven't CPUs used register renaming for a long time now? It's also pretty rare to see x86_64 code that uses all registers.
March 30, 2021
On 3/30/2021 10:16 PM, Vladimir Panteleev wrote:
> 1. The compiler (whether it's the Zig frontend or the LLVM backend) is smart about adding the checks. If it can prove that the values will never overflow, then the overflow checks aren't emitted. I had to trick it into thinking that they may overflow, when in practice they never will.

The code uses hardcoded loop limits. Yes, the compiler can infer no overflow by knowing the limits of the value. In my experience, I rarely loop for a hardcoded number of times.
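
To illustrate with a C sketch (mine; assume a checked build such as Clang's -fsanitize=signed-integer-overflow): with a compile-time bound, the range analysis can discharge the check, while with a runtime bound it can't:

    // Bound known at compile time: i stays below 1000 and the sum fits
    // in an int, so a checked build can prove the additions safe and
    // emit no test (the whole function folds to the constant 499500).
    int sum_fixed(void)
    {
        int s = 0;
        for (int i = 0; i < 1000; i++)
            s += i;
        return s;
    }

    // Bound known only at run time: for large n the sum genuinely can
    // overflow, so the checks must stay.
    int sum_runtime(int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += i;
        return s;
    }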

March 31, 2021
On Wednesday, 31 March 2021 at 05:32:16 UTC, Walter Bright wrote:
> On 3/30/2021 10:16 PM, Vladimir Panteleev wrote:
>> 1. The compiler (whether it's the Zig frontend or the LLVM backend) is smart about adding the checks. If it can prove that the values will never overflow, then the overflow checks aren't emitted. I had to trick it into thinking that they may overflow, when in practice they never will.
>
> The code uses hardcoded loop limits. Yes, the compiler can infer no overflow by knowing the limits of the value. In my experience, I rarely loop for a hardcoded number of times.

Well, this is fake artificial code, and looping a fixed number of times is just one aspect of its fakeness. If you do loop an unpredictable number of times in your real program, then you almost certainly do need the overflow check, so emitting it would be the right thing to do there :)

March 31, 2021
On 3/31/21 1:30 AM, Vladimir Panteleev wrote:
> On Wednesday, 31 March 2021 at 05:25:48 UTC, Walter Bright wrote:
>> It's a win because it uses the address decoder logic, which is separate from the arithmetic logic unit. This enables it to execute in parallel with the ALU.
> 
> Is this still true for modern CPUs?

Affirmative if you consider the Nehalem modern: https://en.wikipedia.org/wiki/Address_generation_unit
March 31, 2021
On Wednesday, 31 March 2021 at 05:41:28 UTC, Andrei Alexandrescu wrote:
> Affirmative if you consider the Nehalem modern:

Um, that was released 13 years ago.

> https://en.wikipedia.org/wiki/Address_generation_unit

In the picture, it still goes through the instruction decoder first, which means LEA and ADD/SHR might as well get decoded to the same micro-ops.

That's the thing about this whole ordeal: we don't know anything. The only thing we *can* do is benchmark. :)
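
Along those lines, here's roughly how I'd benchmark this particular question (a sketch, not results: GCC/Clang extended asm, and the wall-clock timing and iteration count are crude placeholders for proper perf-counter measurement):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    // Each loop repeats one operation in a serial dependency chain
    // on x, so this measures latency rather than throughput.
    enum { N = 1000000000 };

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        uint64_t x = 1, y = 3;
        double t0 = seconds();
        for (long i = 0; i < N; i++)
            asm volatile("lea 1(%0,%1), %0" : "+r"(x) : "r"(y)); // x = x + y + 1
        double t1 = seconds();
        for (long i = 0; i < N; i++)
            asm volatile("add %1, %0\n\tinc %0" : "+r"(x) : "r"(y)); // same, via the ALU
        double t2 = seconds();
        printf("x = %llu\n", (unsigned long long)x); // keep the result live
        printf("lea:     %.3f s\nadd+inc: %.3f s\n", t1 - t0, t2 - t1);
        return 0;
    }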

March 31, 2021
On 3/31/21 1:16 AM, Vladimir Panteleev wrote:
> On Wednesday, 31 March 2021 at 04:08:02 UTC, Andrei Alexandrescu wrote:
>> Not much to write home about. The jumps scale linearly with the number of primitive operations:
>>
>> https://godbolt.org/z/r3sj1T4hc
> 
> Right, but as we both know, speed hasn't necessarily scaled with the number of instructions for many decades now.

Of course, and I wasn't suggesting the contrary. If speed simply increased as instructions retired decreased, inliners would be much more aggressive, etc. But such statements need to be carefully qualified, which is why I do my best not to make them in isolation. The qualification here would be... "except in most of the cases, when it does". Instructions retired is generally a telling proxy.

> Curiosity got the better of me and I played with this for a bit.
> 
> Here is my program:
> 
> https://dump.cy.md/d7b7ae5c2d15c8c0127fd96dd74909a1/main.zig
> 
> Two interesting observations:
> 
> 1. The compiler (whether it's the Zig frontend or the LLVM backend) is smart about adding the checks. If it can prove that the values will never overflow, then the overflow checks aren't emitted. I had to trick it into thinking that they may overflow, when in practice they never will.
> 
> 1b. The compiler is actually so aware of the checks that, in one of my attempts to get it to always emit them, it generated both a checked and an unchecked version of the function, and called the unchecked version in the case where it knew overflow was impossible! Amazing!
> 
> 2. After finally getting it to always generate the checks, and benchmarking the results, the difference in run time I'm seeing between ReleaseFast and ReleaseSafe is a measly 2.7%. The disassembly looks all right too: https://godbolt.org/z/3nY7Ee4ff
> 
> Personally, 2.7% is a price I'm willing to pay any day, if it helps save me from embarrassments like https://github.com/CyberShadow/btdu/issues/1 :)

That's in line with expectations for a small benchmark. On larger applications, the impact of bigger code on the instruction cache would be more detrimental. (Also, the branch predictor is a limited resource, so more jumps mean decreased predictability of other branches; not sure how that compares in magnitude with the impact on the instruction cache, which is a larger and more common problem.)