March 31, 2021
On 3/31/21 1:46 AM, Vladimir Panteleev wrote:
> On Wednesday, 31 March 2021 at 05:41:28 UTC, Andrei Alexandrescu wrote:
>> Affirmative if you consider the Nehalem modern:
> 
> Um, that was released 13 years ago.

It carried over afaik to all subsequent Intel CPUs: https://hexus.net/tech/reviews/cpu/147440-intel-core-i9-11900k/. Sunny Cove actually adds one extra AGU.

>> https://en.wikipedia.org/wiki/Address_generation_unit
> 
> In the picture it still goes through the instruction decoder first, which means LEA and ADD/SHR might as well get decoded to the same microcode.

That's not the case. It's separate hardware.

> That's the thing about this whole ordeal, we don't know anything. The only thing we *can* do is benchmark. :)

We can Read The Fine Manual.
March 30, 2021
On 3/30/2021 10:30 PM, Vladimir Panteleev wrote:
> On Wednesday, 31 March 2021 at 05:25:48 UTC, Walter Bright wrote:
>> It's a win because it uses the address decoder logic which is separate from the arithmetic logic unit. This enables it to be done in parallel with the ALU.
> 
> Is this still true for modern CPUs?

See https://www.agner.org/optimize/optimizing_assembly.pdf page 135.


>> Although not relevant for this particular example, it also doesn't need another register for the intermediate value.
> Haven't CPUs used register renaming for a long time now? It's also pretty rare to see x86_64 code that uses all registers.

If you use a register that needs to be saved on the stack, it's going to cost.
March 31, 2021
On Wednesday, 31 March 2021 at 06:34:04 UTC, Walter Bright wrote:
> On 3/30/2021 10:30 PM, Vladimir Panteleev wrote:
>> On Wednesday, 31 March 2021 at 05:25:48 UTC, Walter Bright wrote:
>>> It's a win because it uses the address decoder logic which is separate from the arithmetic logic unit. This enables it to be done in parallel with the ALU.
>> 
>> Is this still true for modern CPUs?
>
> See https://www.agner.org/optimize/optimizing_assembly.pdf page 135.

Thanks!

It also says that LEA may be slower than ADD on some CPUs.

I wrote a small benchmark using the assembler code from a few posts ago. Both versions take the same time on my AMD CPU, but the ADD version is indeed slower than the LEA version on the old Intel CPU on the server. :) Unfortunately I don't have access to a modern Intel CPU to test.
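For reference, a minimal sketch of that kind of micro-benchmark (not the actual code from the earlier post, which I'm not reproducing here), assuming DMD-style inline asm and an arbitrary iteration count. One loop carries its dependency chain through a single LEA, the other through the equivalent shift/add sequence:

/////////////////////////////////////
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

void viaLea()
{
    asm
    {
        mov RCX, 100000000;
        mov RAX, 1;
    L1:
        lea RAX, [RAX + RAX*4 + 3]; // x = x*5 + 3 in one instruction
        dec RCX;
        jnz L1;
    }
}

void viaAdd()
{
    asm
    {
        mov RCX, 100000000;
        mov RAX, 1;
    L1:
        mov RDX, RAX;
        shl RDX, 2;                 // x*4
        add RAX, RDX;               // x*5
        add RAX, 3;                 // x*5 + 3
        dec RCX;
        jnz L1;
    }
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    viaLea();
    writeln("lea: ", sw.peek);
    sw.reset();
    viaAdd();
    writeln("add: ", sw.peek);
}
/////////////////////////////////////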

>>> Although not relevant for this particular example, it also doesn't need another register for the intermediate value.
>> Haven't CPUs used register renaming for a long time now? It's also pretty rare to see x86_64 code that uses all registers.
>
> If you use a register that needs to be saved on the stack, it's going to cost.

Sure, but why would you do that? If I'm reading the ABI spec correctly, most registers are caller-saved (volatile), so the callee can use them without saving/restoring them, and there's probably little reason to call a function in the middle of such a computation and therefore save the interim value on the stack.

March 31, 2021
On Wednesday, 31 March 2021 at 05:54:43 UTC, Andrei Alexandrescu wrote:
>> In the picture it still goes through the instruction decoder first, which means LEA and ADD/SHR might as well get decoded to the same microcode.
>
> That's not the case. It's separate hardware.

Less with the talking, more with the benchmarking!

If what you say is true, then a sequence of add interleaved with lea should be faster than the equivalent sequence, but with add replacing the lea.  Benchmark code is here https://files.catbox.moe/2zzrwe.tar; on my system, the performance is identical.
March 31, 2021
On Wednesday, 31 March 2021 at 05:08:29 UTC, Andrei Alexandrescu wrote:
>
> I skimmed the article just now and... 1% is literally not found in the text, and 3% is a "guesstimate" per the author.

Look at the 'fsan ud' row of the only table.  1% is the performance penalty for 'zip', and 0% penalty for 'unzip'.

The 3% code size figure I got from looking at binaries on my own system. I had actually forgotten that the article talks about code size at all.
March 31, 2021
On 3/30/2021 11:54 PM, Vladimir Panteleev wrote:
> On Wednesday, 31 March 2021 at 06:34:04 UTC, Walter Bright wrote:
>> On 3/30/2021 10:30 PM, Vladimir Panteleev wrote:
>>> On Wednesday, 31 March 2021 at 05:25:48 UTC, Walter Bright wrote:
>>>> It's a win because it uses the address decoder logic which is separate from the arithmetic logic unit. This enables it to be done in parallel with the ALU.
>>>
>>> Is this still true for modern CPUs?
>>
>> See https://www.agner.org/optimize/optimizing_assembly.pdf page 135.
> 
> Thanks!
> 
> It also says that LEA may be slower than ADD on some CPUs.

Slower than ADD, but not slower than multiple ADDs. DMD does not replace a mere ADD with LEA. If you also look at how LEA is used in the various examples of optimized code in the pdf, well, he uses it a lot.
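For a concrete (hypothetical) illustration of the multiple-ADDs case, here is the kind of expression a backend can fold into one LEA; which instructions are actually emitted depends on the compiler and target:

// e.g. a single `lea RAX, [RDI + RSI*4 + 8]` (exact registers depend
// on the ABI) instead of a shift, an add, and another add.
long index(long base, long i)
{
    return base + i*4 + 8;
}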

> some CPUs

Code gen is generally targeted at generating code that works well on most machines.

>> If you use a register that needs to be saved on the stack, it's going to cost.
> Sure, but why would you do that?

To map as many locals into registers as possible.

> If I'm reading the ABI spec correctly, most registers are caller-saved (volatile), so the callee can use them without saving/restoring them, and there's probably little reason to call a function in the middle of such a computation and therefore save the interim value on the stack.

All I can say is code gen is never that simple. There are just too many rules that conflict. The combinatorial explosion means some heuristics are relied on that produce better results most of the time. I suppose a good AI research project would be to train an AI to produce better overall patterns.

But, in general,

1. LEA is faster for more than one operation
2. using fewer registers is better
3. getting locals into registers is better
4. generating fewer instructions is better
5. generating shorter instructions is better
6. jumpless code is better

None of these are *always* true. And Intel/AMD change the rules slightly with every new processor.

As for overflow checks, I am not going to post benchmarks because everyone picks at them. Every benchmark posted here by check proponents shows that overflow checks are slower. The Rust team apparently poured a lot of effort into overflow checks and ultimately failed, in the sense that the checks are turned off in release builds. I don't see much hope in replicating their efforts.

And, once again, I reiterate that D *does* have some overflow checks that are done at compile time (i.e. are free) in the form of integral promotions and Value Range Propagation, neither of which are part of Zig or Rust.
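A minimal illustration of both mechanisms, as the language defines them:

void main()
{
    int x = 1000;
    ubyte a = x & 0xFF;   // OK: VRP proves the result fits in 0 .. 255
    //ubyte b = x;        // error: cannot implicitly convert int to ubyte

    byte p = 100, q = 100;
    int sum = p + q;      // integral promotion: computed as int, sum == 200
    //byte r = p + q;     // error: the int result may not fit back into byte
}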
March 31, 2021
On Wednesday, 31 March 2021 at 07:13:07 UTC, Walter Bright wrote:
> All I can say is code gen is never that simple. There are just too many rules that conflict. The combinatorial explosion means some heuristics are relied on that produce better results most of the time. I suppose a good AI research project would be to train an AI to produce better overall patterns.
>
> But, in general,
>
> 1. LEA is faster for more than one operation
> 2. using fewer registers is better
> 3. getting locals into registers is better
> 4. generating fewer instructions is better
> 5. generating shorter instructions is better
> 6. jumpless code is better

Thanks for the insight!

My personal perspective is that:

- Silicon will keep getting faster and cheaper with time

- A 7%, a 14%, or even a 100% slowdown is relatively insignificant considering the overall march of progress: Moore's law, but also other factors such as the average size and complexity of programs, which will keep increasing as people expect software to do more things, and which will drown out "one-time" slowdowns such as integer overflow checks

- In the long term, people will invariably prefer programming languages which produce correct results (with less code), over programming languages whose benefit is only that they're faster.

So, it seems to me that Rust made the choice to only enable overflow checks in debug mode in order to be competitive with the programming languages of its time. I think Zig's design is the more future-proof - there will continue to be circumstances in which speed is preferable over correctness, such as video games (where an occasional wrong result is tolerable), so having distinct ReleaseFast and ReleaseSafe modes makes sense.

BTW, another data point alongside Rust and Zig is of course Python 3, in which all integers are BigInts (but with small numbers inlined in the value, akin to small string optimizations).

March 31, 2021
On 3/31/2021 12:31 AM, Vladimir Panteleev wrote:
> - Silicon will keep getting faster and cheaper with time
> 
> - A 7%, a 14%, or even a 100% slowdown is relatively insignificant considering the overall march of progress: Moore's law, but also other factors such as the average size and complexity of programs, which will keep increasing as people expect software to do more things, and which will drown out "one-time" slowdowns such as integer overflow checks

If you're running a data center, 1% translates to millions of dollars.


> - In the long term, people will invariably prefer programming languages which produce correct results (with less code), over programming languages whose benefit is only that they're faster.

People will prefer what makes them money :-)

D's focus is on memory safety, which is far more important than integer overflow.


> So, it seems to me that Rust made the choice to only enable overflow checks in debug mode in order to be competitive with the programming languages of its time. I think Zig's design is the more future-proof - there will continue to be circumstances in which speed is preferable over correctness, such as video games (where an occasional wrong result is tolerable), so having distinct ReleaseFast and ReleaseSafe modes makes sense.

Zig doesn't do much to prevent memory corruption. Memory safety will be the focus of D for the near future.


> BTW, another data point along Rust and Zig is of course Python 3, in which all integers are BigInts (but with small numbers inlined in the value, akin to small string optimizations).

Python isn't competitive with systems programming languages.

March 31, 2021
On Wednesday, 31 March 2021 at 07:52:31 UTC, Walter Bright wrote:
> On 3/31/2021 12:31 AM, Vladimir Panteleev wrote:
>> - Silicon will keep getting faster and cheaper with time
>> 
>> - A 7%, a 14%, or even a 100% slowdown is relatively insignificant considering the overall march of progress: Moore's law, but also other factors such as the average size and complexity of programs, which will keep increasing as people expect software to do more things, and which will drown out "one-time" slowdowns such as integer overflow checks
>
> If you're running a data center, 1% translates to millions of dollars.

You would think someone would have told that to all the companies running their services written in Ruby, JavaScript, etc.

Unfortunately, that hasn't been the case.

What remains most valuable is 1) time/money not lost due to wrong results / angry customers, and 2) developer time.

>> - In the long term, people will invariably prefer programming languages which produce correct results (with less code), over programming languages whose benefit is only that they're faster.
>
> People will prefer what makes them money :-)
>
> D's focus is on memory safety, which is far more important than integer overflow.

It most definitely is.

But I think sooner or later we will get to a point where memory safety is the norm, and writing code in memory-unsafe languages will be like writing raw assembler today. So, the standard for correctness will be higher.

March 31, 2021
On Wednesday, 31 March 2021 at 04:49:01 UTC, Andrei Alexandrescu wrote:
> On 3/31/21 12:47 AM, Andrei Alexandrescu wrote:
>> On 3/31/21 12:32 AM, tsbockman wrote:
>>> On Wednesday, 31 March 2021 at 03:32:40 UTC, Andrei Alexandrescu wrote:
>>>> On 3/30/21 7:01 PM, tsbockman wrote:
>>>>> Simply flipping compiler switches (the -ftrapv and -fwrapv flags in gcc Andrei mentioned earlier) won't work, because most high performance code contains some deliberate and correct examples of wrapping overflow, signed-unsigned reinterpretation, etc.
>>>>>
>>>>> Idiomatic Zig code (probably Ada, too) does contain this information. But, the selection of "real world" open source Zig code available for testing is limited right now, since Zig hasn't stabilized the language or the standard library yet.
>>>>
>>>> That's awfully close to "No true Scotsman".
>>>
>>> Just tossing out names of fallacies isn't really very helpful if you don't explain why you think it may apply here.
>>
>> I thought it's fairly clear

Thank you for explaining anyway.

>> - the claim is non-falsifiable: if code is faster without
>> checks, it is deemed so on account of tricks.

I've never disputed that unchecked code is, by nature, almost always faster than checked code, albeit often not by much. I haven't attributed unchecked code's speed advantage to "tricks" anywhere.

>> Code without checks could benefit of other, better
>> tricks, but their absence is explained by the small size of the
>> available corpus.
>
> s/Code without checks could benefit of other/Code with checks could benefit of other/

While I think it is true that "better tricks" can narrow the performance gap between checked and unchecked code, that is not at all what I was talking about in the paragraphs you labeled "No true Scotsman".

Consider a C++ program similar to the following D program:

/////////////////////////////////////
module app;

import std.stdio : writeln, readln;
import std.conv : parse;

N randLCG(N)() @safe
    if(is(N == int) || is(N == uint))
{
    static N state = N(211210973);
    // "Numerical Recipes" linear congruential generator:
    return (state = N(1664525) * state + N(1013904223)); // can and should wrap
}

double testDivisor(N, N divisor)(const(ulong) trials) @safe
    if(is(N == int) || is(N == uint))
{
    N count = 0;
    foreach(n; 0 .. trials)
        count += (randLCG!N() % divisor) == N(0); // can, but should *not* wrap
    return count / real(trials);
}

void main() {
    string input = readln();
    const trials = parse!ulong(input);
    writeln(testDivisor!( int, 3)(trials));
    writeln(testDivisor!(uint, 3)(trials));
}
/////////////////////////////////////

randLCG!( int) requires -fwrapv and NOT -ftrapv to work as intended.
randLCG!(uint) works correctly no matter what.
testDivisor!( int, 3) requires -ftrapv and NOT -fwrapv to detect unintended overflows.
testDivisor!(uint, 3) is always vulnerable to unintended overflow, with or without -ftrapv.

So, neither -ftrapv nor -fwrapv causes an idiomatic C++ program to detect unintended overflows without false positives in the general case. The compiler simply doesn't have enough information available to do so, regardless of how much performance we are willing to sacrifice. Instead, the source code of a C++ program must first be modified by a real human being to make it compatible with either -ftrapv or -fwrapv (which are mutually exclusive).
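As an aside, here is a minimal sketch of what carrying that per-site intent looks like in D using core.checkedint; the functions are illustrative only, not code from any benchmark:

/////////////////////////////////////
import core.checkedint : adds, muls;

// Wrapping is intended here, so the plain operators (2's-complement
// wrap is well defined for uint) express that directly:
uint lcgStep(uint state)
{
    return 1664525u * state + 1013904223u;
}

// Overflow would be a bug here, so the checked primitives make the
// intent explicit; `overflow` is a sticky flag set by adds/muls:
int scaledSum(int a, int b, int scale)
{
    bool overflow;
    immutable s = adds(a, b, overflow);
    immutable r = muls(s, scale, overflow);
    assert(!overflow, "integer overflow in scaledSum");
    return r;
}
/////////////////////////////////////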

The paper you linked earlier mentions this problem: "Finally often integer overflows are known to be intentional or the programmer has investigated it and determined it to be acceptable. To address these use cases while still being useful in reporting undesired integer overflows, a whitelist functionality was introduced to enable users to specify certain files or functions that should not be checked.
...
Second, our methodology for distinguishing intentional from unintentional uses of wraparound is manual and subjective. The manual effort required meant that we could only study a subset of the errors..."
(See sections 5.3 and 6.1 of https://dl.acm.org/doi/abs/10.1145/2743019)

Idiomatic Zig code already contains the information which the researchers behind that paper had to manually insert for all the C/C++ code they tested. That is why my tests were limited to Zig: I don't have the time or motivation to go and determine whether each and every potential overflow in GCC or Firefox or whatever is intentional, just so that I can benchmark them with -ftrapv enabled.