On Friday, 4 February 2022 at 21:13:10 UTC, Walter Bright wrote:
> The integral promotion rules came about because of how the PDP-11 instruction set worked, as C was developed on an -11. But this has carried over into modern CPUs. Consider:
> void tests(short* a, short* b, short* c) { *c = *a * *b; }
> 0F B7 07        movzx  EAX,word ptr [RDI]
> 66 0F AF 06     imul   AX,[RSI]
> 66 89 02        mov    [RDX],AX
> C3              ret
> void testi(int* a, int* b, int* c) { *c = *a * *b; }
> 8B 07           mov    EAX,[RDI]
> 0F AF 06        imul   EAX,[RSI]
> 89 02           mov    [RDX],EAX
> C3              ret
> You're paying a 3-byte size penalty for using short arithmetic rather than int arithmetic. It's slower, too.
Larger code size surely puts more pressure on the instruction cache, but the resulting slowdown is most likely barely measurable on modern processors.
> Generally speaking, int should be used for most calculations, short and byte for storage.
> (Modern CPUs have long been deliberately optimized and tuned for C semantics.)
I generally agree, but this only holds for regular scalar code. Autovectorized code taking advantage of SIMD instructions looks quite different. Consider:
void tests(short* a, short* b, short* c, int n) { while (n--) *c++ = *a++ * *b++; }
<...>
50: f3 0f 6f 04 07 movdqu (%rdi,%rax,1),%xmm0
55: f3 0f 6f 0c 06 movdqu (%rsi,%rax,1),%xmm1
5a: 66 0f d5 c1 pmullw %xmm1,%xmm0
5e: 0f 11 04 02 movups %xmm0,(%rdx,%rax,1)
62: 48 83 c0 10 add $0x10,%rax
66: 4c 39 c0 cmp %r8,%rax
69: 75 e5 jne 50 <tests+0x50>
<...>
7 instructions, doing 8 multiplications per inner loop iteration.
void testi(int* a, int* b, int* c, int n) { while (n--) *c++ = *a++ * *b++; }
<...>
188: f3 0f 6f 04 07 movdqu (%rdi,%rax,1),%xmm0
18d: f3 0f 6f 0c 06 movdqu (%rsi,%rax,1),%xmm1
192: 66 0f 38 40 c1 pmulld %xmm1,%xmm0
197: 0f 11 04 02 movups %xmm0,(%rdx,%rax,1)
19b: 48 83 c0 10 add $0x10,%rax
19f: 4c 39 c0 cmp %r8,%rax
1a2: 75 e4 jne 188 <testi+0x48>
<...>
7 instructions, doing 4 multiplications per inner loop iteration.
Code size increases considerably, because large prologue and epilogue parts surround the inner loop. But performance improves dramatically when processing large arrays. And the 16-bit version is roughly twice as fast as the 32-bit version, because each 128-bit XMM register holds either 8 shorts or 4 ints.
If we want the D language to be SIMD-friendly, then discouraging the use of short and byte types for local variables isn't the best idea.