How is it possible that countUntil() generates a jump-table when the hayStack is a compile time array?
October 01, 2022

Hello,

I just wanted to optimize a byte -> index lookup by using a 256-element table instead of [1, 2, 3].countUntil(x), and I was amazed by what I found.
My lookup[x] solution was not faster at all, because LDC2 had amazingly optimized the linear search into a jump table.

Can anyone please tell me how this is possible?

I looked inside the countUntil() template and I see no static ifs for compile-time optimizations.

Is it true that the compiler notices there is a while loop over a static compile-time array, is clever enough to do all the calculations at compile time, and generates a completely unrolled loop? And then the optimizer transforms that into a jump table and dispatches with "jmp rax"?

If I'm right, this is not just std library magic; it works even with my own template functions.
I'm also getting crazy long compile times, so this has got to be the case :D

Thank you in advance.
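For concreteness, the comparison described above can be sketched like this (a minimal reproduction; the haystack values and function names are made up for illustration, not taken from the original code):

```d
import std.algorithm.searching : countUntil;

// Small compile-time-known haystack (illustrative values).
static immutable ubyte[] hayStack = [1, 2, 3, 5, 8, 13, 21, 34];

// Linear search over the static array -- at -O, LDC can lower this
// whole function to a jump table.
ptrdiff_t viaCountUntil(ubyte x)
{
    return hayStack.countUntil(x);
}

// The hand-rolled 256-entry lookup table it was benchmarked against,
// filled at compile time via CTFE.
static immutable byte[256] lookup = ()
{
    byte[256] t = -1;            // -1 means "not found"
    foreach (i, b; hayStack)
        t[b] = cast(byte) i;
    return t;
}();

ptrdiff_t viaTable(ubyte x)
{
    return lookup[x];
}
```

Both functions return the index of `x` in the haystack, or -1 if it is absent; the surprise reported above is that the optimizer makes the first one just as fast as the second.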

October 01, 2022
On Sat, Oct 01, 2022 at 01:20:08PM +0000, realhet via Digitalmars-d-learn wrote:
> Hello,
> 
> I just wanted to optimize a byte -> index lookup, by using a 256
> element table instead of using [1, 2, 3].countUntil(x) and I was
> amazed what I've found.
> My solution lookup[x] was not faster at all, because LDC2 amazingly
> optimized the linear search to a jump table.
> 
> Anyone please can tell me, how it is possible?

Because it's LDC. ;-) Or more specifically, the D front-end tells the LLVM back-end that the table size is statically known, and the back-end pattern-matches linear searches through fixed-size tables into a jump table.

Generally, I wouldn't bother optimizing minutiae like this unless the profiler indicates the hotspot is there. Just let the optimizer do its job; quite often it can do a better job than a human can.


> I looked inside the countUntil() template and I see no static ifs for
> compile time optimizations.

It has nothing to do with the Phobos code, it's a pattern recognized by LDC's optimizer.


> Is it true that: The compiler notices that there is a while loop on a static compile time array and it is clever enough to solve all the calculations in compile time and generate a completely unrolled loop? Then the optimizer transforms it to the jump table and do it with "jmp rax" ?

This is quite likely the case. I've personally seen LDC optimize an entire benchmark into a single instruction, because it figured out that it could run the function at compile-time and replace the entire function body with an instruction that just loads the final value.
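One way to see this for yourself is to dump LDC's optimized assembly and look for the indirect jump (a sketch; the file name is hypothetical and the exact flag spelling may vary between LDC versions):

```shell
# Emit optimized assembly instead of an object file.
ldc2 -O3 -release -output-s lookup.d

# Look for the jump-table dispatch in the generated assembly.
grep -n "jmp" lookup.s
```

If the optimization fired, the search function contains an indirect `jmp` through a register rather than a compare-and-branch loop.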


> If I'm right, this is not just std library magic, it works even with
> my own template functions.
> I'm also getting crazy long compile times, so this gotta be the case
> :D
[...]

Yep.


T

-- 
If it tastes good, it's probably bad for you.
October 01, 2022
On Saturday, 1 October 2022 at 13:49:12 UTC, H. S. Teoh wrote:
> On Sat, Oct 01, 2022 at 01:20:08PM +0000, realhet via Digitalmars-d-learn wrote:

That is very good to know. Thank you for the confirmation. Indeed, it is really clever.

I wrote a parser just for the structural elements of Dlang, using template string[] parameters to feed tokens to the specialized parser functions recursively(!). And it unrolled everything into jump tables. Insane :D

The only successful manual optimization I made was using the PCMPESTRI instruction to skip 16 bytes of text at a time when there is no match in a compile-time-constant 16-element character set.
I expected something like a 10x speedup from PCMPESTRI, but I only got 2x. I thought I was competing with poor linear searches only, but little did I know...