July 25, 2012
On 25/07/12 09:55, Dmitry Olshansky wrote:
> On 25-Jul-12 11:51, Don Clugston wrote:
>> On 25/07/12 09:37, Walter Bright wrote:
>>> On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
>>>> It's pc => address because one can first preprocess all of byte code
>>>> doing
>>>> opcode => address rewrites. But you can't do it unless taking address
>>>> of labels
>>>> is possible.
>>>
>>> All right, that's the piece that was missing.
>>>
>>> I suppose it is possible for the compiler to recognize that the
>>> opcode=>address array is invariant, and optimize it out, but that would
>>> be a novel optimization. I don't know how hard it would be.
>>>
>>>
>>>  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431
>>>
>>> Thanks!
>>
>> Another interesting optimization with "final switch" would be if each
>> case has about the same code length.
>>
>> final switch(x) {
>> case C1:  ...
>> case C2:  ...
>> case C3:  ...
>> case C4:  ...
>> }
>> then if &(case c2) - &(case C1) == &(case C3) - &(case C2)
>>
>> change it to
>> goto (&case C1) + x *( &(case c2) - &(case C1) );
>>
>> so that there is no lookup table, just a multiply.
>
> Could be interesting if some other simple progressions could be
> hardcoded, if say branches are be sorted by length.

Ooh, that's an interesting idea too.

> Also modern CPU seem
> to be exceptionally fast at skipping NOPs so a bit of padding won't hurt.

And an unconditional branch takes no time (only one 1 uop) on modern CPUs too.


July 25, 2012
On 7/25/2012 12:51 AM, Don Clugston wrote:
> so that there is no lookup table, just a multiply.

Rethinking your idea a bit...

Suppose the switch jump_address[] array was really an array of hardcoded jmp instructions, 5 bytes each:

  jmp_table:
    jmp Lcase1;
    jmp Lcase2;
    jmp Lcase3;
    ...

and then the switch(EBX) would be:

    lea EAX,jmp_table[EBX][EBX*4]
    jmp EAX

is that kick-ass or what?

(There'd be some additional complication for PIC code.)
July 25, 2012
On 25/07/12 12:11, Walter Bright wrote:
> On 7/25/2012 12:51 AM, Don Clugston wrote:
>> so that there is no lookup table, just a multiply.
>
> Rethinking your idea a bit...
>
> Suppose the switch jump_address[] array was really an array of hardcoded
> jmp instructions, 5 bytes each:
>
>    jmp_table:
>      jmp Lcase1;
>      jmp Lcase2;
>      jmp Lcase3;
>      ...
>
> and then the switch(EBX) would be:
>
>      lea EAX,jmp_table[EBX][EBX*4]
>      jmp EAX
>
> is that kick-ass or what?
>
> (There'd be some additional complication for PIC code.)

Very nice. The jumps in the jump table take effectively zero cycles. That looks quite doable.



July 25, 2012
On 25-Jul-12 15:14, Don Clugston wrote:
> On 25/07/12 12:11, Walter Bright wrote:
>> On 7/25/2012 12:51 AM, Don Clugston wrote:
>>> so that there is no lookup table, just a multiply.
>>
>> Rethinking your idea a bit...
>>
>> Suppose the switch jump_address[] array was really an array of hardcoded
>> jmp instructions, 5 bytes each:
>>
>>    jmp_table:
>>      jmp Lcase1;
>>      jmp Lcase2;
>>      jmp Lcase3;
>>      ...
>>
>> and then the switch(EBX) would be:
>>
>>      lea EAX,jmp_table[EBX][EBX*4]
>>      jmp EAX
>>
>> is that kick-ass or what?
>>
>> (There'd be some additional complication for PIC code.)
>
> Very nice. The jumps in the jump table take effectively zero cycles.
> That looks quite doable.

Looks neat. I'd more then willing to test how it affects my tiny VM in std.regex.

-- 
Dmitry Olshansky
July 25, 2012
>> OK I've taken your comments into account.
>> Now I think I finally got it right:
>>
>> mov ecx, [ebx] ; ecx = code[pc]
>> inc ebx ; pc ++
>> jmp ecx ; goto code[pc], as ecx is already a pointer
>
> Nope, ecx is an opcode, not a pointer. You need another indirection.

Man this has been frustrating to read. I understood what Dmitry was talking about over at least dozen posts ago, and that's without actually reading the article about interpreters (I did write a SNES emulator once, but it didn't use this cool technique. I did, however, have to write it in assembly because the C version was dog-slow because e.g. I couldn't capture the overflow/negative/zero flags in C.)
July 25, 2012
On 7/25/2012 4:26 AM, Dmitry Olshansky wrote:
> On 25-Jul-12 15:14, Don Clugston wrote:
>> On 25/07/12 12:11, Walter Bright wrote:
>>> On 7/25/2012 12:51 AM, Don Clugston wrote:
>>>> so that there is no lookup table, just a multiply.
>>>
>>> Rethinking your idea a bit...
>>>
>>> Suppose the switch jump_address[] array was really an array of hardcoded
>>> jmp instructions, 5 bytes each:
>>>
>>>    jmp_table:
>>>      jmp Lcase1;
>>>      jmp Lcase2;
>>>      jmp Lcase3;
>>>      ...
>>>
>>> and then the switch(EBX) would be:
>>>
>>>      lea EAX,jmp_table[EBX][EBX*4]
>>>      jmp EAX
>>>
>>> is that kick-ass or what?
>>>
>>> (There'd be some additional complication for PIC code.)
>>
>> Very nice. The jumps in the jump table take effectively zero cycles.
>> That looks quite doable.
>
> Looks neat. I'd more then willing to test how it affects my tiny VM in std.regex.
>

Is it possible you could code it up and test it using inline asm?

July 25, 2012
On 25-Jul-12 21:19, Walter Bright wrote:
> On 7/25/2012 4:26 AM, Dmitry Olshansky wrote:
>> On 25-Jul-12 15:14, Don Clugston wrote:
>>> On 25/07/12 12:11, Walter Bright wrote:
>>>> On 7/25/2012 12:51 AM, Don Clugston wrote:
>>>>> so that there is no lookup table, just a multiply.
>>>>
>>>> Rethinking your idea a bit...
>>>>
>>>> Suppose the switch jump_address[] array was really an array of
>>>> hardcoded
>>>> jmp instructions, 5 bytes each:
>>>>
>>>>    jmp_table:
>>>>      jmp Lcase1;
>>>>      jmp Lcase2;
>>>>      jmp Lcase3;
>>>>      ...
>>>>
>>>> and then the switch(EBX) would be:
>>>>
>>>>      lea EAX,jmp_table[EBX][EBX*4]
>>>>      jmp EAX
>>>>
>>>> is that kick-ass or what?
>>>>
>>>> (There'd be some additional complication for PIC code.)
>>>
>>> Very nice. The jumps in the jump table take effectively zero cycles.
>>> That looks quite doable.
>>
>> Looks neat. I'd more then willing to test how it affects my tiny VM in
>> std.regex.
>>
>
> Is it possible you could code it up and test it using inline asm?

Mm... I could try. So the trick is to add say this:
Dispatch:
asm{
      ...
     lea EAX,jmp_table[EBX][EBX*4]
     jmp EAX
jmp_table:
      jmp Lcase1;
      jmp Lcase2;
      jmp Lcase3;
}
Lcase1:
	...
	goto Dispatch
instead of current switch and replace case with labels. Sounds not half bad.

Then I could even replace that one goto Dispatch with same
     lea EAX,jmp_table[EBX][EBX*4]
     jmp EAX

I'll give it a shot. The only thing that worries me is that I will step on compiler's toes breaking his register allocation scheme (it would have to work around my inline asm).
Any tips on which spare registers to use (I guess ecx is no go, as there is 'this' pointer present) ?

-- 
Dmitry Olshansky
July 25, 2012
On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
>> Is it possible you could code it up and test it using inline asm?
>
> Mm... I could try. So the trick is to add say this:
> Dispatch:
> asm{
>        ...
>       lea EAX,jmp_table[EBX][EBX*4]
>       jmp EAX
> jmp_table:
>        jmp Lcase1;
>        jmp Lcase2;
>        jmp Lcase3;
> }
> Lcase1:
>      ...
>      goto Dispatch
> instead of current switch and replace case with labels. Sounds not half bad.
>
> Then I could even replace that one goto Dispatch with same
>       lea EAX,jmp_table[EBX][EBX*4]
>       jmp EAX
>
> I'll give it a shot. The only thing that worries me is that I will step on
> compiler's toes breaking his register allocation scheme (it would have to work
> around my inline asm).

Use the same register for both schemes, and it should then give comparable results.

> Any tips on which spare registers to use (I guess ecx is no go, as there is
> 'this' pointer present) ?

I wouldn't worry about it. EAX is good.

July 25, 2012
On 25-Jul-12 21:47, Walter Bright wrote:
> On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
>>> Is it possible you could code it up and test it using inline asm?
...


>> Any tips on which spare registers to use (I guess ecx is no go, as
>> there is
>> 'this' pointer present) ?
>
> I wouldn't worry about it. EAX is good.
>
OK. I'm almost there, here is what I have:
	//my byte code sets 8-bit to discern code/data
	 uint c = re.ir[t.pc].code - 128;
	 //no idea how to code the above in asm
	 //.code is { return x & mask; } property
	 asm{
              mov EAX, c;
              lea EAX, L_jumptable[EAX][EAX*4];
              jmp EAX;
            }
        L_jumptable:
         mixin(`asm{`
              ~ genJumpTable()
              ~ `} `);

So I have proper table generated and it all goes fine untill I get:

std\regex.d(5118): Error: undefined identifier 'L_jumptable'

-- 
Dmitry Olshansky
July 25, 2012
On 7/25/2012 12:52 PM, Dmitry Olshansky wrote:
> On 25-Jul-12 21:47, Walter Bright wrote:
>> On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
>>>> Is it possible you could code it up and test it using inline asm?
> ...
>
>
>>> Any tips on which spare registers to use (I guess ecx is no go, as
>>> there is
>>> 'this' pointer present) ?
>>
>> I wouldn't worry about it. EAX is good.
>>
> OK. I'm almost there, here is what I have:
>      //my byte code sets 8-bit to discern code/data
>       uint c = re.ir[t.pc].code - 128;
>       //no idea how to code the above in asm
>       //.code is { return x & mask; } property
>       asm{
>                mov EAX, c;
>                lea EAX, L_jumptable[EAX][EAX*4];
>                jmp EAX;
>              }
>          L_jumptable:
>           mixin(`asm{`
>                ~ genJumpTable()
>                ~ `} `);
>
> So I have proper table generated and it all goes fine untill I get:
>
> std\regex.d(5118): Error: undefined identifier 'L_jumptable'
>

I was afraid of that. You may have to approximate it by loading the address of L_jumptable into a register and adding it in instead of using the addressing mode. Then, add those extra instructions as dummies into the other path.