July 25, 2012
On 25-Jul-12 23:58, Walter Bright wrote:
> On 7/25/2012 12:52 PM, Dmitry Olshansky wrote:
>> On 25-Jul-12 21:47, Walter Bright wrote:
>>> On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
>>>>> Is it possible you could code it up and test it using inline asm?
>> ...
>>
>>
>>>> Any tips on which spare registers to use (I guess ecx is no go, as
>>>> there is
>>>> 'this' pointer present) ?
>>>
>>> I wouldn't worry about it. EAX is good.
>>>
>> OK. I'm almost there, here is what I have:
>>      //my byte code sets 8-bit to discern code/data
>>       uint c = re.ir[t.pc].code - 128;
>>       //no idea how to code the above in asm
>>       //.code is { return x & mask; } property
>>       asm{
>>                mov EAX, c;
>>                lea EAX, L_jumptable[EAX][EAX*4];
>>                jmp EAX;
>>              }
>>          L_jumptable:
>>           mixin(`asm{`
>>                ~ genJumpTable()
>>                ~ `} `);
>>
>> So I have proper table generated and it all goes fine untill I get:
>>
>> std\regex.d(5118): Error: undefined identifier 'L_jumptable'
>>
>
> I was afraid of that. You may have to approximate it by loading the
> address of L_jumptable into a register and adding it in instead of using
> the addressing mode.
like this ?
	mov EDX, L_jumpable
	move EAX, EDX[EAX][EAX*4]
doesn't work. Seems like label is nonexistent anywhere but jump instruction.

Will this one do it:
	lea EAX, $[EAX+5][EAX*4];
        jmp EAX;

Compiles. Maybe I miscalculated though. Indirect jump has size of ?

>Then, add those extra instructions as dummies into
> the other path.

Ehm? Other path like what? You mean compare it with jump table that uses plain addresses? Please expand a bit.


-- 
Dmitry Olshansky
July 25, 2012
On 26-Jul-12 00:06, Dmitry Olshansky wrote:
> On 25-Jul-12 23:58, Walter Bright wrote:
>> On 7/25/2012 12:52 PM, Dmitry Olshansky wrote:
>>> On 25-Jul-12 21:47, Walter Bright wrote:
>>>> On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
>>>>>> Is it possible you could code it up and test it using inline asm?
>>> ...
>>>
>>>
>>>>> Any tips on which spare registers to use (I guess ecx is no go, as
>>>>> there is
>>>>> 'this' pointer present) ?
>>>>
>>>> I wouldn't worry about it. EAX is good.
>>>>
>>> OK. I'm almost there, here is what I have:
>>>      //my byte code sets 8-bit to discern code/data
>>>       uint c = re.ir[t.pc].code - 128;
>>>       //no idea how to code the above in asm
>>>       //.code is { return x & mask; } property
>>>       asm{
>>>                mov EAX, c;
>>>                lea EAX, L_jumptable[EAX][EAX*4];
>>>                jmp EAX;
>>>              }
>>>          L_jumptable:
>>>           mixin(`asm{`
>>>                ~ genJumpTable()
>>>                ~ `} `);
>>>
>>> So I have proper table generated and it all goes fine untill I get:
>>>
>>> std\regex.d(5118): Error: undefined identifier 'L_jumptable'
>>>
>>
>> I was afraid of that. You may have to approximate it by loading the
>> address of L_jumptable into a register and adding it in instead of using
>> the addressing mode.
> like this ?
>      mov EDX, L_jumptable
>      mov EAX, EDX[EAX][EAX*4]
Corrected typos. Still it doesn't take off.
-- 
Dmitry Olshansky
July 25, 2012
On 07/25/2012 01:06 PM, Dmitry Olshansky wrote:
> On 25-Jul-12 23:58, Walter Bright wrote:

>> I was afraid of that. You may have to approximate it by loading the
>> address of L_jumptable into a register and adding it in instead of using
>> the addressing mode.
> like this ?
> mov EDX, L_jumpable

I hope it is just that typo! :p

Ali

July 25, 2012
>>>> std\regex.d(5118): Error: undefined identifier 'L_jumptable'
>>>>
>>>
>>> I was afraid of that. You may have to approximate it by loading the
>>> address of L_jumptable into a register and adding it in instead of using
>>> the addressing mode.

I failed to load it in any register or interact with it in any way.
I think I've stalled. There has to be a way to get label address somehow, I got tired of guess and shot :(

BTW that would allow us to do computed gotos but only in inline asm.

-- 
Dmitry Olshansky
July 25, 2012
On 7/25/2012 2:55 PM, Dmitry Olshansky wrote:
>>>>> std\regex.d(5118): Error: undefined identifier 'L_jumptable'
>>>>>
>>>>
>>>> I was afraid of that. You may have to approximate it by loading the
>>>> address of L_jumptable into a register and adding it in instead of using
>>>> the addressing mode.
>
> I failed to load it in any register or interact with it in any way.
> I think I've stalled. There has to be a way to get label address somehow, I got
> tired of guess and shot :(
>
> BTW that would allow us to do computed gotos but only in inline asm.
>

How to get an address:

     call jump_table
L1:

     ...
jump_table:
     pop EAX
     jmp L1
     .. the rest of the jump table ...

Yes, it's awful, but we're just trying to take some measurements here, so it doesn't matter.

What I meant was add these extra instructions in to the switch version as dummies in order to make the extra time they take irrelevant to looking at the difference.

July 26, 2012
On 7/25/2012 10:19 AM, Walter Bright wrote:
> Is it possible you could code it up and test it using inline asm?

I dummied up some code to do it:

int test(int i)
{
    switch (i)
    {
        case 3: i += 3; break;
        case 4: i += 4; break;
        case 5: i += 5; break;
        case 6: i += 6; break;
        case 7: i += 7; break;
        case 8: i += 8; break;
        default: i += 100; break;
    }
    return i;
}

                enter   4,0
                push    EBX
                mov     -4[EBP],EAX
                mov     EBX,EAX
                sub     EBX,3
                cmp     EBX,5
                ja      L5D
                lea     ECX,_D3foo4testFiZi[01Bh][EBX*4][EBX]
                jmp     ECX
                jmp     near ptr L39
                jmp     near ptr L3F
                jmp     near ptr L45
                jmp     near ptr L4B
                jmp     near ptr L51
                jmp     near ptr L57
L39:            add     dword ptr -4[EBP],3
                jmp short       L61
L3F:            add     dword ptr -4[EBP],4
                jmp short       L61
L45:            add     dword ptr -4[EBP],5
                jmp short       L61
L4B:            add     dword ptr -4[EBP],6
                jmp short       L61
L51:            add     dword ptr -4[EBP],7
                jmp short       L61
L57:            add     dword ptr -4[EBP],8
                jmp short       L61
L5D:            add     dword ptr -4[EBP],064h
L61:            mov     EAX,-4[EBP]
                pop     EBX
                leave
                ret

===============================================
Sadly, it's significantly slower than:

                enter   4,0
                push    EBX
                mov     -4[EBP],EAX
                mov     EBX,EAX
                sub     EBX,3
                cmp     EBX,5
                ja      L3D
                jmp     dword ptr FLAT:_DATA[00h][EBX*4]
                add     dword ptr -4[EBP],3
                jmp short       L41
                add     dword ptr -4[EBP],4
                jmp short       L41
                add     dword ptr -4[EBP],5
                jmp short       L41
                add     dword ptr -4[EBP],6
                jmp short       L41
                add     dword ptr -4[EBP],7
                jmp short       L41
                add     dword ptr -4[EBP],8
                jmp short       L41
L3D:            add     dword ptr -4[EBP],064h
L41:            mov     EAX,-4[EBP]
                pop     EBX
                leave
                ret

Maybe the branch prediction logic doesn't work for JMP ECX.
July 26, 2012
On 7/25/2012 11:24 PM, Walter Bright wrote:
>                  jmp     near ptr L39
>                  jmp     near ptr L3F
>                  jmp     near ptr L45
>                  jmp     near ptr L4B
>                  jmp     near ptr L51
>                  jmp     near ptr L57

I tried replacing these with 2 byte jmp instructions, instead of 5 byte ones, but no improvement.
July 26, 2012
On 26/07/12 00:46, Walter Bright wrote:
> On 7/25/2012 2:55 PM, Dmitry Olshansky wrote:
>>>>>> std\regex.d(5118): Error: undefined identifier 'L_jumptable'
>>>>>>
>>>>>
>>>>> I was afraid of that. You may have to approximate it by loading the
>>>>> address of L_jumptable into a register and adding it in instead of
>>>>> using
>>>>> the addressing mode.
>>
>> I failed to load it in any register or interact with it in any way.
>> I think I've stalled. There has to be a way to get label address
>> somehow, I got
>> tired of guess and shot :(
>>
>> BTW that would allow us to do computed gotos but only in inline asm.
>>
>
> How to get an address:
>
>       call jump_table
> L1:
>
>       ...
> jump_table:
>       pop EAX
>       jmp L1
>       .. the rest of the jump table ...
>
> Yes, it's awful, but we're just trying to take some measurements here,
> so it doesn't matter.
>
> What I meant was add these extra instructions in to the switch version
> as dummies in order to make the extra time they take irrelevant to
> looking at the difference.

But doing that screws up the CPU"s stack prediction so badly that it will dominate the timing
At least do something like:

jump_table:
      move EAX, [ESP]
      ret


July 26, 2012
On 26-Jul-12 10:24, Walter Bright wrote:
> On 7/25/2012 10:19 AM, Walter Bright wrote:
>> Is it possible you could code it up and test it using inline asm?
>
> I dummied up some code to do it:
>
> int test(int i)
> {
>      switch (i)
>      {
>          case 3: i += 3; break;
>          case 4: i += 4; break;
>          case 5: i += 5; break;
>          case 6: i += 6; break;
>          case 7: i += 7; break;
>          case 8: i += 8; break;
>          default: i += 100; break;
>      }
>      return i;
> }
>

Do the above in loop. And more cases of course. Something around 40 should do. I'm still figuring out why my inline asm version segfaults :)

>                  enter   4,0
>                  push    EBX
>                  mov     -4[EBP],EAX
>                  mov     EBX,EAX
>                  sub     EBX,3
>                  cmp     EBX,5
>                  ja      L5D
>                  lea     ECX,_D3foo4testFiZi[01Bh][EBX*4][EBX]
>                  jmp     ECX
>                  jmp     near ptr L39
>                  jmp     near ptr L3F
>                  jmp     near ptr L45
>                  jmp     near ptr L4B
>                  jmp     near ptr L51
>                  jmp     near ptr L57
> L39:            add     dword ptr -4[EBP],3
>                  jmp short       L61
> L3F:            add     dword ptr -4[EBP],4
>                  jmp short       L61
> L45:            add     dword ptr -4[EBP],5
>                  jmp short       L61
> L4B:            add     dword ptr -4[EBP],6
>                  jmp short       L61
> L51:            add     dword ptr -4[EBP],7
>                  jmp short       L61
> L57:            add     dword ptr -4[EBP],8
>                  jmp short       L61
> L5D:            add     dword ptr -4[EBP],064h
> L61:            mov     EAX,-4[EBP]
>                  pop     EBX
>                  leave
>                  ret
>
> ===============================================
> Sadly, it's significantly slower than:
>
>                  enter   4,0
>                  push    EBX
>                  mov     -4[EBP],EAX
>                  mov     EBX,EAX
>                  sub     EBX,3
>                  cmp     EBX,5
>                  ja      L3D
>                  jmp     dword ptr FLAT:_DATA[00h][EBX*4]
>                  add     dword ptr -4[EBP],3
>                  jmp short       L41
>                  add     dword ptr -4[EBP],4
>                  jmp short       L41
>                  add     dword ptr -4[EBP],5
>                  jmp short       L41
>                  add     dword ptr -4[EBP],6
>                  jmp short       L41
>                  add     dword ptr -4[EBP],7
>                  jmp short       L41
>                  add     dword ptr -4[EBP],8
>                  jmp short       L41
> L3D:            add     dword ptr -4[EBP],064h
> L41:            mov     EAX,-4[EBP]
>                  pop     EBX
>                  leave
>                  ret
>
> Maybe the branch prediction logic doesn't work for JMP ECX.


-- 
Dmitry Olshansky
July 26, 2012
On 26-Jul-12 11:56, Don Clugston wrote:
> But doing that screws up the CPU"s stack prediction so badly that it
> will dominate the timing
> At least do something like:
>
> jump_table:
>        move EAX, [ESP]
>        ret
>

BTW this seems to be a roundabout way to get address of label
that I can use do a threaded code interpreter.
Basically every branch is:

L_opcode_1:
	asm { mov EAX, [ESP];
		ret;
	}
	... real code here
L_opcode_2:
	asm { mov EAX, [ESP];
		ret;
	}
	... real code here

Then "compile" step looks like this:
while(not_last_opcode(code[pc]){
	size_t c = code[pc];
	switch(c){
	case op_1:
		asm{ call L_opcode_1; add EAX, 4; mov c, EAX; }
		break;
	case op_2:
		...
	}
	code[pc] = c; //now we have label address
	pc++;
}

//interpret:
pc = 0;
size_t op = code[pc];
asm { mov EAX, op; jump eax; } //here we go, almost computed goto

Obviously it's backwards and awful. Makes me wonder why can't we take it directly, what's limitation ?
How about allowing it, at least in inline assembly?


-- 
Dmitry Olshansky