July 24, 2012
On 7/24/2012 12:50 PM, Dmitry Olshansky wrote:
>> are the same (same as in same number of indirections).
>>
>>      switch (code[pc++])
>>
>> and
>>
>>      goto code[pc++]()
>>
>> are the same, too.
>
> It's not. Let's get to assembly then. It's an illustration as I'm no expert and
> may have made some illegal shortcuts in this listing.
>
> goto code[pc++]() is roughly:
>
> mov ecx, [ebx]
> inc ebx
> jmp [ecx]

   jmp code[ecx]


> switch(code[pc++]) is:
>
> mov ecx, [ebx]
> inc ebx
> mov ecx, [edx+ecx] // assuming jump table is at edx
> jump [ecx]

   jmp jumptable[ecx]


> If you count only jumps, then yes, the same number of indirect jumps. BUT note
> the use of extra register to point to the table & extra read of jump table
> contents. (BTW I assumed jump table address is loaded in register, a luxurious
> assumption esp. on 32bit).

You don't need an extra register for the jump table address - and if you did, you'd need it for both, as the table needs to be referenced somehow.

Addressing modes have long been "for free" in turns of runtime cost, so

   [ECX]

and

   offset[ECX]

are the same cost.


> Again, the biggest practical limitation of switches (loosing some performance
> hurts but not show stopper) is that last time I checked dmd doesn't try to merge
> equivalent jump tables.

I can't think of an example of this.


> Thus I can't put VM dispatch switch at the of each branch of main opcode switch
> (see my earlier posts) to help branch predictor. It just spawns ton of new
> tables, of course it has lower performance and wastes data cache.

Please post source code example so I understand what you mean.
July 24, 2012
On 25-Jul-12 01:40, Walter Bright wrote:
> On 7/24/2012 12:50 PM, Dmitry Olshansky wrote:
>>> are the same (same as in same number of indirections).
>>>
>>>      switch (code[pc++])
>>>
>>> and
>>>
>>>      goto code[pc++]()
>>>
>>> are the same, too.
>>
>> It's not. Let's get to assembly then. It's an illustration as I'm no
>> expert and
>> may have made some illegal shortcuts in this listing.
>>
>> goto code[pc++]() is roughly:
>>
>> mov ecx, [ebx]
>> inc ebx
>> jmp [ecx]
>
>     jmp code[ecx]
>

There is no code just jump [ecx]. The offset is included already.

>
>> switch(code[pc++]) is:
>>
>> mov ecx, [ebx]
>> inc ebx
>> mov ecx, [edx+ecx] // assuming jump table is at edx
>> jump [ecx]
>
>     jmp jumptable[ecx]

Damn, it's not the same. If in the above ecx is pc++, here it it code[pc++] i.e. needs to be loaded. The whole point. And yes, I *didn't* object that it's still 1 JUMP. I'm more focused on extra work being done.

> Please post source code example so I understand what you mean.

OK. It sure gets confusing.
Here is an article that shows the big picture of to what I want to do: http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf
It contains some advanced techniques that are out of scope of current topic but Introduction & Background are short and full of relevant facts.

And for the sample code here it is, casts are ommited for brevity.

First one is "if I had a way to have enforced tail call to function or take address of label".

Let's assume labels:

//GLOBALS
size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's DMDscript ;)

bool jitted = false;

void run(size_t pc)
{
	//so here bytecode is normal bytecode if jitted != true
	//for simplicity sake it starts with some number e.g.:
	// 0 - op_1
	// 1 - op_2, etc.
	if(!jitted)
	{
		int i=0;
		//1:1 map of each label that servers specific opcode
		static tabulated_ops[] =
			[ &L_op1, &L_op2, &L_op3, ... ];

		while(notEndOfProgram(bytecode[i]) )
		{
			size_t n = bytecode[i];
			//store real target
			bytecode[i] = tabulated_ops[n];
			//advance by some size, skipping operands etc.
			i += opSize(n);
		}
	}

	//interpret:
	goto bytecode[pc];
	//<---- never gets here
L_op1:
	//do my op1 thing
	pc += x;
//DISPATH:
	goto bytecode[pc]; // load address at pc & jump to it
L_op2:
	//do some other thing
	pc += y;
//DISPATH:

	goto bytecode[pc]; // load address at pc & jump to it
L_op3:
	//my other op, jumps back
	pc -= ...;
//DISPATH:

	goto bytecode[pc];  // load address at pc & jump to it
...
L_opN:
	...
	goto bytecode[pc];  // load address at pc & jump to it
}



Now the same thing with switches.
And before you ask -
NO! Single switch won't do as it would have successful branch prediction rate of as bad as about 2% (see the article!). The more opcode types the worse prediction is.

Anyway here it goes:

//GLOBALS
size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's DMDscript ;)

void run(size_t pc)
{
	//here we don't JIT to real addresses beforehand
	//as jump tables somehow should be good enough
	// Okay...
	
	//interpret:
	switch(bytecode[pc])
	{
	L_op1:
	case 0:
		//do my op1 thing
		pc += x;
	//DISPATCH:
	//here comes trouble - 1st of N nearly identical tables??
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
	L_op2:
	case 1:
		//do some other thing
		pc += y;
	//DISPATCH:
	//here comes trouble - 2nd table?
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
	L_op3:
	case 2:
		//my other op, jumps back
		pc -= ...;
	//DISPATCH:
	//here comes trouble - 3rd table?
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
...
	L_opN:
	case N-1:
	...
	//DISPATCH:
	//here comes trouble Nth table ... time to count overhead
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
	}//end of giant switch
}


-- 
Dmitry Olshansky
July 24, 2012
On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
> On 25-Jul-12 01:40, Walter Bright wrote:
>> On 7/24/2012 12:50 PM, Dmitry Olshansky wrote:
>>>> are the same (same as in same number of indirections).
>>>>
>>>>      switch (code[pc++])
>>>>
>>>> and
>>>>
>>>>      goto code[pc++]()
>>>>
>>>> are the same, too.
>>>
>>> It's not. Let's get to assembly then. It's an illustration as I'm no
>>> expert and
>>> may have made some illegal shortcuts in this listing.
>>>
>>> goto code[pc++]() is roughly:
>>>
>>> mov ecx, [ebx]
>>> inc ebx
>>> jmp [ecx]
>>
>>     jmp code[ecx]
>>
>
> There is no code just jump [ecx]. The offset is included already.

I'm not seeing where "code" is in the asm code you presented.

>
>>
>>> switch(code[pc++]) is:
>>>
>>> mov ecx, [ebx]
>>> inc ebx
>>> mov ecx, [edx+ecx] // assuming jump table is at edx
>>> jump [ecx]
>>
>>     jmp jumptable[ecx]
>
> Damn, it's not the same. If in the above ecx is pc++, here it it code[pc++] i.e.
> needs to be loaded. The whole point. And yes, I *didn't* object that it's still
> 1 JUMP. I'm more focused on extra work being done.

Sorry, I have no idea why it is not the same. jumptable is a static array, and so does not need to be loaded into a register. And code[] needs to be loaded from somewhere, and it looks like it's omitted from your example.



>> Please post source code example so I understand what you mean.
>
> OK. It sure gets confusing.
> Here is an article that shows the big picture of to what I want to do:
> http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf
> It contains some advanced techniques that are out of scope of current topic but
> Introduction & Background are short and full of relevant facts.
>
> And for the sample code here it is, casts are ommited for brevity.
>
> First one is "if I had a way to have enforced tail call to function or take
> address of label".
>
> Let's assume labels:
>
> //GLOBALS
> size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's DMDscript ;)
>
> bool jitted = false;
>
> void run(size_t pc)
> {
>      //so here bytecode is normal bytecode if jitted != true
>      //for simplicity sake it starts with some number e.g.:
>      // 0 - op_1
>      // 1 - op_2, etc.
>      if(!jitted)
>      {
>          int i=0;
>          //1:1 map of each label that servers specific opcode
>          static tabulated_ops[] =
>              [ &L_op1, &L_op2, &L_op3, ... ];
>
>          while(notEndOfProgram(bytecode[i]) )
>          {
>              size_t n = bytecode[i];
>              //store real target
>              bytecode[i] = tabulated_ops[n];
>              //advance by some size, skipping operands etc.
>              i += opSize(n);
>          }
>      }
>
>      //interpret:
>      goto bytecode[pc];
>      //<---- never gets here
> L_op1:
>      //do my op1 thing
>      pc += x;
> //DISPATH:
>      goto bytecode[pc]; // load address at pc & jump to it
> L_op2:
>      //do some other thing
>      pc += y;
> //DISPATH:
>
>      goto bytecode[pc]; // load address at pc & jump to it
> L_op3:
>      //my other op, jumps back
>      pc -= ...;
> //DISPATH:
>
>      goto bytecode[pc];  // load address at pc & jump to it
> ...
> L_opN:
>      ...
>      goto bytecode[pc];  // load address at pc & jump to it
> }
>
>
>
> Now the same thing with switches.
> And before you ask -
> NO! Single switch won't do as it would have successful branch prediction rate of
> as bad as about 2% (see the article!). The more opcode types the worse
> prediction is.

I see what you mean, but this could be done with final switch by the compiler, as I explained to bearophile.


>
> Anyway here it goes:
>
> //GLOBALS
> size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's DMDscript ;)
>
> void run(size_t pc)
> {
>      //here we don't JIT to real addresses beforehand
>      //as jump tables somehow should be good enough
>      // Okay...
>
>      //interpret:
>      switch(bytecode[pc])
>      {
>      L_op1:
>      case 0:
>          //do my op1 thing
>          pc += x;
>      //DISPATCH:
>      //here comes trouble - 1st of N nearly identical tables??
>          switch(bytecode[pc])
>          {
>          case 0:    goto L_op1;
>          case 1: goto L_op2;
>          ...
>          }
>      L_op2:
>      case 1:
>          //do some other thing
>          pc += y;
>      //DISPATCH:
>      //here comes trouble - 2nd table?
>          switch(bytecode[pc])
>          {
>          case 0:    goto L_op1;
>          case 1: goto L_op2;
>          ...
>          }
>      L_op3:
>      case 2:
>          //my other op, jumps back
>          pc -= ...;
>      //DISPATCH:
>      //here comes trouble - 3rd table?
>          switch(bytecode[pc])
>          {
>          case 0:    goto L_op1;
>          case 1: goto L_op2;
>          ...
>          }
> ...
>      L_opN:
>      case N-1:
>      ...
>      //DISPATCH:
>      //here comes trouble Nth table ... time to count overhead
>          switch(bytecode[pc])
>          {
>          case 0:    goto L_op1;
>          case 1: goto L_op2;
>          ...
>          }
>      }//end of giant switch
> }

I see what you mean here, too. Thanks for the explanation. It never occurred to me that one could write code like that, but I see the point, and doing jump table merging could be done fairly easily. No new language feature is required.


July 25, 2012
On 25-Jul-12 03:31, Walter Bright wrote:
> On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
>> There is no code just jump [ecx]. The offset is included already.
>
> I'm not seeing where "code" is in the asm code you presented.

> Sorry, I have no idea why it is not the same. jumptable is a static
> array, and so does not need to be loaded into a register. And code[]
> needs to be loaded from somewhere, and it looks like it's omitted from
> your example.

Code was assumed to be in ebx obviously. BTW Static array for jump table is all good and well
 but does this trick work with PIC code? And last but not least - the jump target has to be _read_ from jump table
 and then jumped to it isn't it?

OK I've taken your comments into account.
Now I think I finally got it right:

mov ecx, [ebx] ; ecx = code[pc]
inc ebx ; pc ++
jmp ecx ; goto code[pc], as ecx is already a pointer

vs

mov ecx, [ebx] ; ecx = code[pc]
inc ebx ; ; inc pc
jump jump_table[ecx]; ; switch jump to it

or in english, ommiting PC increment:
1.
read x from array
jump to it
2.
 read x from array
 read jump address from 2nd static array at offset x
 jump to it

So to sum it all up: 1 vs 2 memory accesses, same register use, same (macro) instruction count.
Still I think that not having to touch extra static table is bonus and jump jump_table[ecx] could be less efficient on some processors then plain jump ecx (need to checks this).

>> Now the same thing with switches.
>> And before you ask -
>> NO! Single switch won't do as it would have successful branch
>> prediction rate of
>> as bad as about 2% (see the article!). The more opcode types the worse
>> prediction is.
>
> I see what you mean, but this could be done with final switch by the
> compiler, as I explained to bearophile.
>
Great but I still try to show that they are less efficient, see above.


>> //GLOBALS
>> size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's
>> DMDscript ;)
>>
>> void run(size_t pc)
>> {
>>      //here we don't JIT to real addresses beforehand
>>      //as jump tables somehow should be good enough
>>      // Okay...
>>
>>      //interpret:
>>      switch(bytecode[pc])
>>      {
>>      L_op1:
>>      case 0:
[snip]
>>      L_opN:
>>      case N-1:
>>      ...
>>      //DISPATCH:
>>      //here comes trouble, Nth table ... time to count overhead
>>          switch(bytecode[pc])
>>          {
>>          case 0:    goto L_op1;
>>          case 1: goto L_op2;
>>          ...
>>          }
>>      }//end of giant switch
>> }
>
> I see what you mean here, too. Thanks for the explanation. It never
> occurred to me that one could write code like that, but I see the point,
> and doing jump table merging could be done fairly easily. No new
> language feature is required.

Superb! I actually tried the code above, generating common things with a help of string mixins, of course currently it only gets slightly slower.

Should I file an enhancement request?


-- 
Dmitry Olshansky
July 25, 2012
On 7/24/2012 10:04 PM, Dmitry Olshansky wrote:
> On 25-Jul-12 03:31, Walter Bright wrote:
>> On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
>>> There is no code just jump [ecx]. The offset is included already.
>>
>> I'm not seeing where "code" is in the asm code you presented.
>
>> Sorry, I have no idea why it is not the same. jumptable is a static
>> array, and so does not need to be loaded into a register. And code[]
>> needs to be loaded from somewhere, and it looks like it's omitted from
>> your example.
>
> Code was assumed to be in ebx obviously.

It's got to load it somehow.

> BTW Static array for jump table is all
> good and well but does this trick work with PIC code?

The jump table can be in the code segment, which is not possible for a user generated array.

> And last but not least - the jump
> target has to be _read_ from jump table
>   and then jumped to it isn't it?

And it has to be read from code[] and jumped to. No difference.


> OK I've taken your comments into account.
> Now I think I finally got it right:
>
> mov ecx, [ebx] ; ecx = code[pc]
> inc ebx ; pc ++
> jmp ecx ; goto code[pc], as ecx is already a pointer

Nope, ecx is an opcode, not a pointer. You need another indirection.

>
> vs
>
> mov ecx, [ebx] ; ecx = code[pc]
> inc ebx ; ; inc pc
> jump jump_table[ecx]; ; switch jump to it
>
> or in english, ommiting PC increment:
> 1.
> read x from array
> jump to it

It's

    pc => opcode => address

not

    pc => address

> 2.
>   read x from array
>   read jump address from 2nd static array at offset x
>   jump to it
>
> So to sum it all up: 1 vs 2 memory accesses, same register use, same (macro)
> instruction count.
> Still I think that not having to touch extra static table is bonus and jump
> jump_table[ecx] could be less efficient on some processors then plain jump ecx
> (need to checks this).

In both cases, you must pull the address out of an array and jump to it.


>>> Now the same thing with switches.
>>> And before you ask -
>>> NO! Single switch won't do as it would have successful branch
>>> prediction rate of
>>> as bad as about 2% (see the article!). The more opcode types the worse
>>> prediction is.
>>
>> I see what you mean, but this could be done with final switch by the
>> compiler, as I explained to bearophile.
>>
> Great but I still try to show that they are less efficient, see above.

No, it is the same code for each. The trouble is you're omitting one of the indirections needed for the computed goto case. You must:

   programcounter => opcode => address

2 indirections required. You keep skipping one of them!


>>> //GLOBALS
>>> size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's
>>> DMDscript ;)
>>>
>>> void run(size_t pc)
>>> {
>>>      //here we don't JIT to real addresses beforehand
>>>      //as jump tables somehow should be good enough
>>>      // Okay...
>>>
>>>      //interpret:
>>>      switch(bytecode[pc])
>>>      {
>>>      L_op1:
>>>      case 0:
> [snip]
>>>      L_opN:
>>>      case N-1:
>>>      ...
>>>      //DISPATCH:
>>>      //here comes trouble, Nth table ... time to count overhead
>>>          switch(bytecode[pc])
>>>          {
>>>          case 0:    goto L_op1;
>>>          case 1: goto L_op2;
>>>          ...
>>>          }
>>>      }//end of giant switch
>>> }
>>
>> I see what you mean here, too. Thanks for the explanation. It never
>> occurred to me that one could write code like that, but I see the point,
>> and doing jump table merging could be done fairly easily. No new
>> language feature is required.
>
> Superb! I actually tried the code above, generating common things with a help of
> string mixins, of course currently it only gets slightly slower.
>
> Should I file an enhancement request?

For the jump table merging, yes please.


July 25, 2012
On 25-Jul-12 10:11, Walter Bright wrote:
> On 7/24/2012 10:04 PM, Dmitry Olshansky wrote:
>> BTW Static array for jump table is all
>> good and well but does this trick work with PIC code?
>
> The jump table can be in the code segment, which is not possible for a
> user generated array.
>
OK, so it works - good to know.

>> And last but not least - the jump
>> target has to be _read_ from jump table
>>   and then jumped to it isn't it?
>
> And it has to be read from code[] and jumped to. No difference.

Yes, one read + one jump.

>
>> OK I've taken your comments into account.
>> Now I think I finally got it right:
>>
>> mov ecx, [ebx] ; ecx = code[pc]
>> inc ebx ; pc ++
>> jmp ecx ; goto code[pc], as ecx is already a pointer
>
> Nope, ecx is an opcode, not a pointer. You need another indirection.

Great, I think we finally got to the heart of it.
The trick is that we already pre-processed our bytecode. Now it contains real addresses. See my computed goto example again.
(even I myself made a mistake of writing [ecx] previously)

>>
>> vs
>>
>> mov ecx, [ebx] ; ecx = code[pc]
>> inc ebx ; ; inc pc
>> jump jump_table[ecx]; ; switch jump to it
>>
>> or in english, ommiting PC increment:
>> 1.
>> read x from array
>> jump to it
>
> It's
>
>      pc => opcode => address
>
> not
>
>      pc => address
>

It's pc => address because one can first preprocess all of byte code doing opcode => address rewrites. But you can't do it unless taking address of labels is possible.

> 2 indirections required. You keep skipping one of them!

That's how you make things fast! :)

>>> I see what you mean here, too. Thanks for the explanation. It never
>>> occurred to me that one could write code like that, but I see the point,
>>> and doing jump table merging could be done fairly easily. No new
>>> language feature is required.
>>
>> Superb! I actually tried the code above, generating common things with
>> a help of
>> string mixins, of course currently it only gets slightly slower.
>>
>> Should I file an enhancement request?
>
> For the jump table merging, yes please.

Done:
http://d.puremagic.com/issues/show_bug.cgi?id=8431

-- 
Dmitry Olshansky
July 25, 2012
On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
> It's pc => address because one can first preprocess all of byte code doing
> opcode => address rewrites. But you can't do it unless taking address of labels
> is possible.

All right, that's the piece that was missing.

I suppose it is possible for the compiler to recognize that the opcode=>address array is invariant, and optimize it out, but that would be a novel optimization. I don't know how hard it would be.


> Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431

Thanks!
July 25, 2012
On 25-Jul-12 11:37, Walter Bright wrote:
> On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
>> It's pc => address because one can first preprocess all of byte code
>> doing
>> opcode => address rewrites. But you can't do it unless taking address
>> of labels
>> is possible.
>
> All right, that's the piece that was missing.
>
> I suppose it is possible for the compiler to recognize that the
> opcode=>address array is invariant, and optimize it out, but that would
> be a novel optimization. I don't know how hard it would be.
>

While it may sound possible I'm certain that in most interesting cases it's not possible for compiler to do it in advance, as array of opcodes ultimately comes from some external source e.g. parsed script file.

So opcode=>address transform happens at run-time after that it indeed becomes invariant.

>
>  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431
>
> Thanks!


-- 
Dmitry Olshansky
July 25, 2012
On 25/07/12 09:37, Walter Bright wrote:
> On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
>> It's pc => address because one can first preprocess all of byte code
>> doing
>> opcode => address rewrites. But you can't do it unless taking address
>> of labels
>> is possible.
>
> All right, that's the piece that was missing.
>
> I suppose it is possible for the compiler to recognize that the
> opcode=>address array is invariant, and optimize it out, but that would
> be a novel optimization. I don't know how hard it would be.
>
>
>  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431
>
> Thanks!

Another interesting optimization with "final switch" would be if each case has about the same code length.

final switch(x) {
case C1:  ...
case C2:  ...
case C3:  ...
case C4:  ...
}
then if &(case c2) - &(case C1) == &(case C3) - &(case C2)

change it to
goto (&case C1) + x *( &(case c2) - &(case C1) );

so that there is no lookup table, just a multiply.
July 25, 2012
On 25-Jul-12 11:51, Don Clugston wrote:
> On 25/07/12 09:37, Walter Bright wrote:
>> On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
>>> It's pc => address because one can first preprocess all of byte code
>>> doing
>>> opcode => address rewrites. But you can't do it unless taking address
>>> of labels
>>> is possible.
>>
>> All right, that's the piece that was missing.
>>
>> I suppose it is possible for the compiler to recognize that the
>> opcode=>address array is invariant, and optimize it out, but that would
>> be a novel optimization. I don't know how hard it would be.
>>
>>
>>  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431
>>
>> Thanks!
>
> Another interesting optimization with "final switch" would be if each
> case has about the same code length.
>
> final switch(x) {
> case C1:  ...
> case C2:  ...
> case C3:  ...
> case C4:  ...
> }
> then if &(case c2) - &(case C1) == &(case C3) - &(case C2)
>
> change it to
> goto (&case C1) + x *( &(case c2) - &(case C1) );
>
> so that there is no lookup table, just a multiply.

Could be interesting if some other simple progressions could be hardcoded, if say branches are be sorted by length. Also modern CPU seem to be exceptionally fast at skipping NOPs so a bit of padding won't hurt.

-- 
Dmitry Olshansky