Thread overview
D calling conventions
Sep 09, 2003
Mike Wynn
Sep 09, 2003
Walter
Sep 09, 2003
Ilya Minkov
Sep 10, 2003
Mike Wynn
Sep 10, 2003
Mike Wynn
Sep 10, 2003
Walter
Sep 12, 2003
Mike Wynn
Sep 12, 2003
Walter
Sep 12, 2003
Sean L. Palmer
September 09, 2003
Walter,

It appears that the D calling convention (even on Linux) is stdcall
(last param pushed first, callee cleans up).
Is there a reason for this? Personally I think stdcall/pascal are a silly way to pass params; the caller should clean up the stack it allocates, which makes code more robust.
I've heard people say that stdcall is more efficient on x86, but I don't see it myself: with callee cleanup you cannot optimise calls within loops.
e.g.
for( int i = 0; i < someval; i++ ) {
	int b = 9*i;
	func( b, i, other, 50 );
}

can become
i :=0 ;
fr = create frame for func;
fr[2] = other;
fr[3] = 50;
jump check;
loop:
 fr[0] = 9*i;
 fr[1] = i;
 call func;
 i :=i+1;
check:
 if i < someval jump loop;
remove frame fr;

In fact, in this case fr[1] can be 'i' itself.
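
The loop transformation above can be sketched in plain C. This is only a simulation under stated assumptions, not compiler output: the argument frame is modelled as an array owned by the caller, `func`, `call_naive` and `call_hoisted` are hypothetical names, and the callee is assumed never to write to its argument slots (a real compiler would have to prove that before reusing the frame).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callee: reads its four arguments from a caller-owned frame
   and never modifies them. */
static long func(const int *fr) {
    return (long)fr[0] + fr[1] + fr[2] + fr[3];
}

/* Naive scheme: every iteration writes all four argument slots,
   the analogue of four pushes per call. */
long call_naive(int someval, int other, size_t *stores) {
    assert(someval >= 0);
    long total = 0;
    *stores = 0;
    for (int i = 0; i < someval; i++) {
        int fr[4];
        fr[0] = 9 * i;
        fr[1] = i;
        fr[2] = other;
        fr[3] = 50;
        *stores += 4;
        total += func(fr);
    }
    return total;
}

/* Hoisted scheme: the frame is created once and the loop-invariant
   slots (other, 50) are written once, before the loop. */
long call_hoisted(int someval, int other, size_t *stores) {
    assert(someval >= 0);
    long total = 0;
    int fr[4];
    fr[2] = other;
    fr[3] = 50;
    *stores = 2;
    for (int i = 0; i < someval; i++) {
        fr[0] = 9 * i;
        fr[1] = i;
        *stores += 2;
        total += func(fr);
    }
    return total;
}
```

Both functions compute the same total; the hoisted version performs 2 + 2*someval argument stores instead of 4*someval.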

September 09, 2003
"Mike Wynn" <mike@l8night.co.uk> wrote in message news:bjl8rq$2ccr$1@digitaldaemon.com...
> Walter,
>
> it appears that the D calling convention (even on linux) is stdcall
> (last param pushed first, callee cleans up)

No, as the last parameter is passed in a register.

> is there a reason for this, personally I think stdcall/pascal are a silly way to pass params, the caller should clean up the stack they allocate, makes code more robust

It's smaller code.

> I've heard ppl say that stdcall is more efficient on x86, but don't see
> it myself, you can not optimise calls within loops.
> e.g
> for( int i = 0; i < someval; i++ ) {
> int b = 9*i;
> func( b, i, other, 50 );
> }
>
> can become
> i :=0 ;
> fr = create frame for func;
> fr[2] = other;
> fr[3] = 50;
> jump check;
> loop:
>   fr[0] = 9*i;
>   fr[1] = i;
>   call func;
>   i :=i+1;
> check:
>   if i < someval jump loop;
> remove frame fr;
>
> infact in this case fr[1] can be 'i'

Those kinds of optimizations are possible, and if done, would make the caller cleanup superior. But my code generator doesn't do them :-(


September 09, 2003
Walter wrote:

> Those kinds of optimizations are possible, and if done, would make the
> caller cleanup superior. But my code generator doesn't do them :-(

Is there any out there which does?

-eye

September 10, 2003
Walter wrote:
> "Mike Wynn" <mike@l8night.co.uk> wrote in message
> news:bjl8rq$2ccr$1@digitaldaemon.com...
> 
>>Walter,
>>
>>it appears that the D calling convention (even on linux) is stdcall
>>(last param pushed first, callee cleans up)
> 
> 
> No, as the last parameter is passed in a register.

I assume you mean the last to be pushed, i.e. the first parameter,
as in func( int a, int b ): a in a register, b on the stack
(so for member functions "this" is in a register).

> 
> 
>>is there a reason for this, personally I think stdcall/pascal are a
>>silly way to pass params, the caller should clean up the stack they
>>allocate, makes code more robust
> 
> 
> It's smaller code.

so you save a few sub esp's

With caller cleanup, you know how many locals and how much parameter space the function will require at most, so you only need to allocate once:
push ebp;
mov ebp, esp;
sub esp, #4*(locals+max params)

params at [esp + (4*param number)]
locals at [esp + (4*(max param number+local num))] // locals 0..m
or [ebp - 4*local] //locals numbered 1..n (m=n-1)

mov esp, ebp; pop ebp; ret;

Or, to avoid ever having to push/pop (I believe this is then pairable):
mov [esp-4], ebp;
mov ebp, esp;
sub esp, #4*(1+max params+max locals)
...
params at [esp + (4*param number)]
locals at [esp + (4*(max param number+local num))] // locals 0..m
or [ebp - 4*local] //locals numbered 2..n+1 (m=n-1)
....
mov esp, ebp; mov ebp, [ebp-4]; ret;

Is it not quicker to do

sub esp, #12
mov [esp+8], eax;
mov [esp+4], ebx;
mov [esp], ecx;

than
push eax;
push ebx;
push ecx;

Or can the Pentium pair pushes?
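
The single-allocation layout described above (first variant, with ebp pushed separately) can be written down as a few offset calculations. This is a minimal sketch, assuming 4-byte slots and 0-based slot numbers; `FrameLayout` and the function names are made up for illustration.

```c
#include <assert.h>

enum { SLOT = 4 };  /* bytes per stack slot on 32-bit x86 */

typedef struct {
    int max_params;  /* largest outgoing argument count of any call made */
    int locals;      /* number of local slots */
} FrameLayout;

/* Bytes subtracted from esp, once, in the prologue. */
int frame_size(FrameLayout f) {
    assert(f.max_params >= 0 && f.locals >= 0);
    return SLOT * (f.locals + f.max_params);
}

/* esp-relative offset of outgoing argument slot n (0-based). */
int param_offset(FrameLayout f, int n) {
    assert(n >= 0 && n < f.max_params);
    return SLOT * n;
}

/* esp-relative offset of local slot n (0-based); locals sit above
   the outgoing argument area. */
int local_offset(FrameLayout f, int n) {
    assert(n >= 0 && n < f.locals);
    return SLOT * (f.max_params + n);
}
```

For a function with 3 locals that passes at most 4 arguments, this gives a 28-byte frame with arguments at [esp+0]..[esp+12] and locals at [esp+16]..[esp+24].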


> 
> 
>>I've heard ppl say that stdcall is more efficient on x86, but don't see
>>it myself, you can not optimise calls within loops.
>>e.g
>>for( int i = 0; i < someval; i++ ) {
>>int b = 9*i;
>>func( b, i, other, 50 );
>>}
>>
>>can become
>>i :=0 ;
>>fr = create frame for func;
>>fr[2] = other;
>>fr[3] = 50;
>>jump check;
>>loop:
>>  fr[0] = 9*i;
>>  fr[1] = i;
>>  call func;
>>  i :=i+1;
>>check:
>>  if i < someval jump loop;
>>remove frame fr;
>>
>>infact in this case fr[1] can be 'i'
> 
> 
> Those kinds of optimizations are possible, and if done, would make the
> caller cleanup superior. But my code generator doesn't do them :-(
> 
> 

September 10, 2003
Ilya Minkov wrote:
> Walter wrote:
> 
>> Those kinds of optimizations are possible, and if done, would make the
>> caller cleanup superior. But my code generator doesn't do them :-(
> 
> 
> Is there any out there which does?
> 
> -eye
> 
I thought gcc 3.2.x did ...
Obviously not: it compiles the loop into
push *4
call
reset esp
jump round loop again.

September 10, 2003
"Mike Wynn" <mike@l8night.co.uk> wrote in message news:bjlq0c$2ts$1@digitaldaemon.com...
> > It's smaller code.
> so you save a few sub esp's

Yup, times thousands of function calls <g>.
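
A rough back-of-the-envelope version of the size argument, as a sketch only. It assumes the usual 32-bit encodings (`ret` is 1 byte, `ret imm16` is 3 bytes, `add esp, imm8` is 3 bytes), one return instruction per function, and that every call site needs a cleanup instruction under caller cleanup; the function names are hypothetical.

```c
#include <assert.h>

enum {
    RET_BYTES     = 1,  /* ret            (C3)       */
    RET_IMM_BYTES = 3,  /* ret imm16      (C2 iw)    */
    ADD_ESP_BYTES = 3   /* add esp, imm8  (83 C4 ib) */
};

/* Cleanup-related bytes when each call site pops its own arguments. */
long caller_cleanup_bytes(long functions, long call_sites) {
    assert(functions >= 0 && call_sites >= 0);
    return functions * RET_BYTES + call_sites * ADD_ESP_BYTES;
}

/* Cleanup-related bytes when the callee pops the arguments with ret N. */
long callee_cleanup_bytes(long functions, long call_sites) {
    assert(functions >= 0 && call_sites >= 0);
    (void)call_sites;  /* call sites need no cleanup instruction */
    return functions * RET_IMM_BYTES;
}
```

With 100 functions and 5000 call sites this comes to 15100 bytes for caller cleanup against 300 bytes for callee cleanup, since call sites usually far outnumber functions.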

> or can Pentium pair pushes ??

Which works out faster flip-flops back and forth on successive Intel chip architectures :-(


September 12, 2003
Walter wrote:
> "Mike Wynn" <mike@l8night.co.uk> wrote in message
> news:bjlq0c$2ts$1@digitaldaemon.com...
> 
>>>It's smaller code.
>>
>>so you save a few sub esp's
> 
> 
> Yup, times thousands of function calls <g>.

From some basic tests I've been doing, it appears that
esp:=esp-N;esp[0] := a;esp[1] := b;esp[2] := c
call X
esp:=esp+N (can be delayed, i.e. lazy frame removal)
is slightly faster for C calls,
but push/pop is faster for D calls.

Interestingly, D with C calls is faster than gcc 3.2.2 :)
and there is little difference between D and C calls except in a few odd cases (I have not tried method calls, as I can't use C parameter passing with dmd).

interestingly
int sum( int a, int b, int c ) { return a+b+c; }
is much slower than
int sum( int a, int b, int c ) { return c+b+a; }
The compiler uses the fact that c is in eax, so although it creates a frame it does not have to store eax only to pull it back.

One serious speed-up would be the removal of leaf function frames.
In the same time it takes to do
push ebp;

you can do
mov [esp-4], ebx
mov [esp-8], esi

As it's a leaf function, [esp-N] can be used for locals and saved regs without moving esp, and there is no need to change ebp.
Also, as the GC is pausing, it's not a problem having objects beyond esp. First, it's a leaf func so it can't call new, and if new was inlined (making the function a leaf) or it manipulates objects on the heap, the GC will not be called until after the return. Most concurrent collectors have to wait to "catch" the thread as it returns, or on a backwards branch. In the former case there is no problem; in the latter, code would be put in on the backwards branch, and this could do the movement of esp etc.

I believe this would speed up all those small member functions by a huge amount (as ebx, esi and edi can all be stored very cheaply); chances are you don't even need extra locals.

As an aside, I know eax is "this", but would it not make more sense to use
a saved reg instead (i.e. "this" in ebx or edi)? That way non-leaf member functions would not have to save "this" in order to call their own methods that have return values.

> 
> 
>>or can Pentium pair pushes ??
> 
> 
> Which works out faster flip-flops back and forth on successive Intel chip
> architectures :-(
> 
> 

September 12, 2003
"Mike Wynn" <mike@l8night.co.uk> wrote in message news:bjr3e7$1c9u$1@digitaldaemon.com...
> one seriour speed up would be the removal of leaf function frames
> in the same time it takes to do
> push ebp;
>
> you can do
> mov ebx, [esp-4]
> mov esi, [esp-8]
>
> as its a leaf function [esp-N] can be used for locals and saved reg's
> with out moving esp and there is no need to change ebp
> also as GC is pausing its not a problem having objects beyond esp, first
> it's a leaf func so can't call new, and if new was inlined making the
> function a leaf or it manipulates objects on the heap the gc wil not be
> called until after the return. most concurrent collectors have to wait
> to "catch" the thread as they return, or on backwards branch. in the
> former no problem, in the latter code would be put in on the backwards
> branch, this could do the movement of esp etc.
>
> I believe this would spped up all those small member functionsby a huge amount, (as ebx,esi,edi can all be stored very cheaply) chances are you don't even need extra locals.

I wish I could spend more time on the cg and implement some of these great ideas. Unfortunately, for now all I can do is just fix bugs in it.


September 12, 2003
You'd better be very careful about leaving your stack frame unprotected (not adjusting esp) in an environment where interrupts can happen that use the same stack (e.g. DOS, or Win32 ring 0, say, driver or kernel level).

An interrupt can come along, start using the stack right below esp, and if your proggy stored some stuff there it will be trashed.  These kinds of bugs are really hard to track down.  This bit me on the Xbox when using an Intel-supplied _ftol replacement.  ;)

Sean
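
The hazard can be shown with a toy model in C. This is a simulation, not real interrupt handling: the stack is an array that grows downward, `fake_interrupt` stands in for a handler that pushes state right below the stack pointer, and all names are invented for the example.

```c
#include <assert.h>
#include <string.h>

enum { STACK_WORDS = 16 };

typedef struct {
    int mem[STACK_WORDS];
    int sp;  /* index of the current top of stack; the stack grows downward */
} Stack;

void stack_init(Stack *s) {
    memset(s->mem, 0, sizeof s->mem);
    s->sp = STACK_WORDS;
}

/* Stand-in for an interrupt handler sharing the stack: it writes two
   words just below sp and leaves sp unchanged on return. */
void fake_interrupt(Stack *s) {
    assert(s->sp >= 2);
    s->mem[s->sp - 1] = 0xDEAD;
    s->mem[s->sp - 2] = 0xBEEF;
}

/* Leaf function that stores below sp without moving it,
   like mov [esp-4], reg. The interrupt overwrites the value. */
int leaf_unprotected(Stack *s, int value) {
    assert(s->sp >= 2);
    s->mem[s->sp - 1] = value;
    fake_interrupt(s);
    return s->mem[s->sp - 1];
}

/* Leaf function that reserves the slot first, like sub esp, 4.
   The interrupt now writes below the reserved slot. */
int leaf_protected(Stack *s, int value) {
    assert(s->sp >= 3);
    s->sp -= 1;
    s->mem[s->sp] = value;
    fake_interrupt(s);
    int result = s->mem[s->sp];
    s->sp += 1;
    return result;
}
```

In this model `leaf_unprotected` hands back the handler's garbage instead of the stored value, while `leaf_protected` returns what was stored.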

"Mike Wynn" <mike@l8night.co.uk> wrote in message news:bjr3e7$1c9u$1@digitaldaemon.com...
> from some basic tests I've been doing it appear that
> esp:=esp-N;esp[0] := a;esp[1] := b;esp[2] := c
> call X
> esp:=esp+N (can be delayed i.e. lazy frame removal)
> is slightly faster for C calls
> but push/pop faster for D calls
>
> interestingly D with C calls is faster than gcc 3.2.2 :)
> and there is little difference D or C except in a few odd cases (not
> tried method calls as I can't do C param with dmd)
>
> interestingly
> int sum( int a, int b, int c ) { return a+b+c; }
> is much slower than
> int sum( int a, int b, int c ) { return c+b+a; }
> the compiler uses the fact c is in eax and although it creates a frame
> it does not have to store eax only to pull it back.
>
> one seriour speed up would be the removal of leaf function frames
> in the same time it takes to do
> push ebp;
>
> you can do
> mov ebx, [esp-4]
> mov esi, [esp-8]
>
> as its a leaf function [esp-N] can be used for locals and saved reg's
> with out moving esp and there is no need to change ebp
> also as GC is pausing its not a problem having objects beyond esp, first
> it's a leaf func so can't call new, and if new was inlined making the
> function a leaf or it manipulates objects on the heap the gc wil not be
> called until after the return. most concurrent collectors have to wait
> to "catch" the thread as they return, or on backwards branch. in the
> former no problem, in the latter code would be put in on the backwards
> branch, this could do the movement of esp etc.