Thread overview
alignment of automatic variables in DMC
Feb 05, 2002
Laurentiu Pancescu
Feb 06, 2002
Walter
Feb 06, 2002
Walter
Feb 06, 2002
Laurentiu Pancescu
Feb 06, 2002
Walter
Feb 06, 2002
Jan Knepper
Feb 07, 2002
Heinz Saathoff
Feb 07, 2002
Walter
Feb 07, 2002
Roland
Feb 08, 2002
Laurentiu Pancescu
Feb 08, 2002
Walter
Feb 08, 2002
Laurentiu Pancescu
Feb 08, 2002
Walter
Feb 08, 2002
Laurentiu Pancescu
Feb 07, 2002
Roland
Feb 07, 2002
Walter
Feb 07, 2002
Roland
Feb 07, 2002
Walter
February 05, 2002
Here's a test case:

/* test.c */
#include <stdio.h>
#include <time.h>

int main( int argc, char *argv[] )
{
  int i;
  double x, y, z;
  clock_t now;

  /* %p expects a void *, so cast each address explicitly */
  printf("i@%p, x@%p, y@%p, z@%p\n",
         (void *)&i, (void *)&x, (void *)&y, (void *)&z);
  now = clock();
  z = 0;
  /* timing loop: works on the doubles whose alignment is under test */
  for( i = 1; i < 200000000; i++ ) {
    x = i - 1;
    y = x - 1;
    y = x * y;
    z += y;
  }
  printf("%g\n", z );
  printf("elapsed time: %g\n", (double)(clock() - now) / CLOCKS_PER_SEC);
  return 0;
}

If compiled with -o+all, the double variables are aligned on a 4-byte boundary, while -o+space aligns them on an 8-byte boundary, which gives significantly better performance (just try it!).  A workaround is to declare the int *after* the doubles and still compile with -o+all.  This trick doesn't work with BCC, because it thinks it knows better and rearranges the order of variables on the stack, so AFAIK you can't avoid the performance loss with BCC 5.5.1.
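
To check where a variable actually lands, the address can be reduced modulo 8 instead of reading the `%p` output by eye. A minimal sketch: the helper name `misalignment` is made up for illustration, `<stdint.h>` needs a C99-capable compiler, and the pointer-to-integer conversion is implementation-defined (on flat 32-bit x86 it should yield the plain address).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: distance of p from the previous `boundary`-byte
   boundary; 0 means p is aligned. boundary must be a power of two. */
size_t misalignment(const void *p, size_t boundary)
{
    return (size_t)((uintptr_t)p & (boundary - 1));
}
```

Inside main() of test.c, `printf("x %% 8 = %u\n", (unsigned)misalignment(&x, 8));` would print 0 for an 8-byte-aligned x and 4 when x only sits on a 4-byte boundary.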

GCC seems to align almost anything, including char[] arrays, on 8- or 16-byte boundaries, so it avoids this penalty.  If you have gcc, use "-O9 -funroll-loops -mcpu=pentiumpro" to compare the speed.

I think it would be very nice if DMC got smarter about this (I use an AMD Athlon; are other x86 processors less sensitive to this?), but that's up to Walter, isn't it?

Laurentiu


February 06, 2002
Thanks for tracking this down! I'll definitely look into a fix. -Walter


February 06, 2002
Interestingly, this makes a 3:1 difference in speed on my machine. The problem, however, is that it's not related to optimization. It's just the lay of how things wind up on the stack. The calling conventions specify a 4-byte aligned stack. I don't see at the moment how dynamically adjusting it to 8 bytes within a function is going to work.

-Walter


February 06, 2002
The speedup is about the same factor on my Athlon (execution time 14 seconds, as opposed to 4), and since I saw -o+space align auto variables on 8 bytes in the 2 programs that I used for testing, I assumed it was no coincidence.

I'm not quite sure what you mean by "dynamically adjusting the stack to 8 bytes", so I'm sorry if the following doesn't match the *real* meaning of your message.

GCC doesn't seem to do any special handling inside the stack frame code, so I guess it knows it starts with an aligned stack, and manages to keep that alignment somehow (maybe it adds unused bytes in every function call, so any called function also starts with an aligned stack?).  Doing this might break compatibility with other people's ABI... I don't know exactly, but it doesn't sound like a good solution for DMC.

What I propose is to dynamically adjust the stack in each function, like in the following example, written in NASM (sorry, I'm pretty bad at MASM/TASM syntax):

segment test public use32 class=CODE

; int test(int x)
; {
;   int t;
;   double a, b;
;   t = x + x;
;   return t;
; }

global _test
_test:
        push ebp                        ; save EBP, since we use it for
        mov ebp, esp                    ; accessing local parameters
        and esp, 0xFFFFFFF8             ; align the stack at 8 byte boundary
                                        ; (ESP normally decreases, so this
                                        ; is okay)
        add esp, -24                    ; reserve space for local vars
                                        ; (compiler rearranges vars: doubles
                                        ; first, then the int, referring to an
                                        ; hypothetical push order):
                                        ; - a @ [ESP + 16]
                                        ; - b @ [ESP + 8]
                                        ; - t @ [ESP + 0] (4 bytes needed,
                                        ; just alignment demo)
        mov eax, [ebp + 8]              ; EAX <- local param 'x'
        add eax, eax                    ; calculate value for 'x + x'
        mov [esp], eax                  ; 't' <- EAX
        mov esp, ebp                    ; restore the value that ESP had
                                        ; after EBP was pushed, but *before*
                                        ; alignment
        pop ebp                         ; restore EBP (LEAVE also works, but
                                        ; like this is clearer)
        retn                            ; return value is in EAX, as normal

I hope that your news client won't ruin my nice NASM code formatting... :)

I think this approach is relatively inexpensive, and it allows the compiler to align local variables properly, since it knows it always starts with an 8-byte aligned stack (not true for parameters if you're called from some non-DMC code, but oh well!).  Even better, DMC could use a normal stack frame for static functions, since they can only be called from the same module, and all functions ensure that the stack is 8-byte aligned before they call any other function.  What do you think?
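
The effect of that prologue can be checked with plain integer arithmetic. The sketch below replays its stack-pointer steps in C for an arbitrary entry value of ESP. `struct frame` and `simulate_prologue` are names invented here, and the layout (a at ESP+16, b at ESP+8, t at ESP+0) is simply the one from the comments in the listing.

```c
#include <stdint.h>

/* Hypothetical record of register values and local addresses
   once the prologue has run. */
struct frame {
    uint32_t ebp, esp;   /* register values after the prologue */
    uint32_t a, b, t;    /* addresses of the locals */
};

/* entry_esp: ESP on entry, i.e. pointing at the return address. */
struct frame simulate_prologue(uint32_t entry_esp)
{
    struct frame f;
    uint32_t esp = entry_esp;

    esp -= 4;                 /* push ebp */
    f.ebp = esp;              /* mov ebp, esp */
    esp &= 0xFFFFFFF8u;       /* and esp, 0xFFFFFFF8 */
    esp -= 24;                /* add esp, -24 */

    f.esp = esp;
    f.a = esp + 16;
    f.b = esp + 8;
    f.t = esp;
    return f;
}
```

Whatever the entry value, f.a % 8 and f.b % 8 come out as 0, while f.ebp may or may not be a multiple of 8.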

Laurentiu


February 06, 2002
The trouble is, if I align ESP, then the function can't access the passed parameters any more with a fixed ESP offset. What you're doing is accessing the parameters with EBP, and the locals with ESP. I'd thought of that, too, but it's a significant recoding of the code generator. -Walter
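
To put numbers on this: with the prologue from the previous message (push ebp, mov ebp, esp, and esp, 0xFFFFFFF8, add esp, -24), the first parameter is always at EBP+8, but its distance from ESP depends on how many bytes the AND dropped. A small C sketch of that arithmetic, with a made-up function name:

```c
#include <stdint.h>

/* Hypothetical helper: offset of the first parameter from ESP after
   the aligning prologue; entry_esp points at the return address. */
uint32_t first_param_offset_from_esp(uint32_t entry_esp)
{
    uint32_t ebp = entry_esp - 4;             /* push ebp; mov ebp, esp */
    uint32_t esp = (ebp & 0xFFFFFFF8u) - 24;  /* and esp, ...; add esp, -24 */

    return (ebp + 8) - esp;                   /* parameter lives at EBP+8 */
}
```

An entry ESP of 0x1000 gives 36, while 0x1004 gives 32, so no single ESP-relative offset reaches the parameter in both cases.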


February 06, 2002
> The trouble is, if I align ESP, then the function can't access the passed parameters any more with a fixed ESP offset. What you're doing is accessing the parameters with EBP, and the locals with ESP. I'd thought of that, too, but it's a significant recoding of the code generator. -Walter

Nevertheless, it sounds like something you would do anyway...

Jan


February 07, 2002
I did some investigating. GCC does some fiddling so that each function starts out with an aligned stack. This option will be a bit clumsy for DMC, since I don't have control over the function calling conventions. After spending several hours not being able to get it out of my mind <g>, I figured out a way to do it that has almost no impact on generated code. I can hide nearly all the stack adjustments in code that already adds/subtracts from ESP so that once the stack is 8 byte aligned, it stays that way.

Unfortunately, this doesn't work for parameters, i.e. if you call with (double x, int y, double z) they're not going to be aligned. It also doesn't work if some foreign code calls you with a misaligned stack. Oh well. I'll email you the fix so you can try it out (it happens with -o or -o+speed).
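
The parameter case can be worked out by hand. Assuming the usual 32-bit cdecl layout (arguments in 4-byte slots, the first one at EBP+8 once a standard frame is set up), the sketch below computes each parameter's offset. `param_offsets` is a hypothetical helper, not anything from DMC.

```c
#include <stddef.h>

/* Hypothetical helper: fill offsets[i] with the EBP-relative offset of
   parameter i, given each parameter's size in bytes. Assumes 4-byte
   stack slots and the first parameter at EBP+8. */
void param_offsets(const size_t *sizes, size_t n, size_t *offsets)
{
    size_t off = 8;   /* skip saved EBP and return address */
    size_t i;

    for (i = 0; i < n; i++) {
        offsets[i] = off;
        off += (sizes[i] + 3) & ~(size_t)3;   /* round up to a 4-byte slot */
    }
}
```

For (double x, int y, double z) the sizes {8, 4, 8} give offsets 8, 16 and 20. Since 8 and 20 differ by 12, x and z can never both sit on an 8-byte boundary, whatever value EBP has.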


February 07, 2002
Walter wrote...
> The trouble is, if I align ESP, then the function can't access the passed parameters any more with a fixed ESP offset. What you're doing is accessing the parameters with EBP, and the locals with ESP. I'd thought of that, too, but it's a significant recoding of the code generator. -Walter

Maybe it's not necessary to adjust EBP or ESP when you know that ESP is
aligned to 8 at startup. The calling function must pass the parameters
aligned and call the function (now the stack is only 4-byte aligned); the
callee creates the stack frame by pushing the old EBP (now the stack is
aligned to 8 again). Now make sure that every auto variable is aligned
to 8. That's it!
Or have I missed a point?
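
That bookkeeping is easy to check with integers. A sketch in C, assuming a 4-byte return address and a 4-byte saved EBP; the function name is invented for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: ESP after the caller has pushed arg_bytes of
   parameters, CALL has pushed the return address, and the callee has
   pushed EBP. */
uint32_t esp_after_frame_setup(uint32_t caller_esp, uint32_t arg_bytes)
{
    uint32_t esp = caller_esp;

    esp -= arg_bytes;   /* caller pushes the parameters */
    esp -= 4;           /* call pushes the return address */
    esp -= 4;           /* push ebp */
    return esp;
}
```

Starting from 0x1000 with 16 bytes of parameters the result is 0xFE8, a multiple of 8; with 12 bytes it is 0xFEC, which is not. So the scheme holds only if the caller keeps the parameter block a multiple of 8 bytes.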

Regards,
	Heinz


February 07, 2002
Laurentiu Pancescu wrote:

> GCC doesn't seem to do any special handling inside the stack frame code, so I guess it knows it starts with an aligned stack, and manages to keep that alignment somehow (maybe it adds unused bytes in every function call, so any called function also starts with an aligned stack?).  Doing this might break compatibility with other people's ABI... I don't know exactly, but it doesn't sound like a good solution for DMC.
>

Why not?
If the stack starts out aligned, just make sure it stays that way.
The compiler can help:
- for parameters: if the total size of the parameters is not a multiple of 4
(or 8), it can push some dummy bytes so that the stack stays aligned.
Unaligned parameters can be slow to access, but at least the stack is aligned
at function entry. For the Pascal calling convention, the compiler still has
to remove the dummy bytes with add esp.
- for local data, it is the same.
We could even imagine aligning all parameters (push 7 dummy bytes and one
significant byte for a char parameter).
The problem is compatibility with other modules linked with DMC code.
The optimizer can do this only for functions in the same module as the one
currently being compiled.
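
The number of dummy bytes is a one-line computation. A sketch in C, with a made-up function name, for an alignment that is a power of two:

```c
#include <stddef.h>

/* Hypothetical helper: dummy bytes to push so that arg_bytes plus the
   padding is a multiple of align (align must be a power of two). */
size_t dummy_bytes(size_t arg_bytes, size_t align)
{
    return (align - (arg_bytes & (align - 1))) & (align - 1);
}
```

dummy_bytes(20, 8) is 4 and dummy_bytes(16, 8) is 0; a single char parameter gives dummy_bytes(1, 8) == 7, which is the "7 dummy bytes" case.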

> What I propose is to dynamically adjust the stack in each function, like in the following example, written in NASM (sorry, I'm pretty bad at MASM/TASM syntax):
>

That seems to me like "a plaster on a wooden leg".

Roland


February 07, 2002
Walter wrote:

> Unfortunately, this doesn't work for parameters, i.e. if you call with (double x, int y, double z) they're not going to be aligned. It also doesn't work if some foreign code calls you with a misaligned stack. Oh well. I'll email you the fix so you can try it out (it happens with -o or -o+speed).
>

Can I try it too?
(Is it complicated? I use the IDDE!)

Roland

