February 04, 2002
"Laurentiu Pancescu" <lpancescu@fastmail.fm> wrote in message news:a3jgvm$2u58$2@digitaldaemon.com...
>
> "Walter" <walter@digitalmars.com> wrote in message news:a3hkt8$232q$2@digitaldaemon.com...
> > Your solution is now in the FAQ! Thanks, -Walter
>
> Great, thanks!  And I'm also glad that my MMX code now works fine with DMC.  However, I notice that the performance of my loop is about 20% lower than in the Borland or gcc cases (no external calls, only MOVQ, PXOR, and POR!).  I would expect this to be the same under any compiler, since none of them touches the assembly.  I then tried to force my assembly function onto a paragraph boundary, but that only made things about 10% worse - I guess OPTLINK knows better about alignments... :)

Alignment is probably the issue. Try putting in NOPs one at a time before your loop, and time each run.
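
(A minimal C timing driver makes this suggestion easy to carry out. This is just a sketch: mmx_loop is a hypothetical name standing in for the external NASM routine, and the driver must be linked against the assembled object file.)

#include <stdio.h>
#include <time.h>

/* Hypothetical prototype for the external NASM routine under test. */
extern void mmx_loop(unsigned char *buf, unsigned len);

int main(void)
{
    static unsigned char buf[4096];
    clock_t start, stop;
    unsigned i;

    start = clock();
    for (i = 0; i < 100000U; i++)       /* repeat enough to be measurable */
        mmx_loop(buf, sizeof(buf));
    stop = clock();

    printf("%.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}

Rebuild after each added NOP and compare the printed times; a sudden jump or drop points at an alignment-sensitive spot.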

> Is it possible that the way different runtime libraries initialize the FPU affects MMX performance (since MMX and FPU instructions use the same physical registers)?  There's also a slight difference, about 2-3%, between the Borland- and gcc-generated EXEs - I don't see another explanation.

I can't imagine how that would affect things. If it does, please let me know!



February 04, 2002
"Walter" <walter@digitalmars.com> wrote in message news:a3knen$dig$1@digitaldaemon.com...
>> I notice that the performance of my loop is about 20% lower than in the Borland or gcc cases (no external calls, only MOVQ, PXOR, and POR).
> Alignment is probably the issue. Try putting in NOPs one at a time before your loop, and time each run.

I did some testing, with very interesting results: when I compiled the C source files with -o+space, the C code's performance dropped slightly, but the MMX loop's performance was the same as in the EXEs generated by BCC or gcc (even slightly better).  I'm really confused by this, since NASM assembles my MMX loop the same way every time, and I invoked OPTLINK directly, so it knows nothing about any request for space optimization (just in case it cares about SC's -o+space).  What's more, I had gotten used to the corresponding DOSX program, built from the same source, running about 5-10% slower than its Win32 counterpart - but with -o+space it now runs faster!

I also ran another test, using a source file with a simple C loop that I saw on one of BCC's newsgroups some months ago:
- -o, -o+speed, -o+all: execution time is 13 seconds
- no optimization flags specified: execution time is 4 seconds
- -o+space: execution time is 3 seconds

I thought -o+all was *always* the best choice, but that proves not to be the case...  I can send you the sources for those two tests if you want - perhaps they could help improve the optimizer?
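
(The BCC newsgroup source isn't reproduced in the thread. The loop below is purely illustrative - not the original - but a benchmark of this shape is all the test amounts to: compile the same file once per flag set and compare the reported times.)

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* volatile keeps the optimizer from deleting the loop outright */
    volatile unsigned long sum = 0;
    long i, j;
    clock_t start = clock();

    for (i = 0; i < 20000; i++)
        for (j = 0; j < 20000; j++)
            sum += j;

    printf("sum = %lu, time = %.2f s\n", (unsigned long)sum,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}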

Laurentiu


February 05, 2002
Since you said the critical loop is in the assembler code, it cannot be the optimizer. The optimizer does not affect the assembler. I bet it's alignment. Try the NOP suggestion. -Walter

"Laurentiu Pancescu" <lpancescu@fastmail.fm> wrote in message news:a3mo9g$1fp3$1@digitaldaemon.com...
>
> "Walter" <walter@digitalmars.com> wrote in message news:a3knen$dig$1@digitaldaemon.com...
> >> I notice that the performance of my loop is about 20%
> >> weaker than in the Borland or gcc cases (no external calls, only MOVQ,
> PXOR and
> >> POR).
> > Alignment probably is the issue. Try putting in NOPs one at a time
before
> > your loop, and time each time.
>
> I did some testing, with very interesting results: when I
specified -o+space
> for the compiling of the C source files, the C code performance dropped slightly, but the MMX loop performance is the same as in the EXEs
generated
> by BCC or gcc (even slightly better).  I'm really confused about this,
since
> NASM handles my MMX loop in the same way each time, and I called OPTLINK directly, so that it doesn't know about requirements to do space optimization (just in case it cares about SC's -o+space).  Even more, I
got
> used to the fact that the corresponding DOSX program, compiled from the
same
> source, runs about 5-10% slower than its Win32 counterpart, but now, with -o+space, it runs faster!!!
>
> I also did another test, using a source with a simple C loop, seen on one
of
> BCC's newsgroups some months ago:
> - -o, -o+speed, -o+all: execution time is 13 seconds
> - no optimization flags specified: execution time is 4 seconds
> - -o+space: execution time is 3 seconds
>
> I thought -o+all is *always* the best to use, but it proves not to be the case...  I can send you the sources for those two tests, if you want - perhaps it could help improving the optimizer?
>
> Laurentiu
>
>


February 05, 2002
Walter wrote...
> 
> Since you said the critical loop is in the assembler code, it cannot be the optimizer. The optimizer does not affect the assembler. I bet it's alignment. Try the NOP suggestion. -Walter

Right. Depending on the starting address of the critical code, it might fit neatly into the processor cache in one case and not in the other. Optimization can shift the assembly part to a base address that is not optimal for caching.

Just a guess,

	Heinz


February 05, 2002
"Walter" <walter@digitalmars.com> wrote in message news:a3nvlq$2f0q$2@digitaldaemon.com...
> I bet it's alignment. Try the NOP suggestion. -Walter

You'd win the bet... almost!  It was indeed an alignment problem - not of the code, but of the data that the MMX instructions access.  Playing with NOPs only improved performance by about 2%, insignificant compared to the drop in execution time from 2.5 seconds to 1.8.

One of the operands of my instructions cannot be aligned, but the other one can.  I used an automatic vector (char p[48]), declared in main(), and passed a pointer to it.  With "-o+all", p ends up aligned on a 4-byte boundary, while "-o+space" aligns it on an 8-byte boundary, which is vital for MMX performance.  Both BCC and gcc align automatic vectors to 8 or 16 bytes by default, so that is where the performance penalty came from!
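
(When the compiler can't be trusted to align an automatic array, one portable workaround is to over-allocate and round the pointer up by hand. A sketch, not Laurentiu's actual code; it assumes a pointer fits in an unsigned long, which holds for the 32-bit targets discussed here.)

#include <stdio.h>

#define ALIGNMENT 8UL

int main(void)
{
    /* Over-allocate by ALIGNMENT-1 bytes, then round the pointer up
       to the next 8-byte boundary, independent of compiler flags. */
    char raw[48 + ALIGNMENT - 1];
    char *p = (char *)(((unsigned long)raw + ALIGNMENT - 1)
                       & ~(ALIGNMENT - 1));

    printf("raw = %p, aligned p = %p\n", (void *)raw, (void *)p);
    return 0;
}

The aligned pointer p can then be handed to the MMX routine regardless of how -o+all or -o+space lays out the stack.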

I did more tests related to alignment in code generated by DMC and other compilers, but I will post a separate message in c++, since we've already drifted pretty far from the original crash of NASM-generated code... :)

Many thanks for your help and suggestions!

Laurentiu

