February 04, 2002 Re: X32 bug??? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Laurentiu Pancescu | "Laurentiu Pancescu" <lpancescu@fastmail.fm> wrote in message news:a3jgvm$2u58$2@digitaldaemon.com...
>
> "Walter" <walter@digitalmars.com> wrote in message news:a3hkt8$232q$2@digitaldaemon.com...
> > Your solution is now in the FAQ! Thanks, -Walter
>
> Great, thanks! And I'm also glad because my MMX code works now fine with DMC. However, I notice that the performance of my loop is about 20% weaker than in the Borland or gcc cases (no external calls, only MOVQ, PXOR and POR!). I expect this to be the same for any compiler, since they don't touch it. Then, I tried to force an alignment to a paragraph border for my assembly function, but this only made things worse by an additional 10% - I guess OPTLINK knows better about alignments... :)

Alignment probably is the issue. Try putting in NOPs one at a time before your loop, and time each time.

> Is it possible that the way different runtime libraries initialize the FPU affects the MMX performance (since both MMX and FPU instructions use the same physical registers)??? There's also a slight difference between Borland and gcc generated EXEs, about 2-3% - I don't see another reason.

I can't imagine how that would affect things. If it does, please let me know!
|
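The "time each time" step suggested above could be done with a small C harness along these lines. This is only a minimal sketch: `xor_or_kernel` and `time_kernel` are hypothetical names, and the plain C loop merely stands in for the NASM MMX routine, which is not shown in the thread. It uses `clock()`, so it measures CPU time at whatever resolution the runtime library provides.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the routine under test (an assumption,
   the real assembly function is not shown in the thread). */
typedef void (*kernel_fn)(unsigned char *dst, const unsigned char *src, size_t n);

/* Plain C stand-in for the MMX loop: XOR followed by OR, byte by byte. */
static void xor_or_kernel(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (unsigned char)((dst[i] ^ src[i]) | src[i]);
}

/* Call fn `reps` times and return the elapsed CPU time in seconds. */
static double time_kernel(kernel_fn fn, unsigned char *dst,
                          const unsigned char *src, size_t n, long reps)
{
    clock_t start, stop;
    long r;

    assert(fn != NULL && reps > 0);
    start = clock();
    for (r = 0; r < reps; r++)
        fn(dst, src, n);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Rebuilding with one more NOP in front of the loop and calling `time_kernel` with a large, fixed `reps` after each rebuild gives comparable numbers from run to run.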
February 04, 2002 Re: X32 bug??? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | "Walter" <walter@digitalmars.com> wrote in message news:a3knen$dig$1@digitaldaemon.com...
>> I notice that the performance of my loop is about 20% weaker than in the Borland or gcc cases (no external calls, only MOVQ, PXOR and POR).
> Alignment probably is the issue. Try putting in NOPs one at a time before your loop, and time each time.

I did some testing, with very interesting results: when I specified -o+space for the compiling of the C source files, the C code performance dropped slightly, but the MMX loop performance is the same as in the EXEs generated by BCC or gcc (even slightly better). I'm really confused about this, since NASM handles my MMX loop in the same way each time, and I called OPTLINK directly, so that it doesn't know about requirements to do space optimization (just in case it cares about SC's -o+space). Even more, I got used to the fact that the corresponding DOSX program, compiled from the same source, runs about 5-10% slower than its Win32 counterpart, but now, with -o+space, it runs faster!!!

I also did another test, using a source with a simple C loop, seen on one of BCC's newsgroups some months ago:
- -o, -o+speed, -o+all: execution time is 13 seconds
- no optimization flags specified: execution time is 4 seconds
- -o+space: execution time is 3 seconds

I thought -o+all is *always* the best to use, but it proves not to be the case... I can send you the sources for those two tests, if you want - perhaps it could help improving the optimizer?

Laurentiu
|
February 05, 2002 Re: X32 bug??? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Laurentiu Pancescu | Since you said the critical loop is in the assembler code, it cannot be the optimizer. The optimizer does not affect the assembler. I bet it's alignment. Try the NOP suggestion. -Walter

"Laurentiu Pancescu" <lpancescu@fastmail.fm> wrote in message news:a3mo9g$1fp3$1@digitaldaemon.com...
>
> "Walter" <walter@digitalmars.com> wrote in message news:a3knen$dig$1@digitaldaemon.com...
> >> I notice that the performance of my loop is about 20% weaker than in the Borland or gcc cases (no external calls, only MOVQ, PXOR and POR).
> > Alignment probably is the issue. Try putting in NOPs one at a time before your loop, and time each time.
>
> I did some testing, with very interesting results: when I specified -o+space for the compiling of the C source files, the C code performance dropped slightly, but the MMX loop performance is the same as in the EXEs generated by BCC or gcc (even slightly better). I'm really confused about this, since NASM handles my MMX loop in the same way each time, and I called OPTLINK directly, so that it doesn't know about requirements to do space optimization (just in case it cares about SC's -o+space). Even more, I got used to the fact that the corresponding DOSX program, compiled from the same source, runs about 5-10% slower than its Win32 counterpart, but now, with -o+space, it runs faster!!!
>
> I also did another test, using a source with a simple C loop, seen on one of BCC's newsgroups some months ago:
> - -o, -o+speed, -o+all: execution time is 13 seconds
> - no optimization flags specified: execution time is 4 seconds
> - -o+space: execution time is 3 seconds
>
> I thought -o+all is *always* the best to use, but it proves not to be the case... I can send you the sources for those two tests, if you want - perhaps it could help improving the optimizer?
>
> Laurentiu
|
February 05, 2002 Re: X32 bug??? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | Walter wrote...
>
> Since you said the critical loop is in the assembler code, it cannot be the optimizer. The optimizer does not affect the assembler. I bet it's alignment. Try the NOP suggestion. -Walter
Right. The code might fit into the processor cache in one case and not in the other depending on the starting address of the critical code. Due to optimization the assembly part can move to a base address that is not optimal for caching.
Just a guess,
Heinz
|
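Heinz's guess about the starting address can be made concrete by counting how many cache lines a block of code touches for a given start address. The following is a minimal sketch, not taken from the thread: `lines_spanned` is a hypothetical helper, the 32-byte line size used in the usage note is an assumption about the CPUs of that era, and `<stdint.h>` assumes a C99-capable compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines touched by `len` bytes starting at address `start`,
   for a cache line of `line` bytes (line must be nonzero). */
static size_t lines_spanned(uintptr_t start, size_t len, size_t line)
{
    if (len == 0)
        return 0;
    /* Index of the last line touched minus index of the first, plus one. */
    return (size_t)((start + len - 1) / line - start / line + 1);
}
```

For example, with an assumed 32-byte line, a 32-byte loop starting at offset 0 touches one line, while the same loop starting at offset 16 touches two; a shift of the base address of this kind is what Heinz describes.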
February 05, 2002 Re: X32 bug??? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | "Walter" <walter@digitalmars.com> wrote in message news:a3nvlq$2f0q$2@digitaldaemon.com...
> I bet it's alignment. Try the NOP suggestion. -Walter

You'd win the bet... almost! It was an alignment problem, indeed, but not of the code: of the data that the MMX instructions access. Playing with NOP only improved performance by 2%, not significant when compared to a boost from 2.5 seconds to 1.8 (execution time).

One of the operands of my instructions cannot be aligned, but the other one could. I used an automatic vector (char p[48]), declared in main(), and passed the pointer to that. The option "-o+all" causes p to be aligned at a 4-byte boundary, while "-o+space" aligns p at an 8-byte boundary, which is vital for MMX performance. Both BCC and GCC align automatic vectors at 8 or 16 bytes by default, so this is where the performance penalty came from!

I did more tests related to alignment in code generated by DMC and other compilers, but I will post a separate message in c++, since we're already pretty far away from the original crash of NASM generated code... :)

Many thanks for your help and suggestions!

Laurentiu
|
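The data alignment described above can be checked, and forced independently of compiler switches, with a couple of small helpers. This is a sketch under assumptions rather than the code from the thread: `is_aligned` and `align_up` are hypothetical names, and `<stdint.h>` with `uintptr_t` assumes a C99-capable compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p lies on an `align`-byte boundary (align: power of two). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Round p up to the next `align`-byte boundary (align: power of two).
   The caller must have reserved at least align - 1 spare bytes. */
static unsigned char *align_up(unsigned char *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    size_t pad;

    assert(align != 0 && (align & (align - 1)) == 0);
    /* Bytes needed to reach the next boundary; 0 if already aligned. */
    pad = (size_t)((align - (addr & (align - 1))) & (align - 1));
    return p + pad;
}
```

Declaring the automatic buffer as `unsigned char raw[48 + 7];` and passing `align_up(raw, 8)` to the MMX routine would give an 8-byte-aligned 48-byte operand whichever way the compiler places `raw` on the stack, and `is_aligned(p, 8)` can confirm what a given set of optimization flags actually produced.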
Copyright © 1999-2021 by the D Language Foundation