Increasing D Compiler Speed by Over 75% (page 3)

On 8/2/2013 12:57 AM, Rainer Schuetze wrote:
>> http://www.digitalmars.com/download/freecompiler.html
>
> Although my laptop got quite a bit faster overnight (I guess it was throttled
> for some reason yesterday), relative results don't change:
>
> std.algorithm -main -unittest
>
> dmc85?: 12.5 sec
> dmc857: 12.5 sec
> msc: 7 sec
>
> BTW: I usually use VS2008, but now also tried VS2010 - no difference.

The two dmc times shouldn't be the same. I see a definite improvement. Disassemble aav.obj, and look at the function aaGetRvalue. It should look like this:

?_aaGetRvalue@@YAPAXPAUAA@@PAX@Z:
                push    EBX
                mov     EBX,0Ch[ESP]
                push    ESI
                cmp     dword ptr 0Ch[ESP],0
                je      L184
                mov     EAX,0Ch[ESP]
                mov     ECX,4[EAX]
                cmp     ECX,4
                jne     L139
                mov     ESI,EBX
                and     ESI,3
                jmp short       L166
L139:           cmp     ECX,01Fh
                jne     L15E
======== note this section does not have a div instruction in it ==============
                mov     EAX,EBX
                mov     EDX,08421085h
                mov     ECX,EBX
                mul     EDX
                mov     EAX,ECX
                sub     EAX,EDX
                shr     EAX,1
                lea     EDX,[EAX][EDX]
                shr     EDX,4
                imul    EAX,EDX,01Fh
                sub     ECX,EAX
                mov     ESI,ECX
==========================================================================
                jmp short       L166
L15E:           mov     EAX,EBX
                xor     EDX,EDX
                div     ECX
                mov     ESI,EDX
L166:           mov     ECX,0Ch[ESP]
                mov     ECX,[ECX]
                mov     EDX,[ESI*4][ECX]
                test    EDX,EDX
                je      L184
L173:           cmp     4[EDX],EBX
                jne     L17E
                mov     EAX,8[EDX]
                pop     ESI
                pop     EBX
                ret
L17E:           mov     EDX,[EDX]
                test    EDX,EDX
                jne     L173
L184:           pop     ESI
                xor     EAX,EAX
                pop     EBX
                ret

August 02, 2013

Re: Increasing D Compiler Speed by Over 75%

Posted by Rainer Schuetze
in reply to Walter Bright

On 02.08.2013 10:24, Walter Bright wrote:
> On 8/2/2013 12:57 AM, Rainer Schuetze wrote:
>>> http://www.digitalmars.com/download/freecompiler.html
>>
>> Although my laptop got quite a bit faster overnight (I guess it was
>> throttled
>> for some reason yesterday), relative results don't change:
>>
>> std.algorithm -main -unittest
>>
>> dmc85?: 12.5 sec
>> dmc857: 12.5 sec
>> msc: 7 sec
>>
>> BTW: I usually use VS2008, but now also tried VS2010 - no difference.
>
> The two dmc times shouldn't be the same. I see a definite improvement.
> Disassemble aav.obj, and look at the function aaGetRvalue. It should
> look like this:

My disassembly looks exactly the same. I don't think that a single div operation in a rather long function has a lot of impact on modern processors. I'm running an i7; according to the instruction tables by Agner Fog, div has a latency of 17-28 cycles and a reciprocal throughput of 7-17 cycles. If I add up the latencies of the asm snippet (annotated below), I also get 16 cycles, and that doesn't take the additional tests and jumps into consideration.

======== note this section does not have a div instruction in it ==============
                mov     EAX,EBX
                mov     EDX,08421085h   ; latency 3
                mov     ECX,EBX
                mul     EDX             ; latency 5
                mov     EAX,ECX
                sub     EAX,EDX         ; latency 1
                shr     EAX,1           ; latency 1
                lea     EDX,[EAX][EDX]  ; latency 1
                shr     EDX,4           ; latency 1
                imul    EAX,EDX,01Fh    ; latency 3
                sub     ECX,EAX         ; latency 1
                mov     ESI,ECX
==========================================================================
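
For anyone who wants to check the div-free sequence above: it is the usual multiply-by-reciprocal trick for taking a remainder modulo 31. Below is a minimal C++ sketch of the same arithmetic, step by step. The names `mod31_no_div` and `check_mod31_no_div` are made up for illustration and do not come from the dmd or dmc sources.

```cpp
#include <cassert>
#include <cstdint>

// Remainder of x / 31 without a div instruction, mirroring the listing:
// 0x08421085 is ceil(2^37 / 31) - 2^32, so the quotient is recovered from
// the high half of a 32x32 -> 64 bit multiply plus a few shifts and adds.
static inline uint32_t mod31_no_div(uint32_t x)
{
    const uint32_t magic = 0x08421085u;
    // mul EDX: the high 32 bits of x * magic end up in EDX
    uint32_t hi = (uint32_t)(((uint64_t)x * magic) >> 32);
    // sub / shr 1 / lea / shr 4: quotient q = x / 31
    uint32_t q = (((x - hi) >> 1) + hi) >> 4;
    // imul / sub: remainder = x - q * 31
    return x - q * 31u;
}

// Illustrative self-check against the built-in % operator.
static void check_mod31_no_div()
{
    const uint32_t samples[] = {0u, 1u, 30u, 31u, 32u, 61u, 62u,
                                0x7FFFFFFFu, 0x80000000u,
                                0xFFFFFFFEu, 0xFFFFFFFFu};
    for (uint32_t x : samples)
        assert(mod31_no_div(x) == x % 31u);
}
```

Calling `check_mod31_no_div()` from a test program compares the sketch with `%` on a handful of edge values. It says nothing about timing, which is the point under discussion here.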

On 01/08/2013 00:32, Walter Bright wrote:
> Thanks for doing this, this is good information.
>
> On 7/31/2013 2:24 PM, Rainer Schuetze wrote:
>> I have just tried yesterdays dmd to build Visual D (it builds some
>> libraries and contains a few short non-compiling tasks in between):
>>
>> Debug build dmd_dmc: 23 sec, std new 43 sec
>> Debug build dmd_msc: 19 sec, std new 20 sec
>
> That makes it clear that the dmc malloc() was the dominator, not code gen.

It still appears that the DMC malloc is a big reason for the difference between DMC and MSVC builds when compiling the algorithm unit tests. (A very quick test suggests that changing the global new in rmem.c to call HeapAlloc instead of malloc gives a large speedup.)

"Rainer Schuetze" <r.sagitario@gmx.de> wrote in message news:ktbvam$dvf$1@digitalmars.com...
> large-address-aware).
>
> This shows that removing most of the allocations was a good optimization
> for the dmc-Runtime, but does not have a large, but still notable impact
> on a faster heap implementation (the VS runtime usually maps directly to
> the Windows API for non-Debug builds). I suspect the backend and the
> optimizer do not use "new" a lot, but plain "malloc" calls, so they still
> suffer from the slow runtime.

On a related note, I just tried replacing the two ::malloc calls in rmem's operator new with VirtualAlloc and I get a reduction from 13 seconds to 9 seconds (compiling "dmd std\range -unittest -main") with a release build of dmd.

On 8/2/2013 2:47 AM, Rainer Schuetze wrote:
> My disassembly looks exactly the same. I don't think that a single div operation
> in a rather long function has a lot of impact on modern processors. I'm running
> an i7, according to the instruction tables by Agner Fog, the div has latency of
> 17-28 cycles and a reciprocal throughput of 7-17 cycles. If I estimate the
> latency of the asm snippet, I also get 16 cycles. And that doesn't take the
> additional tests and jumps into consideration.

I'm using an AMD FX-6100.

On 8/2/2013 8:18 AM, Daniel Murphy wrote:
> On a related note, I just tried replacing the two ::malloc calls in rmem's
> operator new with VirtualAlloc and I get a reduction from 13 seconds to 9
> seconds (compiling "dmd std\range -unittest -main") with a release build of
> dmd.

Hmm, very interesting!

On 02.08.2013 18:37, Walter Bright wrote:
> On 8/2/2013 2:47 AM, Rainer Schuetze wrote:
>> My disassembly looks exactly the same. I don't think that a single div
>> operation in a rather long function has a lot of impact on modern
>> processors. I'm running an i7, according to the instruction tables by
>> Agner Fog, the div has latency of 17-28 cycles and a reciprocal
>> throughput of 7-17 cycles. If I estimate the latency of the asm snippet,
>> I also get 16 cycles. And that doesn't take the additional tests and
>> jumps into consideration.
>
> I'm using an AMD FX-6100.

This processor seems to do a little better with the mov reg,imm operation but otherwise is similar. The DIV operation has larger worst-case latency, though (16-48 cycles).

Better to just use a power of 2 for the array sizes anyway...
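
To make the power-of-2 suggestion concrete: when the bucket count is a power of two, the bucket index is a single AND, which is what the `and ESI,3` path in the listing does for a table of size 4. Here is a minimal C++ sketch, with the hypothetical names `is_power_of_two` and `bucket_index` (not taken from aav.c):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when n has exactly one bit set.
static inline bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Bucket selection for a power-of-two table: key % bucket_count reduces
// to a mask, so neither a div nor a multiply sequence is needed.
static inline size_t bucket_index(uintptr_t key, size_t bucket_count)
{
    assert(is_power_of_two(bucket_count));
    return (size_t)key & (bucket_count - 1);
}
```

One caveat: the mask only looks at the low bits of the key. If the keys are aligned pointers, those bits carry little information, so a real table would probably want to shift or mix the key first.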

On 8/2/2013 4:18 AM, Richard Webb wrote:
> It still appears that the DMC malloc is a big reason for the difference between
> DMC and MSVC builds when compiling the algorithm unit tests. (a very quick test
> suggests that changing the global new in rmem.c to call HeapAlloc instead of
> malloc gives a large speedup).

Yes, I agree, the DMC malloc is clearly a large performance problem. I had not realized this.

02-Aug-2013 20:40, Walter Bright wrote:
> On 8/2/2013 8:18 AM, Daniel Murphy wrote:
>> On a related note, I just tried replacing the two ::malloc calls in
>> rmem's operator new with VirtualAlloc and I get a reduction from 13
>> seconds to 9 seconds (compiling "dmd std\range -unittest -main") with a
>> release build of dmd.
>
> Hmm, very interesting!

Made a pull request that provides an implementation of rmem.c on top of the Win32 Heap API.
https://github.com/D-Programming-Language/dmd/pull/2445

Since global new/delete are already not reentrant, I also added the NO_SERIALIZE flag to save on locking/unlocking of the heap.

For me this brings the time down from 13 to 8 seconds.

--
Dmitry Olshansky
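
For readers unfamiliar with the Win32 Heap API, here is a rough C++ sketch of the general idea: a private heap created with `HEAP_NO_SERIALIZE`, with allocation and free routed through it. This is not the code from the pull request. The names `private_heap`, `compiler_alloc` and `compiler_free` are invented for illustration, and the `malloc` fallback exists only so the sketch builds on other platforms. Skipping serialization is only safe if the heap is never touched from more than one thread at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

#ifdef _WIN32
// One growable private heap (initial size 0, maximum size 0 = no limit).
// HEAP_NO_SERIALIZE given at creation applies to all later calls on this
// heap, so HeapAlloc/HeapFree skip the heap lock.
static HANDLE private_heap()
{
    static HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
    return heap;
}
#endif

// Hypothetical allocation entry point; returns nullptr on failure.
static void *compiler_alloc(size_t size)
{
#ifdef _WIN32
    HANDLE heap = private_heap();
    return heap ? HeapAlloc(heap, 0, size) : nullptr;
#else
    return std::malloc(size);   // fallback, assumption for non-Windows builds
#endif
}

// Hypothetical matching free; a null pointer is ignored.
static void compiler_free(void *p)
{
    if (!p)
        return;
#ifdef _WIN32
    BOOL ok = HeapFree(private_heap(), 0, p);
    assert(ok);
    (void)ok;
#else
    std::free(p);
#endif
}
```

A global `operator new` / `operator delete` pair could forward to these two functions. Whether that reproduces the 13 to 8 second improvement reported above depends on the workload and the runtime being replaced.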
