August 02, 2013
On 8/2/2013 12:57 AM, Rainer Schuetze wrote:
>> http://www.digitalmars.com/download/freecompiler.html
>
> Although my laptop got quite a bit faster overnight (I guess it was throttled
> for some reason yesterday), relative results don't change:
>
> std.algorithm -main -unittest
>
> dmc85?: 12.5 sec
> dmc857: 12.5 sec
> msc: 7 sec
>
> BTW: I usually use VS2008, but now also tried VS2010 - no difference.

The two dmc times shouldn't be the same. I see a definite improvement. Disassemble aav.obj, and look at the function aaGetRvalue. It should look like this:

?_aaGetRvalue@@YAPAXPAUAA@@PAX@Z:
                push    EBX
                mov     EBX,0Ch[ESP]
                push    ESI
                cmp     dword ptr 0Ch[ESP],0
                je      L184
                mov     EAX,0Ch[ESP]
                mov     ECX,4[EAX]
                cmp     ECX,4
                jne     L139
                mov     ESI,EBX
                and     ESI,3
                jmp short       L166
L139:           cmp     ECX,01Fh
                jne     L15E
======== note this section does not have a div instruction in it ==============
                mov     EAX,EBX
                mov     EDX,08421085h
                mov     ECX,EBX
                mul     EDX
                mov     EAX,ECX
                sub     EAX,EDX
                shr     EAX,1
                lea     EDX,[EAX][EDX]
                shr     EDX,4
                imul    EAX,EDX,01Fh
                sub     ECX,EAX
                mov     ESI,ECX
==========================================================================
                jmp short       L166
L15E:           mov     EAX,EBX
                xor     EDX,EDX
                div     ECX
                mov     ESI,EDX
L166:           mov     ECX,0Ch[ESP]
                mov     ECX,[ECX]
                mov     EDX,[ESI*4][ECX]
                test    EDX,EDX
                je      L184
L173:           cmp     4[EDX],EBX
                jne     L17E
                mov     EAX,8[EDX]
                pop     ESI
                pop     EBX
                ret
L17E:           mov     EDX,[EDX]
                test    EDX,EDX
                jne     L173
L184:           pop     ESI
                xor     EAX,EAX
                pop     EBX
                ret
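
For readers who prefer the marked block in higher-level terms, here is a rough C++ sketch of what it appears to compute: the hash modulo 31, using a multiply by a precomputed reciprocal in place of div. The function name is made up for illustration and does not come from aav.c. The constant is the one in the listing, which matches the low 32 bits of ceil(2^37 / 31).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the div-free block in the listing above.
// 0x08421085 is the low 32 bits of the 33-bit multiplier ceil(2^37 / 31).
inline std::uint32_t mod31_no_div(std::uint32_t x)
{
    // mul EDX: keep only the high 32 bits of the 64-bit product.
    std::uint32_t hi = static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(x) * 0x08421085u) >> 32);
    assert(hi <= x);  // so the subtraction below cannot wrap

    // sub / shr 1 / lea / shr 4: adds back the implicit 33rd bit of the
    // multiplier without overflowing 32 bits, then finishes the shift by 5.
    std::uint32_t q = (((x - hi) >> 1) + hi) >> 4;  // q == x / 31

    // imul / sub: remainder = x - q * 31.
    return x - q * 31u;
}
```

The result should match `x % 31` for every 32-bit input, so the only difference from the div path at L15E is cost, not behavior.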
August 02, 2013

On 02.08.2013 10:24, Walter Bright wrote:
> On 8/2/2013 12:57 AM, Rainer Schuetze wrote:
>>> http://www.digitalmars.com/download/freecompiler.html
>>
>> Although my laptop got quite a bit faster overnight (I guess it was
>> throttled
>> for some reason yesterday), relative results don't change:
>>
>> std.algorithm -main -unittest
>>
>> dmc85?: 12.5 sec
>> dmc857: 12.5 sec
>> msc: 7 sec
>>
>> BTW: I usually use VS2008, but now also tried VS2010 - no difference.
>
> The two dmc times shouldn't be the same. I see a definite improvement.
> Disassemble aav.obj, and look at the function aaGetRvalue. It should
> look like this:

My disassembly looks exactly the same. I don't think that a single div operation in a rather long function has a lot of impact on modern processors. I'm running an i7; according to the instruction tables by Agner Fog, div has a latency of 17-28 cycles and a reciprocal throughput of 7-17 cycles. If I estimate the latency of the asm snippet, I get 16 cycles, which is about the same. And that doesn't take the additional tests and jumps into consideration.

======== note this section does not have a div instruction in it ==============
                mov     EAX,EBX
                mov     EDX,08421085h   ; latency 3
                mov     ECX,EBX
                mul     EDX             ; latency 5
                mov     EAX,ECX
                sub     EAX,EDX         ; latency 1
                shr     EAX,1           ; latency 1
                lea     EDX,[EAX][EDX]  ; latency 1
                shr     EDX,4           ; latency 1
                imul    EAX,EDX,01Fh    ; latency 3
                sub     ECX,EAX         ; latency 1
                mov     ESI,ECX
==========================================================================
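
The 16-cycle figure is simply the sum of the per-instruction latencies annotated above, treating the block as one dependent chain. A small C++ sketch of that arithmetic follows; the names are invented here, and the numbers are the ones in the comments.

```cpp
#include <cassert>

// Latencies copied from the annotations above, in listing order:
// mov imm, mul, sub, shr, lea, shr, imul, sub.
static const int kAnnotatedLatencies[] = {3, 5, 1, 1, 1, 1, 3, 1};

// Sums the annotated latencies, i.e. assumes no instruction overlaps another.
inline int snippet_latency_estimate()
{
    int total = 0;
    for (int cycles : kAnnotatedLatencies)
        total += cycles;
    assert(total > 0);
    return total;
}
```

The sum is 16, just under the 17-cycle lower bound quoted for div, which is why the replacement is not expected to help much on this processor.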

August 02, 2013
On 01/08/2013 00:32, Walter Bright wrote:
> Thanks for doing this, this is good information.
>
> On 7/31/2013 2:24 PM, Rainer Schuetze wrote:
>> I have just tried yesterday's dmd to build Visual D (it builds some
>> libraries and
>> contains a few short non-compiling tasks in between):
>>
>> Debug build dmd_dmc: 23 sec, std new 43 sec
>> Debug build dmd_msc: 19 sec, std new 20 sec
>
> That makes it clear that the dmc malloc() was the dominator, not code gen.
>


It still appears that the DMC malloc is a big reason for the difference between DMC and MSVC builds when compiling the algorithm unit tests. A very quick test suggests that changing the global new in rmem.c to call HeapAlloc instead of malloc gives a large speedup.

August 02, 2013
"Rainer Schuetze" <r.sagitario@gmx.de> wrote in message
news:ktbvam$dvf$1@digitalmars.com...
large-address-aware).
>
> This shows that removing most of the allocations was a good optimization for the dmc runtime, and that it has a smaller but still notable impact with a faster heap implementation (the VS runtime usually maps directly to the Windows API for non-Debug builds). I suspect the backend and the optimizer do not use "new" a lot, but plain "malloc" calls, so they still suffer from the slow runtime.

On a related note, I just tried replacing the two ::malloc calls in rmem's operator new with VirtualAlloc and I get a reduction from 13 seconds to 9 seconds (compiling "dmd std\range -unittest -main") with a release build of dmd.


August 02, 2013
On 8/2/2013 2:47 AM, Rainer Schuetze wrote:
> My disassembly looks exactly the same. I don't think that a single div operation
> in a rather long function has a lot of impact on modern processors. I'm running
> an i7, according to the instruction tables by Agner Fog, the div has latency of
> 17-28 cycles and a reciprocal throughput of 7-17 cycles. If I estimate the
> latency of the asm snippet, I also get 16 cycles. And that doesn't take the
> additional tests and jumps into consideration.


I'm using an AMD FX-6100.

August 02, 2013
On 8/2/2013 8:18 AM, Daniel Murphy wrote:
> On a related note, I just tried replacing the two ::malloc calls in rmem's
> operator new with VirtualAlloc and I get a reduction from 13 seconds to 9
> seconds (compiling "dmd std\range -unittest -main") with a release build of
> dmd.

Hmm, very interesting!

August 02, 2013

On 02.08.2013 18:37, Walter Bright wrote:
> On 8/2/2013 2:47 AM, Rainer Schuetze wrote:
>> My disassembly looks exactly the same. I don't think that a single div
>> operation
>> in a rather long function has a lot of impact on modern processors.
>> I'm running
>> an i7, according to the instruction tables by Agner Fog, the div has
>> latency of
>> 17-28 cycles and a reciprocal throughput of 7-17 cycles. If I estimate
>> the
>> latency of the asm snippet, I also get 16 cycles. And that doesn't
>> take the
>> additional tests and jumps into consideration.
>
>
> I'm using an AMD FX-6100.
>

This processor seems to do a little better with the mov reg,imm operation but otherwise is similar. The DIV operation has larger worst-case latency, though (16-48 cycles).

Better to just use a power of 2 for the array sizes anyway...
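
To illustrate the power-of-2 suggestion: when the bucket count is a power of two, the remainder reduces to a single and, which is what the size-4 case in the listing already does with `and ESI,3`. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bucket index for a table whose size is a power of two.
inline std::uint32_t bucket_index_pow2(std::uint32_t hash,
                                       std::uint32_t bucket_count)
{
    // Precondition: bucket_count is a nonzero power of two.
    assert(bucket_count != 0 && (bucket_count & (bucket_count - 1)) == 0);

    // For such sizes, hash % bucket_count equals hash & (bucket_count - 1),
    // so neither div nor a reciprocal multiply is needed.
    return hash & (bucket_count - 1);
}
```

The trade-off is that a mask only looks at the low bits of the hash, so this assumes the hash values are reasonably well distributed in those bits.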
August 02, 2013
On 8/2/2013 4:18 AM, Richard Webb wrote:
> It still appears that the DMC malloc is a big reason for the difference between
> DMC and MSVC builds when compiling the algorithm unit tests. (a very quick test
> suggests that changing the global new in rmem.c to call HeapAlloc instead of
> malloc gives a large speedup).


Yes, I agree, the DMC malloc is clearly a large performance problem. I had not realized this.

August 02, 2013
02-Aug-2013 20:40, Walter Bright wrote:
> On 8/2/2013 8:18 AM, Daniel Murphy wrote:
>> On a related note, I just tried replacing the two ::malloc calls in
>> rmem's
>> operator new with VirtualAlloc and I get a reduction from 13 seconds to 9
>> seconds (compiling "dmd std\range -unittest -main") with a release
>> build of
>> dmd.
>
> Hmm, very interesting!
>

Made a pull to provide an implementation of rmem.c on top of Win32 Heap API.
https://github.com/D-Programming-Language/dmd/pull/2445

Since global new/delete are already not reentrant, I also added the HEAP_NO_SERIALIZE flag to save on locking and unlocking the heap.

For me this brings the time down from 13 to 8 seconds.
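
As a rough illustration of the idea (not the code from the pull request), an allocator on top of the Win32 Heap API could look something like the sketch below. The function names are hypothetical, the private heap is an assumption of this sketch, and the non-Windows branch is only there so the example builds elsewhere. The no-serialize flag mentioned above is spelled HEAP_NO_SERIALIZE in the Win32 API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

#ifdef _WIN32
// Hypothetical helper: a private heap created once on first use.
// HEAP_NO_SERIALIZE at creation disables the heap lock for all later calls,
// which is only safe if a single thread ever touches this heap.
static HANDLE sketch_heap()
{
    static HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
    return heap;
}
#endif

// Hypothetical stand-in for the allocation done inside operator new.
void *sketch_alloc(std::size_t size)
{
    assert(size > 0);  // zero-size requests are out of scope for this sketch
#ifdef _WIN32
    HANDLE heap = sketch_heap();
    return heap ? HeapAlloc(heap, 0, size) : nullptr;
#else
    return std::malloc(size);  // fallback for non-Windows builds
#endif
}

// Hypothetical stand-in for the release done inside operator delete.
void sketch_free(void *p)
{
    if (!p)
        return;
#ifdef _WIN32
    HeapFree(sketch_heap(), 0, p);
#else
    std::free(p);
#endif
}
```

A real replacement for global new/delete would also need an out-of-memory policy, which this sketch leaves out by simply returning a null pointer.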

-- 
Dmitry Olshansky