September 05, 2012
On Sep 5, 2012, at 8:08 AM, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> 
> Array literals are not so easy to fix.  I once thought that it would be optimal to make it a stack initialisation given that all values are known at compile time, but this in fact caused many strange SEGVs in quite a few of my programs (most are parsers / interpreters, so things that recurse *heavily* into themselves, and it was under these circumstances that array literals on the stack would get corrupted in one way or another, causing *huge* errors in perfectly sound code).

It sounds like your code has escaping references?  I think the presence of a GC tends to eliminate a lot of thought about data ownership.  This is usually beneficial in that maintaining ownership rules tends to be a huge pain, but then following such rules also tends to avoid issues like this.
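
To make the escaping-reference concern concrete, here is a minimal D sketch. The names `makeOwned`, `Node`, and `fill` are made up for illustration and are not taken from anyone's code in this thread. A fixed-size array lives in the function's stack frame, so a slice of it that is stored somewhere longer-lived dangles once the function returns; `.dup` makes a GC-heap copy that can safely outlive the call.

```d
import std.stdio : writeln;

// Hypothetical holder that keeps a slice after the producing call returns.
struct Node
{
    int[] values;
}

// Safe: .dup copies the stack data to the GC heap before it is returned.
int[] makeOwned()
{
    int[3] local = [1, 2, 3];   // fixed-size array, stored in this stack frame
    return local[].dup;
}

void fill(ref Node n)
{
    int[3] local = [4, 5, 6];
    // n.values = local[];      // would dangle: the slice points into this frame
    n.values = local[].dup;     // owned copy, valid after fill() returns
}

void main()
{
    auto a = makeOwned();
    Node n;
    fill(n);
    writeln(a, " ", n.values);
}
```

If array literals were placed on the stack implicitly, the commented-out assignment is roughly the pattern that would start to break in deeply recursive code.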
September 05, 2012
On Wednesday, 5 September 2012 at 11:03:03 UTC, Benjamin Thaut wrote:
> I rewrote a 3D game I created during my studies with D 2.0 to use manual memory management. When I'm not studying I'm working in the 3D engine department of Havok. As I needed to practice manual memory management and had wanted to get rid of the GC in D for quite some time, I went through all this effort to create a GC-free version of my game.
>
> The results are:
>
>     DMD GC Version: 71 FPS, 14.0 ms frametime
>     GDC GC Version: 128.6 FPS, 7.72 ms frametime
>     DMD MMM Version: 142.8 FPS, 7.02 ms frametime
>
> GC collection times:
>
>     DMD GC Version: 8.9 ms
>     GDC GC Version: 4.1 ms
>
> As you see, the manually managed version is twice as fast as the garbage collected one. Even the highly optimized version created with GDC is still slower than the manual memory management.
>
> You can find the full article at:
>
> http://3d.benjamin-thaut.de/?p=20#more-20
>
>
> Feedback is welcome.
>
> Kind Regards
> Benjamin Thaut

Did you try GC.disable/enable?
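
For reference, a rough sketch of what that experiment could look like with `core.memory.GC`. The `simulateFrame` function and the frame count are placeholders, not code from the game. Note that `GC.disable` only suppresses automatic collections (the runtime may still collect under memory pressure), and that each `disable` call needs a matching `enable`.

```d
import core.memory : GC;
import std.stdio : writeln;

// Placeholder for one frame of game work; allocates some short-lived garbage.
size_t simulateFrame()
{
    auto scratch = new int[](1024);
    scratch[] = 1;
    size_t sum = 0;
    foreach (v; scratch)
        sum += v;
    return sum;
}

size_t runFrames(size_t frames)
{
    size_t total = 0;
    foreach (i; 0 .. frames)
    {
        GC.disable();            // no automatic collections inside the frame
        total += simulateFrame();
        GC.enable();             // must pair with the disable above
    }
    GC.collect();                // collect at a point of our choosing instead
    return total;
}

void main()
{
    writeln(runFrames(100));
}
```

This moves collections to a chosen point but does not make them cheaper, so it would mainly help with frame-time spikes rather than average throughput.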
September 05, 2012
On 9/5/2012 4:03 AM, Benjamin Thaut wrote:
> GC collection times:
>
>      DMD GC Version: 8.9 ms
>      GDC GC Version: 4.1 ms

I'd like it if you could add some instrumentation to see what accounts for the time difference. I presume they both use the same D source code.
September 05, 2012
On 6 September 2012 00:10, Walter Bright <newshound2@digitalmars.com> wrote:
> On 9/5/2012 4:03 AM, Benjamin Thaut wrote:
>>
>> GC collection times:
>>
>>      DMD GC Version: 8.9 ms
>>      GDC GC Version: 4.1 ms
>
>
> I'd like it if you could add some instrumentation to see what accounts for the time difference. I presume they both use the same D source code.

I'd say they are identical, but I don't really look at what goes on over on the MinGW port.


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
September 05, 2012
On 9/6/12, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> I'd say they are identical, but I don't really look at what goes on over on the MinGW port.

Speaking of which, I'd like to see if the Unilink linker would make any difference as well. It's known to make smaller binaries than Optlink. I think Unilink could be tested with MinGW if it supports whatever GDC outputs, to compare against LD.
September 06, 2012
Walter Bright:

> I'd like it if you could add some instrumentation to see what accounts for the time difference. I presume they both use the same D source code.

Maybe that performance difference comes from the sum of some metric tons of different little optimizations done by the GCC back-end.

Bye,
bearophile
September 06, 2012
On 9/5/2012 5:01 PM, bearophile wrote:
> Walter Bright:
>
>> I'd like it if you could add some instrumentation to see what accounts for the
>> time difference. I presume they both use the same D source code.
>
> Maybe that performance difference comes from the sum of some metric tons of
> different little optimizations done by the GCC back-end.

We can trade guesses all day, and not get anywhere. Instrumentation and measurement are needed.

I've investigated many similar things, and the truth usually turned out to be something nobody guessed or assumed. I recall the benchmark you posted where you guessed that dmd's integer code generation was woefully deficient. Examining the actual output showed that there wasn't a dime's worth of difference in the code generated from dmd vs gcc.

The problem turned out to be the long division runtime library function. Fixing that brought the timings to parity.

No code gen changes whatsoever were needed.
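
As an illustration of the kind of measurement being asked for, here is a small sketch that times an allocation phase and a forced collection separately. It assumes a current Phobos that provides `std.datetime.stopwatch`; the `timeIt` helper and the workload are invented for the example and say nothing about where the time actually goes in the game.

```d
import core.memory : GC;
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical helper: run a piece of work and return how long it took.
Duration timeIt(scope void delegate() work)
{
    auto sw = StopWatch(AutoStart.yes);
    work();
    return sw.peek();
}

void main()
{
    int[][] keep;

    // Time the allocations on their own...
    auto alloc = timeIt({
        foreach (i; 0 .. 10_000)
            keep ~= new int[](64);
    });

    // ...and an explicit collection on its own.
    auto collect = timeIt({ GC.collect(); });

    writefln("alloc: %s us, collect: %s us",
             alloc.total!"usecs", collect.total!"usecs");
}
```

Running the same instrumented source through both compilers would show which phase differs, instead of only the total.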

September 06, 2012
On Thursday, 6 September 2012 at 00:00:31 UTC, bearophile wrote:
> Walter Bright:
>
>> I'd like it if you could add some instrumentation to see what accounts for the time difference. I presume they both use the same D source code.
>
> Maybe that performance difference comes from the sum of some metric tons of different little optimizations done by the GCC back-end.
>
> Bye,
> bearophile

In addition to Walter's response, it is very rare for advanced compiler optimisations to make >2x difference on any non-trivial code. Not impossible, but it's definitely suspicious.


September 06, 2012
Walter Bright:

> No code gen changes whatsoever were needed.

In that case I think I didn't specify what subsystem of the D compiler was not "good enough", I have just shown a performance difference. The division was slow, regardless of the cause. This is what's important for the final C/D programmer, not if the cause is a badly written division routine, or a bad/missing optimization stage.

And regarding divisions, currently they are not optimized by dmd if divisors are small (like 10) and statically known.
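
For readers unfamiliar with the optimization in question: an unsigned 32-bit division by the constant 10 can be replaced by a multiplication and a shift. The sketch below writes that transformation out by hand; `div10` is a name chosen for this example, and it only illustrates the technique, not what any particular compiler emits.

```d
import std.stdio : writeln;

// Divide by 10 without a division instruction.
// 0xCCCC_CCCD is ceil(2^35 / 10); the 64-bit product cannot overflow.
uint div10(uint x)
{
    return cast(uint)((cast(ulong) x * 0xCCCC_CCCDUL) >> 35);
}

void main()
{
    // Spot-check against the built-in division, including the extremes.
    foreach (uint x; [0u, 9u, 10u, 19u, 99u, 12_345u, uint.max])
        assert(div10(x) == x / 10);

    writeln(div10(12_345));
}
```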

Bye,
bearophile
September 06, 2012
On 9/6/2012 4:30 AM, Peter Alexander wrote:
>
> In addition to Walter's response, it is very rare for advanced compiler
> optimisations to make >2x difference on any non-trivial code. Not
> impossible, but it's definitely suspicious.
>
>

I love trying to explain to people that our debug builds are too slow because they have instrumented too much of the code and haven't disabled any of it.  A lot of people are pushed into debugging release builds as a result, which is pretty silly.

Now there are some pathological cases:
  non-inlined constructors can sometimes kill you for 3D vector math type libraries
  128-bit SIMD intrinsics with Microsoft's compiler in debug builds make horrifically slow code: each operation has its results written to memory, which are then reloaded for the next 'instruction'.  I believe it's two orders of magnitude slower (the extra instructions, plus pegging the read and write ports of the CPU, hurt quite a lot too).  These tend to be tight functions, so they can be optimized in debug builds selectively . . .