March 18, 2008
Matthew Allen, the benchmarks on the Shootout site are flawed, but they are probably less flawed than your benchmarks, so I suggest you take a look at them. You can download them and try them on your PC (for example using an Intel compiler for the C++ side, etc.).

Bye,
bearophile
March 18, 2008
== Quote from Frits van Bommel (fvbommel@REMwOVExCAPSs.nl)'s article
> Sean Kelly wrote:
> > == Quote from BCS (BCS@pathlink.com)'s article
> >>> DWORD start = timeGetTime();
> >>> int i, j, k;
> >>> float dx = 0;
> >>> for (i = 0; i < 1000; i++)
> >>>     for (j = 0; j < 1000; j++)
> >>>         for (k = 0; k < 10; k++)
> >>>         {
> >>>             dx++;
> >>>         }
> >>> DWORD end = timeGetTime();
> >>>
> [snip]
> >
> > D apps also have more going on in the application initialization phase than C++ apps.  For a real apples-to-apples comparison, you might want to consider using Tango with the "stub" GC plugged in.  That just calls malloc/free and has no initialization cost, at the expense of no actual garbage collection.
> > I'll have to check whether the stub GC compiles with the latest Tango--it's been a while since I used it.
> How is the startup time relevant, when he appears to be measuring in-process?

It's not.  I only mentioned it because BCS mentioned startup time.


Sean
March 18, 2008
BCS wrote:
> Frits van Bommel wrote:
> 
>> However, after adding 'printf("%d", dx)' the generated code for D and C++ is virtually identical, as are the timings.
> 
> printf is kind of a heavyweight function. How does it compare with some dummy function?

It doesn't seem to make any difference, unless you count executable size.
(I passed 'dx' to a separately-compiled empty C function instead of printf)
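
Something along these lines, if anyone wants to reproduce it (the 'sink' name is made up; the point is just that the optimizer can't see into a separately-compiled function):

// sink.c, compiled separately (e.g. "dmc -c sink.c"):
//     void sink(double x) { }

extern (C) void sink(double x);  // opaque to the D optimizer

void main()
{
    float dx = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            for (int k = 0; k < 10; k++)
                dx++;
    sink(dx);  // the result escapes, so the loops can't be thrown away
}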
March 19, 2008
1) Using a float or double as an incrementor in a tight loop is a bad idea.  Most compilers optimize it out where possible, and Agner Fog and Paul Hsieh both say as much; they know why better than I do.

2) Most compilers optimize stuff out if it doesn't directly affect output, external function calls, or arguments.  This is usually done at a per-function level; a better optimizer would do it for the whole program.

3) Startup for D is slower even for hello world because D statically links the entirety of phobos and the GC even if you don't ever use them.  This equates to about 80kb of bloat - so it's still dramatically better than Java or C#, but still not "correct".

4) If the GC does a collection cycle, it'll bump the time complexity.  This will happen pseudo-randomly.
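
If you want to rule that out while timing, something like this should do it (just a sketch, assuming D1 Phobos' std.gc; note that the loop being measured allocates nothing, so a collection shouldn't trigger inside it anyway):

import std.gc;  // D1 Phobos GC controls

void main()
{
    std.gc.fullCollect();  // flush any pending garbage before timing
    std.gc.disable();      // no collection can start inside the timed region

    // ... timed benchmark code goes here ...

    std.gc.enable();
}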

~~~

If you really want to improve performance on C or C++, do it by profiling your program and optimizing the parts where speed actually matters.

- simplify
- remove unnecessary loops
- hoist stuff out of loops as much as possible
- iterate or recurse in ways that ease cache miss penalties
- iterate instead of recurse as much as possible
- reduce if/else if/else || && as much as possible
- multiply by inverse instead of divide where possible (see the sketch after this list)
- reduce calls to the OS where it's sensible
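
To make a couple of those concrete (hoisting a loop-invariant divide out as a multiply-by-reciprocal, and iterating in memory order to be kind to the cache), here is a rough D sketch with made-up data; the actual win depends heavily on the compiler and CPU:

import std.stdio;

void main()
{
    double[1000] data;
    double scale = 3.7;

    // Hoisting + multiply-by-inverse: the division is loop-invariant,
    // so compute the reciprocal once, outside the loop.
    double inv = 1.0 / scale;
    for (int i = 0; i < 1000; i++)
        data[i] = i * inv;  // instead of data[i] = i / scale;

    // Cache-friendly iteration: walk a 2D array in the order it is laid
    // out in memory (row by row), so consecutive accesses stay in the
    // same cache lines instead of jumping a whole row apart each step.
    const int N = 128;
    double[N][N] grid;
    for (int r = 0; r < N; r++)      // good: matches the memory layout
        for (int c = 0; c < N; c++)
            grid[r][c] = r + c;
    // Swapping the subscripts to grid[c][r] makes every access land
    // N doubles away from the previous one and misses the cache far more.

    writefln("%s %s", data[999], grid[N - 1][N - 1]);  // keep the results live
}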

If you need to go further, learn assembler.  D's inline assembler ain't half bad.  You can do things in assembler that you can't do in HLLs: things like ror, rcl, sete, cmovcc, prefetchnta, clever MMX/SSE usage and such.
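
For a taste of the flavour, here's a toy sketch of DMD's x86 inline assembler (nothing you actually need asm for, and it assumes DMD's convention of returning integer results in EAX):

uint ror13(uint x)
{
    asm
    {
        mov EAX, x;   // load the argument
        ror EAX, 13;  // rotate right by 13 bits; no single HLL operator for this
        // DMD takes the integer return value from EAX
    }
}

void main()
{
    assert(ror13(1) == 0x00080000);
}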

/me is looking forward to SSE getting byte-array functionality.  It could outperform all the x86-32 string stuff by an order of magnitude.
March 19, 2008
> If you really want to improve performance on C or C++, do it by profiling your program and optimizing the parts where speed actually matters.
>
> - simplify
> - remove unnecessary loops
> - hoist stuff out of loops as much as possible
> - iterate or recurse in ways that ease cache miss penalties
Could you elaborate on this? ^

> - iterate instead of recurse as much as possible
> - reduce if/else if/else || && as much as possible
Is this because of the obvious `fewer calculations is better`, or something else?

> - multiply by inverse instead of divide where possible
> - reduce calls to the OS where it's sensible
>

- don't allocate and then delete if you need about the same amount of memory again afterwards; reuse the buffer instead.
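
Roughly along these lines (a D sketch with made-up sizes; the second loop recycles one buffer instead of churning the allocator on every pass):

void main()
{
    // Wasteful: a fresh 64 KB scratch buffer is allocated and thrown
    // away on every pass through the loop.
    for (int i = 0; i < 1000; i++)
    {
        byte[] scratch = new byte[64 * 1024];
        scratch[0] = cast(byte) i;  // stand-in for real work
        delete scratch;
    }

    // Better: each pass needs about the same amount of memory, so
    // allocate the buffer once and reuse it.
    byte[] buf = new byte[64 * 1024];
    for (int i = 0; i < 1000; i++)
    {
        buf[0] = cast(byte) i;  // stand-in for real work
    }
}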


March 19, 2008
"Dan" <murpsoft@hotmail.com> wrote in message news:frpnmc$24u0$1@digitalmars.com...

>
> 3) Startup for D is slower even for hello world because D statically links the entirety of phobos and the GC even if you don't ever use them.  This equates to about 80kb of bloat - so it's still dramatically better than Java or C#, but still not "correct".

If it linked in all of phobos, your programs would start at around 1MB.  It doesn't link in all of phobos.


March 19, 2008
On Wed, 19 Mar 2008 02:45:00 +0200, Dan <murpsoft@hotmail.com> wrote:

> 4) If the GC does a collection cycle, it'll bump the time complexity.  This will happen pseudo-randomly.

Garbage collection can only happen during a memory allocation, so turning off the GC will have no effect here (except, perhaps, on initialization time, which isn't what we're measuring here anyway).

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com
March 19, 2008
Saaa wrote:
>> - reduce if/else if/else || && as much as possible
> Is this because of the obvious `less calculations is better`, or something else?
I think chiefly this would be so the CPU's branch prediction can be more consistent, or so it doesn't have to guess at all: "The elimination of branching is an important concern with today's deeply pipelined processor architectures."
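
A tiny illustration (D sketch; whether the second version really compiles to a branch-free setcc/cmov sequence is up to the code generator, but it at least gives it the chance):

// Branchy version: the predictor has to guess on every element.
int countAboveBranchy(int[] a, int threshold)
{
    int n = 0;
    for (int i = 0; i < a.length; i++)
    {
        if (a[i] > threshold)
            n++;
    }
    return n;
}

// Arithmetic version: the comparison contributes 0 or 1 directly,
// which a decent code generator can do without a conditional jump.
int countAboveBranchless(int[] a, int threshold)
{
    int n = 0;
    for (int i = 0; i < a.length; i++)
        n += (a[i] > threshold) ? 1 : 0;
    return n;
}

void main()
{
    int[] a = [3, 9, 1, 7, 5];
    assert(countAboveBranchy(a, 4) == countAboveBranchless(a, 4));
}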

Check out some of the stuff on: http://www.azillionmonkeys.com/qed/optimize.html

 - Paul

March 19, 2008
Watch out with the -release flag; I experienced strange behaviour.

See: http://d.puremagic.com/issues/show_bug.cgi?id=797
March 19, 2008
Matthew Allen Wrote:

> I am looking to use D for programming a high-speed vision application which was previously written in C/C++. I have done some arbitrary speed tests and am finding that C/C++ seems to be faster than D by a factor of about 3. I have done some simple loop tests that increment a float value by some number, and also some memory allocation/deallocation loops, and C/C++ seems to come out on top each time. Is D meant to be faster than or as fast as C/C++, and if so, how can I optimize the code? I am using -inline, -O, and -release.
> 
> An example of a simple loop test I ran is as follows:
> 
> DWORD start = timeGetTime();
> int i, j, k;
> float dx = 0;
> for (i = 0; i < 1000; i++)
>     for (j = 0; j < 1000; j++)
>         for (k = 0; k < 10; k++)
>         {
>             dx++;
>         }
> DWORD end = timeGetTime();
> 
> In C++ I used ints and doubles. The C++ came back with a time of 15ms, and D came back with 45ms.

I am testing DMD 1.0 against the MSVC6 compiler. For DMD I used -O and -inline; for MSVC I used -O2.

Taking in the discussion, I tried a few more tests and found that D is faster in certain circumstances, so I guess the speed difference comes down to compiler optimization.

Also of note is that these tests were run in GUI applications, not console applications. Running as simple console applications, D came out on top in all tests.

Here is a summary of what I tried. Each test was run 100 times and the average is given.

double Add(double a, double b) { return a + b; }

DWORD start = timeGetTime();
int i, j, k;
double dx = 0;

for (i = 0; i < 1000; i++)
    for (j = 0; j < 1000; j++)
        for (k = 0; k < 10; k++)
        {
            dx++;                  // test 1 - simple increment on dx
            dx = Add(i + 0.5, j);  // test 2 - function call to set dx
            dx += Add(i + 0.5, j); // test 3 - function call to increment dx
        }

DWORD end = timeGetTime();


For test 1: DMD [42ms]  MSVC6 [15ms]
For test 2: DMD [9ms]   MSVC6 [100ms]
For test 3: DMD [42ms]  MSVC6 [109ms]