June 15, 2008
baleog, are you Marco? (same IP)
What kind of hardware do you have?
Because Marco also had some strange speed problems I couldn't replicate.


June 15, 2008
"Saaa" <empty@needmail.com> wrote in message news:g340sc$d1j$1@digitalmars.com...
> baleog, are you Marco? (same IP)
> What kind of hardware do you have?
> Because Marco also had some strange speed problems I couldn't replicate.

They have the same IP because they both used the web interface.  You'll notice that everyone who uses the web interface has the same IP.


June 16, 2008
Unknown W. Brackets Wrote:

> What about switches?  Your program uses arrays; if you have array bounds checks enabled, that could easily account for the difference.
> 
> One way to see is to dump the assembly (I think there's a utility called dumpobj included with dmd) and compare.  Obviously, it's doing something differently - there's nothing intrinsically "slower" about the language, for sure.
> 
> Also - keep in mind that gdc doesn't take advantage of all the optimizations that gcc is able to provide, at least at this time.  A couple of bytes can go a long, long way if not optimized right.

There's another classic benchmark issue that you could be stumbling over.  The sample code you posted throws away the results inside the function.

GCC's C compiler can detect that the results of the computations are never used, and optimize everything out of existence.  That kind of difference could easily explain the speed gap you're seeing.

If you're going to do this kind of micro-benchmark, you need to print the result of the computation or otherwise convince the compiler that you need it.
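
For example, a minimal sketch of the idea (hypothetical names; the only point is that the printed checksum depends on every element of the result, so the work cannot be discarded):

import std.stdio;

void main()
{
    int n = 100;
    float[] zs = new float[n * n];
    for (int i = 0; i < n * n; ++i)
        zs[i] = i * 0.5f;            // stand-in for the real computation
    float check = 0.0f;
    foreach (z; zs)
        check += z;                  // fold the whole result into one value
    writefln("check: %s", check);    // observable output defeats dead-code elimination
}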



June 16, 2008
"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message news:g336hl$10c8$1@digitalmars.com...
> "baleog" <maccarka@yahoo.com> wrote in message news:g32umu$11kq$1@digitalmars.com...
>> Tomas Lindquist Olsen Wrote:
>>
>>> What switches did you use to compile? Not much info you're giving ...
>>
>> Ubuntu-6.06
>> dmd-2.0.14 - 40sec with n=500
>> dmd -O -release -inline test.d
>> gdc-0.24 - 32sec
>> gdmd -O -release test.d
>> and gcc-4.0.3 - 1.5sec
>> gcc test.c
>>
>> so gcc without optimization runs 20 times faster than gdc
>> but I can't find out how to suppress array bounds checking
>
> Array bounds checking is off as long as you specify -release.
>
> I don't know if your computer is just really, REALLY slow, but out of curiosity I tried running the D program on my computer.  It completes in 1.2 seconds.
>
> Also, using malloc/free vs. new/delete shouldn't matter much in this program, because you make all of three allocations, all before any loops. The GC is never going to be called during the program.

I agree, but nonetheless the malloc version runs much faster on my systems (both Linux/Windows, P4 and Core2, all compiled w/ -O -inline -release). The relative performance difference gets larger as n increases:

 n    malloc       GC
------------------------
100   0.094      0.328
200   0.140      1.859
300   0.203      6.094
400   0.312     14.141
500   0.547     27.625

import std.conv;
import std.c.stdio;            // for printf

version(malloc)
{
    import std.c.stdlib;       // malloc/free; selected with -version=malloc
}

void main(string[] args)
{
    if(args.length > 1)
        test(toInt(args[1]));
    else
        printf("usage: mm nnn\n");
}

void test(int n)
{
    version(malloc)
    {
        float* xs = cast(float*)malloc(n*n*float.sizeof);
        float* ys = cast(float*)malloc(n*n*float.sizeof);
    }
    else
    {
        float[] xs = new float[n*n];
        float[] ys = new float[n*n];
    }

    for(int i = n-1; i >= 0; --i) {
        xs[i] = 1.0;
    }
    for(int i = n-1; i >= 0; --i) {
        ys[i] = 2.0;
    }

    version(malloc)
    {
        float* zs = cast(float*)malloc(n*n*float.sizeof);
    }
    else
    {
        float[] zs = new float[n*n];
    }

    // naive n x n matrix multiply: zs = xs * ys
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float s = 0.0;
            for (int k = 0; k < n; ++k) {
                s = s + (xs[k + (i*n)] * ys[j + (k*n)]);
            }
            zs[j + (i*n)] = s;
        }
    }

    version(malloc)
    {
        free(zs);
        free(ys);
        free(xs);
    }
    else
    {
        delete xs;
        delete ys;
        delete zs;
    }
}
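
For reference, assuming the file is saved as mm.d, the two variants would presumably be built as

dmd -O -release -inline mm.d
dmd -O -release -inline -version=malloc mm.d

since the malloc path is selected with the -version switch.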

June 16, 2008
On 2008-06-15 13:53:30 +0200, baleog <maccarka@yahoo.com> said:

> Thank you for your replies! I used malloc instead of new and the run time was about 1 sec

But you probably did not understand why... and it seems that neither did the others around here...

It is indeed a subtle pitfall that is easy to fall into.

When you benchmark:
1) Print something that depends on the result, like the sum of everything (it is not the main issue in this case, but doing so would probably have exposed the problem); that way you also have at least a small chance of noticing if your algorithm is wrong.

2) NaNs
Operations involving NaNs can be 1000 times slower, depending on the IEEE compliance requested of the processor!
D (very thoughtfully, as it makes spotting errors easier) initializes floating point numbers to NaN, unlike C.
-> your results follow

If you use malloc, the memory is not initialized to NaN -> performance.

Manual malloc in this case is definitely not needed.
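
For instance, a minimal sketch that makes the effect visible (the file name nantest.d and the sizes are arbitrary assumptions; time the two runs externally, e.g. with the shell's time command):

import std.stdio;

// Sum over an array that is either left at D's default initializer
// (NaN for floats) or explicitly filled with ones.
// Run as ./nantest nan vs ./nantest ones and compare wall-clock time.
void main(string[] args)
{
    int n = 2000;
    float[] xs = new float[n * n];   // every element starts out as NaN in D
    if (args.length > 1 && args[1] == "ones")
        xs[] = 1.0f;                 // overwrite the NaNs
    float s = 0.0f;
    foreach (x; xs)
        s += x * 1.0001f;            // arithmetic on NaN can be far slower
    writefln("checksum: %s", s);     // printing keeps the loop from being optimized away
}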

Writing a benchmark can be subtle... benchmarking correct code is easier...

Fawzi

June 16, 2008
On 2008-06-16 16:32:56 +0200, Fawzi Mohamed <fmohamed@mac.com> said:

> On 2008-06-15 13:53:30 +0200, baleog <maccarka@yahoo.com> said:
> 
>> Thank you for your replies! I used malloc instead of new and the run time was about 1 sec
> 
> But you probably did not understand why... and it seems that neither did the others around here...
> 
> It is indeed a subtle pitfall that is easy to fall into.
> 
> When you benchmark:
> 1) Print something that depends on the result, like the sum of everything (it is not the main issue in this case, but doing so would probably have exposed the problem); that way you also have at least a small chance of noticing if your algorithm is wrong.
> 
> 2) NaNs

ehm, sorry...
You do initialize everything...
ehm, never post without testing...

Fawzi

June 16, 2008
"Dave" <Dave_member@pathlink.com> wrote in message news:g34sja$2m1a$1@digitalmars.com...

>
> I agree, but nonetheless the malloc version runs much faster on my systems (both Linux/Windows, P4 and Core2, all compiled w/ -O -inline -release). The relative performance difference gets larger as n increases:
>
>  n    malloc       GC
> ------------------------
> 100   0.094      0.328
> 200   0.140      1.859
> 300   0.203      6.094
> 400   0.312     14.141
> 500   0.547     27.625

I'm sorry, but using your code, I can't reproduce times anywhere near that. I'm on Windows, DMD, Athlon X2 64.  Here are my results:

Phobos:

 n    malloc       GC
------------------------
100  0.005206   0.005285
200  0.045083   0.045199
300  0.148954   0.148920
400  0.400136   0.404554
500  0.933754   1.076060

Tango:

 n    malloc       GC
------------------------
100  0.005221   0.005298
200  0.045342   0.044910
300  0.150753   0.149157
400  0.402951   0.403343
500  0.946041   1.073466

Tested with both Tango and Phobos to be sure, and the times are not really any different between the two.

The malloc and GC times don't really differ until n=500, and even then it's not by much.


June 16, 2008
Jarrett Billingsley Wrote:

> I'm sorry, but using your code, I can't reproduce times anywhere near that.

Maybe it depends on hardware? The effectiveness of `new` may depend on the hardware used.

My /proc/cpuinfo:
Intel Celeron 1.5GHz
flags: fpu, vme, de, tsc, msr, pae, mce, cx8, apic, sep, mtrr, pge, mca, cmov, pat, clflush, dts, acpi, mmx, fxsr, sse, sse2, ss, tm, pbe, nx

June 16, 2008
Saaa Wrote:

> baleog, are you Marco? (same IP)
No
> What kind of hardware do you have?
HP Compaq nx6110. Ubuntu Linux 6.06
> Because Marco also had some strange speed problems I couldn't replicate.

June 16, 2008
On 2008-06-16 16:40:16 +0200, Fawzi Mohamed <fmohamed@mac.com> said:

> On 2008-06-16 16:32:56 +0200, Fawzi Mohamed <fmohamed@mac.com> said:
> 
>> On 2008-06-15 13:53:30 +0200, baleog <maccarka@yahoo.com> said:
>> 
>>> Thank you for your replies! I used malloc instead of new and the run time was about 1 sec
>> 
>> But you probably did not understand why... and it seems that neither did the others around here...
>> 
>> It is indeed a subtle pitfall that is easy to fall into.
>> 
>> When you benchmark:
>> 1) Print something that depends on the result, like the sum of everything (it is not the main issue in this case, but doing so would probably have exposed the problem); that way you also have at least a small chance of noticing if your algorithm is wrong.
>> 
>> 2) NaNs
> 
> ehm, sorry...
> You do initialize everything...
> ehm, never post without testing...
> 
> Fawzi

I tested... and well I was actually right (I should have trusted my gut feeling a little more...)

NaN is the culprit.

Check your algorithm: you initialize (backwards, for some strange reason) only part of the arrays...
Putting
 xs[] = 1.0;
 ys[] = 2.0;
instead of your strange loops solves everything...
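
A minimal sketch of the difference (with n standing for the matrix dimension, as in the posted benchmark):

void main()
{
    int n = 500;
    float[] xs = new float[n * n];
    float[] ys = new float[n * n];

    // the posted loops ran i from n-1 down to 0, touching only the
    // first n of the n*n elements and leaving the rest as NaN:
    //     for (int i = n - 1; i >= 0; --i) xs[i] = 1.0;

    // slice assignment initializes every element:
    xs[] = 1.0;
    ys[] = 2.0;
}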

Fawzi