March 19, 2008
For benchmarks that operate at a higher, language level, see this comparison of XML libraries, where D comes out on top with Tango's implementation: http://dotnot.org/blog/index.php

Here are the slides of the presentation where the underlying ideas were
discussed:
http://s3.amazonaws.com/dconf2007/Kris_Array_Slicing.pdf
March 19, 2008
Walter Bright Wrote:

> Matthew Allen wrote:
> > I am looking to use D for programming a high speed vision application which was previously written in C/C++. I have done some arbitrary speed tests and am finding that C/C++ seems to be faster than D by a factor of about 3. I have done some simple loop tests that increment a float value by some number and also some memory allocation/deallocation loops, and C/C++ seems to come out on top each time. Is D meant to be faster or as fast as C/C++, and if so, how can I optimize the code? I am using -inline, -O, and -release.
> > 
> > An example of a simple loop test I ran is as follows:
> > 
> > DWORD start = timeGetTime(); int i,j,k; float dx=0; for(i=0;
> > i<1000;i++) for(j=0; j<1000;j++) for(k=0; k<10; k++) { dx++; } DWORD
> > end = timeGetTime();
> > 
> > In C++, ints and doubles. The C++ came back with a time of 15ms, and D came back with 45ms.
> 
> Loop unrolling could be a big issue here. DMD doesn't do loop unrolling, but that is not a language issue at all, it's an optimizer issue. It's easy enough to check - get the assembler output of the loop from your compiler and post it here.


Walter, you are right. That was the issue. I added a function call into the loop and D came out a lot faster than C++.

Thanks for a great language!!!

March 20, 2008
On Thu, 20 Mar 2008 00:44:02 +0300, Matthew Allen <mallen@creativelifestyles.removeme.com> wrote:

> Walter Bright Wrote:
>
>> Matthew Allen wrote:
>> > I am looking to use D for programming a high speed vision application
>> > which was previously written in C/C++. I have done some arbitrary
>> > speed tests and am finding that C/C++ seems to be faster than D by a
>> > factor of about 3. I have done some simple loop tests that
>> > increment a float value by some number and also some memory
>> > allocation/deallocation loops, and C/C++ seems to come out on top each
>> > time. Is D meant to be faster or as fast as C/C++, and if so, how can
>> > I optimize the code? I am using -inline, -O, and -release.
>> >
>> > An example of a simple loop test I ran is as follows:
>> >
>> > DWORD start = timeGetTime(); int i,j,k; float dx=0; for(i=0;
>> > i<1000;i++) for(j=0; j<1000;j++) for(k=0; k<10; k++) { dx++; } DWORD
>> > end = timeGetTime();
>> >
>> > In C++, ints and doubles. The C++ came back with a time of 15ms, and D
>> > came back with 45ms.
>>
>> Loop unrolling could be a big issue here. DMD doesn't do loop unrolling,
>> but that is not a language issue at all, it's an optimizer issue. It's
>> easy enough to check - get the assembler output of the loop from your
>> compiler and post it here.
>
>
> Walter you are right. That was the issue. I added a function call into the loop and D came out a lot faster than C++.
>
> Thanks for a great language!!!
>


Obviously, you cannot say that a D integer increment is faster or slower than a C++ one; that simply doesn't depend on the language design. You could compare DMD to GCC, and that would make about as much sense as comparing GCC to, say, ICC or DMC. The difference is a matter of the optimization techniques used by the compilers, not of the languages. Certainly, higher-level language design has some impact on general performance (things like the GC or constant-time array slicing), but not on low-level variable increments, loops, etc. General D performance will only increase as more vendors produce commercial D compilers.
March 21, 2008
Matthew Allen wrote:
> Walter you are right. That was the issue. I added a function call
> into the loop and D came out a lot faster than C++.
> 
> Thanks for a great language!!!

You're welcome. To compare D vs C++ as languages, rather than the optimizers or code generators, the most straightforward way is to benchmark dmd vs dmc, and gdc vs gcc.

March 26, 2008
Saaa wrote:
>> If you really want to improve performance on C or C++, do it by
>> profiling your program, and optimize parts where it matters how
>> fast you go.

To Saaa: it's pretty late here now, but I'll try to address some of this
briefly.

In general (I have to start with this; and this post, like every post in
a NG, should be for a broader audience than the one nominally replied
to), improving performance means only one thing: having the computer do
less to achieve the goal.

>> - simplify

Any time you look at your code from two weeks ago, you'll probably find
that the same thing could be done more easily, with fewer lines of code,
and with less effort for the computer.

>> - remove unnecessary loops

For example, summing the integers from 1 to 100 doesn't mean you have to
write the obvious loop. Thinking about it more gives Sum = 1+100 + 2+99
+ ..., and refining that further gives 50 * (1+100) = 5050. (Courtesy of
Gauss, http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss, ca. 1785, at
about 7 years old!)

>> - hoist stuff out of loops as much as possible

    for (int i = 0; i < 100; i++)
    {
        if (i % 3 == 0) sum += i;
    }

converted to

    for (int i = 0; i < 34; i++)
    {
        sum += 3 * i;
    }
// a toy example, but the modulo test is gone and the loop body now runs
// 34 times instead of 100

>> - iterate or recurse in ways that ease cache miss penalties

Modern processors have caches ranging in size from hundreds of kilobytes
to some megabytes, usually with a separate cache for instructions, and
another for data. Program code, for example, is fetched into the cache
from memory around where the current instruction is located. Thus, the
instructions following the current one are already in the cache, which
the processor can access much faster than main memory.

Now, if you have your code so that, say, a loop you're executing
contains subroutine or function calls that actually reside far apart in
memory, then it might be that the processor can't keep all of them in
the cache. One way to ease this is to see to it that the needed
functions reside close to each other in RAM. One way of trying to do
this is to have them next to each other in the source code. (Please,
again, understand that it's late here, etc., so I'm not being precise,
more like conveying the general idea.)

The same goes for data. Suppose you're Chloe [in "24", the TV series],
and the Boss gives you 25 minutes to filter out which of our 2 million
suspects have made any of 100 million recent cellular calls.

You'd have to write a D(!!!!) program to get it done, right? Now,
matching 2 million in an array against 100 million in a stream (yes,
you'd do it against a stream) would be the first tack to take. But 25
minutes wouldn't be enough. So, you write your program so that, first,
the 2 million suspects are sorted in order of "suspectability", right?
The extra time used for this is more than made up for when the actual
routine is run, because now most of the "suspects" appear regularly in
the comparisons, and therefore "stay in the cache".

(Again, it's late here, but you get the idea.)

>> - iterate instead of recurse as much as possible

Any textbook on programming and recursion gives the example of computing
the Fibonacci numbers. Just write the code both recursively and
iteratively (both straight from the textbook) and time the results. The
difference is staggering.

>> - reduce if/else if/else || && as much as possible
> 
> Is this because of the obvious `less calculations is better`, or
> something else?

Well, any single calculation that you do _within_ the loop is performed
as many times as the loop runs. In _most_ cases, what you'd have inside
the loop at first thought can pretty easily be moved outside the loop.
This is especially true with D, where you can have compile-time things
calculated automatically, but it also holds in other languages: quite a
lot of the stuff you'd initially have inside the loop can either be
calculated once beforehand, or the whole thing can be converted to
calculating other (simpler) things, with some function "rectifying" the
end result either before or after the loop. (See the for-loop example
above.)

>> - multiply by inverse instead of divide where possible

Division is poison for computers. It's also poison for math
coprocessors. (Hey, those of us old enough to have had to do division of
large floating point numbers with pencil and paper at school know
intuitively that multiplication is a piece of cake by comparison. Who
here would venture to do 123455.2525 / 4.7110211 with only pencil and
paper?)

Now, if instead one could choose to do 123455.2525 * 0.212268206568
(whether with pencil and paper or with the math coprocessor), the
result would be obtained _much_ faster. And with a _lot_ less headache.

>> - reduce calls to the OS where it's sensible

(This is ridiculous, but I hope it gets the idea across.) Suppose an
idiot, Ivan, has to write a program that sums up the sizes of all files
on a hard disk. He might first create a function that lists all the
files in a directory with separate OS queries, then edit it to be
recursive, then make a list of all the files found with this method,
and then ask the operating system one by one for the sizes of these
files.

Some operating systems have a single call that returns a list of the
file names and (among other things) the file sizes and types. Wouldn't
it be faster to use this, sum up the sizes, and then do the same for any
of the entries that turn out to be directories?

(A comment about reducing calls to the OS: it's not as clear-cut as the
above example might suggest. Not all of us use Windows, where "_any_" OS
call is a "waste of time". So one should know, or seek wisdom in the OS
docs (or even the source code, where available), before making
judgements on the efficacy of particular tactics.)

> - don't allocate and then delete is you need a ~same amount of memory
>  afterwards.

(Ignoring the typo.)

New-ing and delete-ing are expensive operations (time-wise). Now,
especially if you know up front that you will need to do a lot of both,
it might be smart to first consider (or have your app estimate, or even
guess) how many of these you might need at most at the same time. Then
it would be smart to allocate an array containing that many of whatever
it is you need to allocate (be it integers, structs, objects, or
whatever).

Then you could write a function myNew(object_or_whatever) that, instead
of allocating with new, would just look at the array and find the first
empty slot in it. Similarly, for delete, you could have a function that
frees the particular slot in this array. While this may not sound much
faster than using new and delete (and I'm too tired now to explain it
properly), it is a tried and proven method of making a program run
_much_ faster.

---

>> If you really want to improve performance on C or C++, do it by
>> profiling your program, and optimize parts where it matters how
>> fast you go.

(Quoted again, from the top of this post.) I'd rewrite the above quote, to:

    If you really want to improve performance, do it by profiling.
    And, improve only where the profiling shows you're slow.

In any language.
March 26, 2008
Georg Wrede wrote:
...
> Now, if you have your code so that, say, a loop you're executing contains subroutine or function calls that actually reside far apart in memory, then it might be that the processor can't keep all of them in the cache. One way to ease this is to see to it that the needed functions reside close to each other in RAM. One way of trying to do this is to have them next to each other in the source code. ...

Another way is to use the profiler built into dmd. It can generate a def file with the optimal link order gathered from empirical results:

http://www.digitalmars.com/ctg/trace.html

It's only for win32 however.
March 26, 2008
lutger:
> Another way is to use the profiler built into dmd. It can generate a def
> file with the optimal link order gathered from empirical results:
> http://www.digitalmars.com/ctg/trace.html
> It's only for win32 however.

I think the DMD docs deserve to include such a thing too, then.

Bye,
bearophile