May 16, 2010
Walter Bright:

Thank you for your answers and explanations.

> The back end can be improved for floating point, but for integer/pointer work it is excellent.

Only practical experiments can tell if you are right.
(A raytracer uses a lot of floating point ops. My small raytracers compiled with dmd are some times slower than the same code compiled with ldc.)


> It does not do link time code generation nor profile guided optimization, although in my experiments such features pay off only in a small minority of cases.

I agree that profile guided optimization on GCC usually pays little, so I usually don't use it with GCC. My theory is that it is not using the profiling information well enough yet. Reading the asm output of the Java HotSpot (and this is not easy to do) has shown me that HotSpot performs some things that GCC isn't doing yet, and that in numerical programs they give a good performance increase. Here I have shown one of the simplest and most effective optimizations done by HotSpot thanks to the profile information it continuously collects:
http://llvm.org/bugs/show_bug.cgi?id=5540

Link time optimization, as done by LDC, has given a good speedup in several of my programs, and I like it. It lets all the other compiler optimizations be applied more effectively. It's able to decrease the program size too.


> In my experiments on vector array operations, the improvement from the CPU's vector instructions is disappointing. It always seems to get hopelessly bottlenecked by memory bandwidth.

dmd array operations are currently not so useful.
But there are several ways to vectorize code that can give very large (up to 10-16 times) speedups on numerical code.

This is one of the kinds of vectorization: http://gcc.gnu.org/wiki/Graphite http://wiki.llvm.org/Polyhedral_optimization_framework

Another kind of vectorization is performing up to three levels of tiling (when the implemented algorithm is not cache oblivious).
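
For readers who have not met tiling before, here is a minimal sketch in C of one level of it, applied to a matrix multiplication. The function name `matmul_tiled`, the row-major layout and the tile edge of 32 are assumptions made for illustration and do not come from this thread; a useful tile size depends on the cache sizes of the target CPU.

```c
#include <stddef.h>

/* Illustrative tile edge (an assumption); tune it to the target cache. */
#define TILE 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* c = a * b for square n x n row-major matrices, with one level of tiling.
   Each (ii, kk, jj) step works on blocks of at most TILE x TILE elements,
   so the data it touches can stay in cache while it is reused. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Further levels of tiling would wrap the same pattern in outer block loops sized for the larger caches.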

Another kind of vectorization is using all the fields of an SSE (and future AVX) register in parallel. Doing this well seems very hard for compilers, and I don't know why: llvm is not able to do it, gcc does it a bit in some situations, and I don't know exactly what the intel compiler does here, but I think it performs it only if the given C code is written in a specific way that you often have to find by time-consuming trial and error. So this optimization is often done manually, writing asm by hand... if you look at the asm written in video decoders you can see that it's many times faster than the asm produced from C by the best compilers.
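
As an illustration of using all the fields of an SSE register at once, here is a minimal sketch in C with compiler intrinsics instead of hand-written asm. The function name `add_floats` is hypothetical. The `__SSE__` guard is an assumption about the compiler (GCC and Clang define it on x86 targets with SSE enabled); where it is not defined, the code falls back to the plain scalar loop.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* out[i] = x[i] + y[i]. With SSE, four floats are added per instruction,
   one in each 32-bit field of a 128-bit register. */
void add_floats(size_t n, const float *x, const float *y, float *out)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);  /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
#endif
    /* Scalar loop: handles the tail, or everything when SSE is absent. */
    for (; i < n; i++)
        out[i] = x[i] + y[i];
}
```

The unaligned load and store keep the sketch safe for any pointers; aligned variants exist but require 16-byte aligned data.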

Then there are true parallel optimizations, which means using more than one CPU core to perform operations. Examples are Cilk, parallel fors done in a simple way or in a more refined way as in the Chapel language, and then there are the various message passing implementations, etc.
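
A parallel for "done in a simple way" might look like the following C sketch. It uses OpenMP, which is not mentioned in this thread and is chosen here only as a widely available example; the function name `dot` is hypothetical. When the compiler is not run with OpenMP enabled, the pragma is skipped and the loop runs serially.

```c
#include <stddef.h>

/* Dot product of two arrays. With OpenMP enabled, the iterations are
   split between the cores and the partial sums are combined at the end. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    long i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the partial sums are combined in an unspecified order, results on floating point data can differ slightly from the serial loop.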

If you have heavy numerical code and you combine all those things you can often get code 40 times faster or more. To perform all such optimizations you need smart compilers and/or a language that gives a lot of semantics to the back-end (as Cilk, Chapel and Fortress do).

Bye,
bearophile
May 16, 2010
Alex Makhotin:

> May I ask why you are planning to port an existing codebase to D? What kind of benefits specifically (except performance comparable to C) do you expect from D?

At the moment performance (if compared to C++ code compiled with GCC or ICC) is not a selling point of D.
But D can be advertised for its other qualities: compared to C or C++ it's very nice to write D code, it's more handy, and a little safer. This can be enough to justify a switch from C++ to D :-)
A problem with such an advertising strategy is that a lot of people I know don't seem to be looking for a better C++; it seems they want to keep themselves away from anything that smells a bit of C++ :-(

Bye,
bearophile
May 16, 2010
bearophile wrote:
> Walter Bright:
> 
>> This is not true of D. In D, the compiler can<
> 
> Thank you for your answers. At the moment D compilers aren't doing this,

Yes, they are. dmd definitely inlines across source modules.


> The second optimization it talks about is custom calling conventions:
> 
>> Normally, all functions are either cdecl, stdcall, or fastcall. With custom
>> calling conventions, the back end has enough knowledge that it can pass
>> more values in registers, and less on the stack. This usually cuts code
>> size and improves performance.<

Right, dmd doesn't do custom calling conventions. But, it is not necessary for D to have the linker do them. As I explained, the compiler has as much source available to it as the user wishes to supply.


> The third optimization it talks about is 'Small TLS Encoding':
> 
>> When you use __declspec(thread) variables, the code generator stores the
>> variables at a fixed offset in each per-thread data area. Without LTCG, the
>> code generator has no idea of how many __declspec(thread) variables there
>> will be. As such, it must generate code that assumes the worst, and uses a
>> four-byte offset to access the variable. With LTCG, the code generator has
>> the opportunity to examine all __declspec(thread) variables, and note how
>> often they're used. The code generator can put the smaller, more frequently
>> used variables at the beginning of the per-thread data area and use a
>> one-byte offset to access them.<

Yes, but you won't find this to be a speed improvement. The various addressing modes all run at the same speed. Furthermore, the use of global variables (and that includes TLS) should be minimized. Use of TLS (or any globals) in a tight loop should be avoided on general principles in favor of caching the value in a local. I don't believe this optimization is worth the effort.
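
The advice about caching a TLS value in a local can be sketched in C as follows. The variable `total` and the function names are hypothetical, and `_Thread_local` assumes a C11 compiler (older GCC and Clang spell it `__thread`, MSVC `__declspec(thread)`).

```c
#include <stddef.h>

/* Thread-local accumulator (hypothetical name). */
static _Thread_local long total = 0;

/* Touches the TLS variable on every iteration of the loop. */
void sum_direct(size_t n, const int *v)
{
    for (size_t i = 0; i < n; i++)
        total += v[i];
}

/* Reads the TLS variable once, works on a local, writes back once. */
void sum_cached(size_t n, const int *v)
{
    long local = total;
    for (size_t i = 0; i < n; i++)
        local += v[i];
    total = local;
}

/* Returns the calling thread's current total. */
long get_total(void)
{
    return total;
}
```

An optimizer may already perform this hoisting in simple cases; writing it by hand makes the intent explicit when the loop body calls functions the compiler cannot see through.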

Many compilers spend a lot of time trying to optimize access to statics and globals. This ain't low hanging fruit for any but badly written programs.
May 16, 2010
Robert Clipsham wrote:
> LDC and GDC have no such
> restrictions, you can include them as long as you don't modify the
> source, and if you do then you distribute the source as well as the
> binaries.
> 
	That's a common misconception about the GPL: you have to distribute
the source even if you didn't modify it.

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



May 16, 2010
Walter Bright:

> Right, dmd doesn't do custom calling conventions. But, it is not necessary for D to have the linker do them. As I explained, the compiler has as much source available to it as the user wishes to supply.

I'll talk about this a bit with LLVM devs.
Thank you for all your explanations, you often teach me things.

Bye,
bearophile
May 17, 2010
"Jérôme M. Berger", on May 16 at 22:50, you wrote to me:
> Robert Clipsham wrote:
> > LDC and GDC have no such
> > restrictions, you can include them as long as you don't modify the
> > source, and if you do then you distribute the source as well as the
> > binaries.
> > 
> 	That's a common misconception about the GPL: you have to distribute
> the source even if you didn't modify it.

The source must be available. You usually don't distribute the source if you didn't modify the program because anyone can find it in the original place. But when you do modify it, you must provide a way to access the source.

-- 
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
More than 50% of the people in the world have never made
Or received a telephone call
May 17, 2010
On Sun, 16 May 2010 10:27:57 -0400, Dan W <twinbee42@skytopia.com> wrote:
> 5: How about compatibility with GPGPU stuff like CUDA and OpenCL? Can I just as
> easily write GPGPU programs which run as fast as I can with C/C++?

I have some decent CUDA bindings with a nice high level API that I'd be willing to share/open source. But you still have to write the actual GPU kernels in C/C++.
May 17, 2010
Leandro Lucarella wrote:
> "Jérôme M. Berger", on May 16 at 22:50, you wrote to me:
>> Robert Clipsham wrote:
>>> LDC and GDC have no such
>>> restrictions, you can include them as long as you don't modify the
>>> source, and if you do then you distribute the source as well as the
>>> binaries.
>>>
>> 	That's a common misconception about the GPL: you have to distribute
>> the source even if you didn't modify it.
> 
> The source must be available. You usually don't distribute the source if you didn't modify the program because anyone can find it in the original place. But when you do modify it, you must provide a way to access the source.
> 
	Here is the relevant section in GPLv2:
==============================8<------------------------------
  3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

    a) Accompany it with the complete corresponding machine-readable
    source code, which must be distributed under the terms of Sections
    1 and 2 above on a medium customarily used for software
    interchange; or,

    b) Accompany it with a written offer, valid for at least three
    years, to give any third party, for a charge no more than your
    cost of physically performing source distribution, a complete
    machine-readable copy of the corresponding source code, to be
    distributed under the terms of Sections 1 and 2 above on a medium
    customarily used for software interchange; or,

    c) Accompany it with the information you received as to the offer
    to distribute corresponding source code.  (This alternative is
    allowed only for noncommercial distribution and only if you
    received the program in object code or executable form with such
    an offer, in accord with Subsection b above.)
------------------------------>8==============================

	Since the OP was talking about a commercial distribution, point (c)
does not apply and therefore, source must be redistributed.

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



May 18, 2010
Hi all, due to the slow speed of my browser and multiple posts, I'll be posting just one email which covers everything. Please let me know if replying to each individually is really preferred. Many thanks for all and any help.


> May I ask why you are planning to port an existing codebase to D? What kind of benefits specifically (except performance comparable to C) do you expect from D?
>
> Thank you.

Sure. There's a couple of reasons really. First is that a lot of 'fluff' in
C is rectified in D so that declarations and header files are a thing of
the past. Hence less repetition and housekeeping.
Second reason is (and I know this might sound idealistic), it'd be nice
to promote D more, and get more people using it, since it is a step up
from C in many regards.

My code is still fairly small (certainly less than 1 million lines :) ), so it
won't be too much hassle.

Walter said:

> It does not do link time code generation nor profile guided optimization, although in my experiments such features pay off only in a small minority of cases.

In VC++, PGO is a great speed help because of inlining, but from what you said
later, this doesn't seem to be so much of an issue with D as (like you said),
it has access to all the code anyway. I'm a little concerned though about the floating
point performance, as raytracing does quite a bit of this of course.

The DMC++ compiler you mentioned sounds interesting too. I'd like to compare performance with that, the VC++ one, and the Intel compiler.

Thanks to Robert, for recommending VisualD and the bindings. I might try all three D compilers to see which gets the best speed, but perhaps LDC seems most promising from what you've said. I suppose in the future when many-core becomes prevalent that compiler optimization won't be so much of an issue because of the relative simplicity compared to the tricks of the present day CPU.

One issue I have with the Visual C++ compiler is that it doesn't seem to support
loop unswitching (i.e. doubling up code with boolean If statements). I wonder if
one of the D compilers supports it. I started a thread over at cprogramming
about it here: http://cboard.cprogramming.com/c-programming/126756-lack-compiler-loop-optimization-loop-unswitching.html
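
For reference, loop unswitching can be sketched in C as the hand-made transformation below. The function names and the operation performed are hypothetical and chosen only to show the shape of the change; nothing here says which compilers perform it automatically.

```c
#include <stddef.h>

/* Original form: the loop-invariant flag is tested on every iteration. */
void scale_or_negate(size_t n, float *v, int negate)
{
    for (size_t i = 0; i < n; i++) {
        if (negate)
            v[i] = -v[i];
        else
            v[i] = 2.0f * v[i];
    }
}

/* Unswitched form: the test is hoisted out and the loop is duplicated,
   so each copy of the loop body is branch-free. */
void scale_or_negate_unswitched(size_t n, float *v, int negate)
{
    if (negate) {
        for (size_t i = 0; i < n; i++)
            v[i] = -v[i];
    } else {
        for (size_t i = 0; i < n; i++)
            v[i] = 2.0f * v[i];
    }
}
```

The cost is code size: every invariant condition that is unswitched doubles the number of loop copies.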


> I have some decent CUDA bindings with a nice high level API that I'd be willing to share/open source. But you still have to write the actual GPU kernels in C/C++.

Thanks, I'll bear those in mind.

Cheers, Dan
May 18, 2010
%u wrote:
> The DMC++ compiler you mentioned sounds interesting too. I'd like to compare
> performance with that, the VC++ one, and the Intel compiler.

When comparing D performance with C++, it is best to compare compilers with the same back end, i.e.:

   dmd with dmc
   gcc with gdc
   llvm-gcc or clang with ldc

This is because back ends can vary greatly in the code generated.