April 16, 2012
On Monday, 16 April 2012 at 07:28:25 UTC, Andrea Fontana wrote:
> Are you on linux/windows/mac?

Windows.

My main question is now *WHY* D is slower than C++ in this program. The code is identical (it even calls the same C functions) in the performance-critical parts, I'm using the "same" compiler backend (GDC/g++), and D is supposed to be a fast compiled language. Yet it is up to 50% slower.

What is D doing more than C++ in this program that accounts for the lost CPU cycles? Or what prevents the D program from being optimized to the C++ level? The D front-end?
April 16, 2012
On 04/17/2012 12:24 AM, ReneSac wrote:
> On Monday, 16 April 2012 at 07:28:25 UTC, Andrea Fontana wrote:
>> Are you on linux/windows/mac?
>
> Windows.
>

DMC runtime !

> My main question is now *WHY* D is slower than C++ in this program? The
> code is identical (even the same C functions)

No. They are not the same. The performance difference is probably explained by the dmc runtime vs. glibc difference, because your biased results are not reproducible on a linux system where glibc is used for both versions.

> in the
> performance-critical parts, I'm using the "same" compiler backend
> (gdc/g++), and D was supposed to a fast compilable language.

s/was/is/

> Yet it is up to 50% slower.
>

This is a fallacy. Benchmarks can only compare implementations, not languages. Furthermore, it is usually the case that benchmarks that have surprising results don't measure what they intend to measure. Your program is apparently rather I/O bound.

> What is D doing more than C++ in this program, that accounts for the
> lost CPU cycles?
> Or what prevents the D program to be optimized to the
> C++ level? The D front-end?

The difference is likely because of differences in external C libraries.
April 17, 2012
On Monday, 16 April 2012 at 22:58:08 UTC, Timon Gehr wrote:
> On 04/17/2012 12:24 AM, ReneSac wrote:
>> Windows.
>>
>
> DMC runtime !
DMC = Digital Mars Compiler? Does MinGW/GDC use that? I think that both g++- and GDC-compiled binaries use the MinGW runtime, but I'm not sure either.

> No. They are not the same. The performance difference is probably explained by the dmc runtime vs. glibc difference, because your biased results are not reproducible on a linux system where glibc is used for both versions.
Ok, I will benchmark on Linux later.

> This is a fallacy. Benchmarks can only compare implementations, not languages. Furthermore, it is usually the case that benchmarks that have surprising results don't measure what they intend to measure. Your program is apparently rather I/O bound.
Yeah, I'm comparing the implementations, and I made it clear that the GDC front-end may be the "bottleneck".

And I don't think it is I/O bound. It runs at only around 10 MB/s, whereas my HD can do ~100 MB/s. Furthermore, on more compressible files, where the speed was higher, the difference between D and C++ was higher too. And if it is in fact I/O bound, then D is MORE than 50% slower than C++.

> The difference is likely because of differences in external C libraries.
Both the D and C++ versions use C's stdio library. What is the difference?

April 17, 2012
On Tuesday, 17 April 2012 at 01:30:30 UTC, ReneSac wrote:
> On Monday, 16 April 2012 at 22:58:08 UTC, Timon Gehr wrote:
>> On 04/17/2012 12:24 AM, ReneSac wrote:
>>> Windows.
>>>
>>
>> DMC runtime !
> DMC = Digital Mars Compiler? Does Mingw/GDC uses that? I think that both, g++ and GDC compiled binaries, use the mingw runtime, but I'm not sure also.
>
>> No. They are not the same. The performance difference is probably explained by the dmc runtime vs. glibc difference, because your biased results are not reproducible on a linux system where glibc is used for both versions.
> Ok, I will benchmark on linux latter.
>
>> This is a fallacy. Benchmarks can only compare implementations, not languages. Furthermore, it is usually the case that benchmarks that have surprising results don't measure what they intend to measure. Your program is apparently rather I/O bound.
> Yeah, I'm comparing the implementation, and I made it clear that it may be the GDC front-end that may be the "bottleneck".
>
> And I don't think it is I/O bound. It is only around 10MB/s, whereas my HD can do ~100MB/s. Furthermore, on files more compressible, where the speed was higher, the difference between D and C++ was higher too. And if is in fact I/O bound, then D is MORE than 50% slower than C++.
>
>> The difference is likely because of differences in external C libraries.
> Both, the D and C++ versions, use C's stdio library. What is the difference?

Have you tried profiling it? On Windows you can use AMD CodeAnalyst for that, it works pretty well in my experience and it's free of charge.
April 17, 2012
>
>
> DMC = Digital Mars Compiler? Does Mingw/GDC uses that? I think that both, g++ and GDC compiled binaries, use the mingw runtime, but I'm not sure also.


You're right: only DMD uses the DMC runtime; GDC uses MinGW's.


>
> And I don't think it is I/O bound. It is only around 10MB/s, whereas my HD can do ~100MB/s. Furthermore, on files more compressible, where the speed was higher, the difference between D and C++ was higher too. And if is in fact I/O bound, then D is MORE than 50% slower than C++.


To minimize system load and I/O impact, run the same file in a loop, say 5-10 times; after the first pass it will sit in the kernel cache.
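A minimal sketch of that benchmarking loop in D (the file name and size here are made up for illustration; you would call the real compressor instead of the plain `read`, and `std.datetime.stopwatch` is today's Phobos module name):

```d
// Time repeated reads of the same file; after the first pass the data
// comes from the OS page cache, not the disk.
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.file : read, remove, write;
import std.stdio : writefln;

void main()
{
    enum file = "bench.tmp";
    write(file, new ubyte[1 << 20]);   // 1 MiB dummy input file
    scope (exit) remove(file);

    foreach (run; 0 .. 5)              // 5-10 repetitions, as suggested
    {
        auto sw = StopWatch(AutoStart.yes);
        auto data = read(file);        // run the real benchmark here instead
        sw.stop();
        writefln("run %s: %s bytes in %s ms",
                 run, data.length, sw.peek.total!"msecs");
    }
}
```

After the first iteration the timings should stabilize, since disk latency has dropped out of the measurement.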

>> The difference is likely because of differences in external C libraries.
>
> Both the D and C++ versions use C's stdio library. What is the difference?

Probably because the GDC backend does a worse job optimizing the D AST than
g++ does on the C++ AST. I've got the following results:

C:\D>echo off
FF=a.doc
C++ compress a.doc
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.36 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.36 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.33 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.34 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.34 s.
"C++ decompress"
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.50 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.51 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.51 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.50 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.50 s.
"D compress"
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.11 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.08 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.08 s.
"D decompress"
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.19 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.19 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.22 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.25 s.
"Done"

So, what's up? I used the same backend here too, but DMC (-o -6) vs DMD
(-release -inline -O -noboundscheck).
I don't know DMC's optimization flags, so its results could probably be
better.

Let's try compiling with MS CL (-Ox):

C:\D>echo off
FF=a.doc
C++ compress a.doc
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.03 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.02 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.04 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.03 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.01 s.
"C++ decompress"
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.08 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.06 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.07 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.07 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.07 s.
"D compress"
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.08 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
"D decompress"
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.15 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.19 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
"Done"

Much better for C++, but D is not that much worse, at about 1.1× the C++ time.

What we see: different compiler, different story. We should compare
compilers for performance, not languages!
There are many differences between compilers and environments, and C++ is
definitely much more mature for performance right now, but D has its own
benefits (faster development/debugging and more reliable code).

Thanks,
Oleg.
PS. BTW, this code can still be optimized quite a lot.


April 24, 2012
Am Sat, 14 Apr 2012 19:31:40 -0700
schrieb Jonathan M Davis <jmdavisProg@gmx.com>:

> On Sunday, April 15, 2012 04:21:09 Joseph Rushton Wakeling wrote:
> > On 14/04/12 23:03, q66 wrote:
> > > He also uses a class. And -noboundscheck should be automatically induced
> > > by
> > > -release.
> > 
> > ... but the methods are marked as final -- shouldn't that substantially reduce any speed hit from using class instead of struct?
> 
> In theory. If they don't override anything, then that signals to the compiler that they don't need to be virtual, in which case, they _shouldn't_ be virtual, but that's up to the compiler to optimize, and I don't know how good it is about that right now.
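For illustration, a minimal sketch of the point about `final` methods. `Predictor` here is a simplified stand-in for the benchmark's class, not its actual code:

```d
// A final method that overrides nothing needs no vtable dispatch,
// so the compiler is free to call (and inline) it directly.
class Predictor
{
    int cxt = 1;

    final int p() const pure nothrow   // final: devirtualizable
    {
        return cxt;
    }
}

// A struct has no vtable at all, so its calls are always direct.
struct PredictorS
{
    int cxt = 1;
    int p() const pure nothrow { return cxt; }
}

void main()
{
    auto c = new Predictor;
    PredictorS s;
    assert(c.p() == s.p());   // same result either way; only dispatch differs
}
```

Whether the compiler actually takes advantage of the `final` annotation is exactly the open question in this exchange.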

<cynicism>
May I point to this: http://d.puremagic.com/issues/show_bug.cgi?id=7865
</cynicism>

-- 
Marco

April 24, 2012
Am Sat, 14 Apr 2012 21:05:36 +0200
schrieb "ReneSac" <reneduani@yahoo.com.br>:

> I have this simple binary arithmetic coder in C++ by Mahoney, translated to D by Maffi. I added "nothrow", "final" and "pure" and "GC.disable" where possible, but that didn't make much difference. Adding "const" to Predictor.p() (as in the C++ version) gave 3% higher performance. Here are the two versions:
> 
> http://mattmahoney.net/dc/  <-- original zip
> 
> http://pastebin.com/55x9dT9C  <-- Original C++ version.
> http://pastebin.com/TYT7XdwX  <-- Modified D translation.
> 
> The problem is that the D version is 50% slower:
> 
> test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)
> 
> Lang| Comp  | Binary size | Time (lower is better)
> C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
> D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
> D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s
> 
> 
> The only difference I could see between the C++ and D versions is that the C++ one has hints to the compiler about which functions to inline, and I couldn't find anything similar in D. So I manually inlined the encode and decode functions:
> 
> http://pastebin.com/N4nuyVMh  - Manual inline
> 
> D    (DMD)  -     228kb   -  3.70s  (153%)   -O -release -inline
> D    (GDC)  -    1318kb   -  3.50s  (144%)   -O3 -frelease -s
> 
> Still, the D version is slower. What makes this speed difference? Is there any way to side-step this?
> 
> Note that this simple C++ version can be made more than 2 times faster with algorithmic and I/O optimizations, (ab)using templates, etc. So I'm not asking for generic speed optimizations, but only for things that may make the D code "more equal" to the C++ code.

I noticed the thread just now. I ported fast paq8 (fp8) to D, and with some careful D-ification and optimization it runs a bit faster than the original C program when compiled with GCC on Linux x86_64, Core 2 Duo. As others said, the files are cached in RAM anyway if there is enough available, so you should not be bound by your hard-drive speed.
I don't know this version of paq you ported the coder from, but I'll try to give you some hints on what I did to optimize the code.

- time portions of your main()
  is the time actually spent at start-up or in the compression?

- use structs where classes don't make your code cleaner

- wherever you have large arrays that you don't need initialized to .init, write:
  int[<large number>] arr = void;
  double[<large number>] arr = void;
  This disables default initialization, which may help you in inner loops.
  Remember that C++ doesn't default-initialize at all, so this is an obvious
  way to lose performance against that language.
  Also keep in mind that the .init for floating-point types is NaN:
  struct Foo { double[999999] bar; }
  is not a block of binary zeroes and hence cannot be stored in a .bss
  section of the executable, where it would take no space at all.
  struct Foo { double[999999] bar = void; }
  on the other hand will not bloat your executable by 7.6 MB!
  Be cautious with:
  class Foo { double[999999] bar = void; }
  A class's .init doesn't go into .bss either way. Another reason to use a
  struct where appropriate. (WARNING: usage of .bss on Linux/MacOS is currently
  broken in the compiler front-end. You'll only see the effect on Windows.)
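A small sketch of the difference between default and void initialization (the array size is kept tiny here just to make the example quick):

```d
// Default initialization writes double.init (NaN) into every element;
// `= void` skips the write entirely, leaving garbage behind.
import std.math : isNaN;

void main()
{
    double[4] a;           // default-initialized: every element is NaN
    double[4] b = void;    // uninitialized: no write cost, but garbage contents
    assert(isNaN(a[0]));   // .init for floating point is NaN, not 0
    b[] = 0.0;             // you must assign to b before reading it
    assert(b[3] == 0.0);
}
```

The saved initialization writes are exactly what C++ never pays for in the first place.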

- Mahoney used an Array class in my version of paq, which allocates via calloc.
  Do this as well; you can't win otherwise. Read up a bit on calloc if you want:
  it generally 'allocates' a special zeroed-out memory page multiple times.
  No matter how much memory you ask for, it won't really allocate anything until
  you *write* to it, at which point new memory is allocated for you and the
  zero page is copied into it.
  The D GC, on the other hand, allocates that memory and writes zeroes to it
  immediately.
  The effect is twofold: first, the calloc version will use much less RAM if
  the 'allocated' buffers aren't fully used (e.g. you compressed a small file).
  Second, the D GC version is slowed down by writing zeroes to all that memory.
  At high compression levels, paq8 uses ~2 GB of memory that is calloc'ed. You
  should _not_ try to use GC memory for that.
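A minimal sketch of allocating a buffer with calloc from D instead of the GC (the buffer size is illustrative; a real compressor would keep the pointer for its whole lifetime):

```d
// calloc maps zero pages lazily: untouched parts of the buffer cost no
// physical RAM, while `new ubyte[n]` zeroes the whole block up front.
import core.stdc.stdlib : calloc, free;

void main()
{
    enum n = 1 << 20;                       // 1 MiB for illustration
    auto p = cast(ubyte*) calloc(n, 1);
    assert(p !is null);                     // calloc can fail; check it
    scope (exit) free(p);                   // manual memory: remember to free

    assert(p[0] == 0 && p[n - 1] == 0);     // calloc guarantees zeroed memory
    ubyte[] buf = p[0 .. n];                // slice it for array-style access
    buf[42] = 7;
    assert(buf[42] == 7);
}
```

Slicing the raw pointer gives bounds-checked, array-style access in D while keeping the allocation itself outside the GC heap.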

- If there are data structures that are designed to fit into a CPU cache-line
  (I had one of those in paq8), make sure it still has the correct size in your
  D version. "static assert(Foo.sizeof == 64);" helped me find a bug there that
  resulted from switching from C bitfields to the D version (which is a library
  solution in Phobos).
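The size check itself is one line. The field layout below is invented for illustration (it is not the actual paq8 structure), but it shows the pattern with Phobos bitfields:

```d
// Pin a cache-line-sized struct at compile time, so a layout change
// (e.g. when porting C bitfields to std.bitmanip) fails the build
// instead of silently blowing the cache budget.
import std.bitmanip : bitfields;

struct Slot
{
    mixin(bitfields!(
        uint, "count", 10,
        uint, "state", 22));   // packed together into one 32-bit word
    uint hash;                 // 4 more bytes
    ubyte[56] history;         // pad out to one cache line
}

static assert(Slot.sizeof == 64,
    "Slot must stay exactly one 64-byte cache line");

void main() {}
```

If the bitfield widths no longer sum to a full word, or a field grows, the `static assert` reports it immediately at compile time.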

I hope that gives you some ideas what to look for. Good luck!

-- 
Marco

April 24, 2012
Marco Leise:

> I ported fast paq8 (fp8) to D, and with some careful D-ification and optimization it runs a bit faster than the original C program when compiled with the GCC on Linux x86_64, Core 2 Duo.

I guess you mean GDC.
With DMD, even if you are a good D programmer, it's not easy to beat that original C compressor :-)
Do you have a link to your D version?
Matt Mahoney is probably willing to put a link in his site to your D version.


> I don't know about this version of paq you ported the coder from,

It was a very basic coder.


>   The D GC on the other hand allocates that memory and writes zeroes to it immediately.

Is this always done the first time the memory is allocated by the GC?


>   The effect is two fold: First, the calloc version will use much less RAM, if
>   the 'allocated' buffers aren't fully used (e.g. you compressed a small file).

On the other hand in D you may allocate the memory in a more conscious way.


> "static assert(Foo.sizeof == 64);" helped me find a bug there that
>   resulted from switching from C bitfields to the D version (which is a library
>   solution in Phobos).

The Phobos D bitfields aren't required to mimic C, but that's an interesting case. Maybe it's an interesting difference to take a look at. Do you have the code of the two C/D versions?

Bye,
bearophile