Thread overview
Re: Programming language benchmark
Jun 25, 2011
bearophile
Jun 26, 2011
Don
Jun 26, 2011
Timon Gehr
Jun 26, 2011
Don
Jun 26, 2011
Walter Bright
Jun 28, 2011
Caligo
June 25, 2011
Don:

Sorry for my slow answer, I was quite busy for days.


> I've never heard that claim before. Do you have evidence for that?

I compare/convert code to D every day, so I am aware that D code compiled with DMD is often slower than C/C++ code compiled with GCC. Since some years I even keep a collection of snippets of slow code.

But I am also aware that the low performance has many different causes, like some missing inlining, missing loop unrolling, etc, so spotting a clear and small case of integer arithmetic code that causes a slow down, to give you evidence, is not easy. So I am sorry for my overly broad claim.


>If it is true, there's a strong possibility that it's a small, fixable issue (for example, DMD used to have terrible performance for ulong multiplication).<

You are right, the case I'm going to show is a precise problem that's fixable.

-----------------------

// C code
#include "limits.h"
#include "stdio.h"

int divideBySeven(int x) {
    return x / 7;
}

int main() {
    int i = INT_MAX;
    int r;
    while (i--)
        r = divideBySeven(i);

    printf("%d\n", r);
    return 0;
}

-----------------------

// D code
int divideBySeven(int x) {
    return x / 7;
}

void main() {
    int i = int.max;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
}

-----------------------

Asm from the C version:

_divideBySeven:
	pushl	%ebx
	movl	$-1840700269, %ebx
	movl	8(%esp), %ecx
	movl	%ebx, %eax
	popl	%ebx
	imull	%ecx
	leal	(%edx,%ecx), %eax
	sarl	$31, %ecx
	sarl	$2, %eax
	subl	%ecx, %eax
	ret

_main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebx
	movl	$-1840700269, %ebx
	pushl	%ecx
	subl	$20, %esp
	call	___main
	movl	$2147483646, %ecx
	.p2align 4,,10
L4:
	movl	%ecx, %eax
	imull	%ebx
	movl	%ecx, %eax
	addl	%ecx, %edx
	sarl	$31, %eax
	sarl	$2, %edx
	decl	%ecx
	subl	%eax, %edx
	cmpl	$-1, %ecx
	jne	L4
	movl	%edx, 4(%esp)
	movl	$LC0, (%esp)
	call	_printf
	addl	$20, %esp
	xorl	%eax, %eax
	popl	%ecx
	popl	%ebx
	leal	-4(%ecx), %esp
	ret
	.def	_printf;	.scl	2;	.type	32;	.endef

-----------------------

Asm from the D version:

_D9int_div_d13divideBySevenFiZi	comdat
		mov	ECX,7
		cdq
		idiv	ECX
		ret

__Dmain	comdat
L0:		push	EAX
		push	EBX
		mov	EBX,07FFFFFFFh
		push	ESI
		xor	ESI,ESI
		test	EBX,EBX
		lea	EBX,-1[EBX]
		je	L24
L11:		mov	EAX,EBX
		mov	ECX,7
		cdq
		idiv	ECX
		test	EBX,EBX
		mov	ESI,EAX
		lea	EBX,-1[EBX]
		jne	L11
L24:		push	ESI
		mov	EDX,offset FLAT:_DATA
		push	EDX
		call	near ptr _printf
		add	ESP,8
		xor	EAX,EAX
		pop	ESI
		pop	EBX
		pop	ECX
		ret

-----------------------

For a more real case see: http://d.puremagic.com/issues/show_bug.cgi?id=5607

Bye,
bearophile
June 26, 2011
bearophile wrote:
> Don:
> 
> Sorry for my slow answer, I was quite busy for days.
> 
> 
>> I've never heard that claim before. Do you have evidence for that?
> 
> I compare/convert code to D every day, so I am aware that D code compiled with DMD is often slower than C/C++ code compiled with GCC. Since some years I even keep a collection of snippets of slow code.
> 
> But I am also aware that the low performance has many different causes, like some missing inlining, missing loop unrolling, etc, so spotting a clear and small case of integer arithmetic code that causes a slow down, to give you evidence, is not easy. So I am sorry for my overly broad claim. 

It is true in general that DMD's inliner is not very good. I _suspect_ that it is the primary cause of most instances of poor integer performance. It's actually part of the front-end, not the back-end. So many of those performance problems won't apply to DMC.

It's also true that the DMD/DMC instruction scheduler doesn't schedule for modern processors. But last I checked, GCC wasn't really much better in practice (you have to be almost perfect to get a benefit from instruction scheduling these days, the hardware does a very good job on unscheduled code).

Otherwise, I don't think there's any major optimisation it misses. But it's quite likely that it misses several very specific minor optimizations.

>> If it is true, there's a strong possibility that it's a small, fixable issue (for example, DMD used to have terrible performance for ulong multiplication).<
> 
> You are right, the case I'm going to show is a precise problem that's fixable. 

[snip]
> -----------------------
> 
> For a more real case see:
> http://d.puremagic.com/issues/show_bug.cgi?id=5607


Thanks, that's helpful. It's a major speed difference (factor of 20, maybe) so it wouldn't have to occur very often to be noticeable.

June 26, 2011
Don wrote:
> bearophile wrote:
>> Don:
>>
>> Sorry for my slow answer, I was quite busy for days.
>>
>>
>>> I've never heard that claim before. Do you have evidence for that?
>>
>> I compare/convert code to D every day, so I am aware that D code compiled with
DMD is often slower than C/C++ code compiled with GCC. Since some years I even keep a
>> collection of snippets of slow code.
>>
>> But I am also aware that the low performance has many different causes, like
some missing inlining, missing loop unrolling, etc, so spotting a clear and small case of
> integer arithmetic code that causes a slow down, to give you evidence, is not
easy. So I am sorry for my overly broad claim.
>
> It is true in general that DMD's inliner is not very good. I _suspect_ that it is the primary cause of most instances of poor integer performance. It's actually part of the front-end, not the back-end. So many of those performance problems won't apply to DMC.
>
> It's also true that the DMD/DMC instruction scheduler doesn't schedule for modern processors. But last I checked, GCC wasn't really much better in practice (you have to be almost perfect to get a benefit from instruction scheduling these days, the hardware does a very good job on unscheduled code).
>
> Otherwise, I don't think there's any major optimisation it misses. But it's quite likely that it misses several very specific minor optimizations.
>
>>> If it is true, there's a strong possibility that it's a small, fixable issue
(for example, DMD used to have terrible performance for ulong multiplication).<
>>
>> You are right, the case I'm going to show is a precise problem that's fixable.
>
> [snip]
>> -----------------------
>>
>> For a more real case see: http://d.puremagic.com/issues/show_bug.cgi?id=5607
>
>
> Thanks, that's helpful. It's a major speed difference (factor of 20, maybe) so it wouldn't have to occur very often to be noticeable.

You may also want to have a look at this paper:

http://www.agner.org/optimize/optimizing_cpp.pdf

I don't know if it still accurately reflects the current state though. Interestingly, it says that DMC is already able to perform the optimization requested by bearophile.

On page 73 starts a tabular that is quite specific about which optimizations the DMC backend is lacking.

Cheers,
-Timon
June 26, 2011
Timon Gehr wrote:
> Don wrote:
>> bearophile wrote:
>>> Don:
>>>
>>> Sorry for my slow answer, I was quite busy for days.
>>>
>>>
>>>> I've never heard that claim before. Do you have evidence for that?
>>> I compare/convert code to D every day, so I am aware that D code compiled with
> DMD is often slower than C/C++ code compiled with GCC. Since some years I even keep a
>>> collection of snippets of slow code.
>>>
>>> But I am also aware that the low performance has many different causes, like
> some missing inlining, missing loop unrolling, etc, so spotting a clear and small
> case of
>> integer arithmetic code that causes a slow down, to give you evidence, is not
> easy. So I am sorry for my overly broad claim.
>> It is true in general that DMD's inliner is not very good. I _suspect_
>> that it is the primary cause of most instances of poor integer
>> performance. It's actually part of the front-end, not the back-end. So
>> many of those performance problems won't apply to DMC.
>>
>> It's also true that the DMD/DMC instruction scheduler doesn't schedule
>> for modern processors. But last I checked, GCC wasn't really much better
>> in practice (you have to be almost perfect to get a benefit from
>> instruction scheduling these days, the hardware does a very good job on
>> unscheduled code).
>>
>> Otherwise, I don't think there's any major optimisation it misses. But
>> it's quite likely that it misses several very specific minor optimizations.
>>
>>>> If it is true, there's a strong possibility that it's a small, fixable issue
> (for example, DMD used to have terrible performance for ulong multiplication).<
>>> You are right, the case I'm going to show is a precise problem that's fixable.
>> [snip]
>>> -----------------------
>>>
>>> For a more real case see:
>>> http://d.puremagic.com/issues/show_bug.cgi?id=5607
>>
>> Thanks, that's helpful. It's a major speed difference (factor of 20,
>> maybe) so it wouldn't have to occur very often to be noticeable.
> 
> You may also want to have a look at this paper:
> 
> http://www.agner.org/optimize/optimizing_cpp.pdf
> 
> I don't know if it still accurately reflects the current state though.

It's a little out of date, DMD now does a couple of things it didn't do when Agner did the testing. Incidentally I contributed a bit to that paper <g>.

> Interestingly, it says that DMC is already able to perform the optimization
> requested by bearophile.
> 
> On page 73 starts a tabular that is quite specific about which optimizations the
> DMC backend is lacking.
> 
> Cheers,
> -Timon
June 26, 2011
On 6/26/2011 2:24 PM, Don wrote:
>> You may also want to have a look at this paper:
>>
>> http://www.agner.org/optimize/optimizing_cpp.pdf
>>
>> I don't know if it still accurately reflects the current state though.
>
> It's a little out of date, DMD now does a couple of things it didn't do when
> Agner did the testing. Incidentally I contributed a bit to that paper <g>.
>

The table has several errors wrt DMC++. For example, DMC++ certainly does function inlining, constant propagation and the branch optimizations.
June 28, 2011
Kind of off topic, but a good place to get benchmark results for many of the programming languages is Sphere Online Judge: http://www.spoj.pl/problems/classical/

They accept solutions in D, but not many have been submitted.  I found a few:

http://www.spoj.pl/ranks/FCTRL/lang=D http://www.spoj.pl/ranks/HASHIT/lang=D http://www.spoj.pl/ranks/ONP/lang=D

Most of the fastest solutions are in C++, but D is pretty close. Maybe we could start submitting solutions :-)