Jump to page: 1 2
Thread overview
Timing too good to be true
Sep 21, 2013
Paul Jurczak
Sep 21, 2013
Paul Jurczak
Sep 21, 2013
bearophile
Sep 22, 2013
Paul Jurczak
Sep 22, 2013
David Nadlinger
Sep 22, 2013
Paul Jurczak
Sep 22, 2013
Paul Jurczak
Sep 22, 2013
David Nadlinger
Sep 21, 2013
David Nadlinger
Sep 22, 2013
Paul Jurczak
Sep 23, 2013
John Colvin
September 21, 2013
I'm running this test program with multiple compilers on Xubuntu and I'm getting impossibly short timings with LDC compared to DMD and GDC (timing details in comments below). I would appreciate any ideas about what is going on here?

/*
Xubuntu 13.04 64-bit Core i5 3450S 2.8GHz:
GDC 4.8.1:     gdc-4.8 -m64 -march=native -fno-bounds-check -frename-registers -frelease -O3
669171001  826ns  e28_0
669171001  824ns  e28_1

DMD64 2.063.2: dmd -O -noboundscheck -inline -release
669171001  1115ns  e28_0
669171001  1955ns  e28_1

LDC 0.11.0: ldmd2 -m64 -O -noboundscheck -inline -release
669171001  25ns  e28_0
669171001  25ns  e28_1
*/

module main;

import std.stdio, std.algorithm, std.range, std.datetime, std.typetuple;


int e28_0(int N = 1002) {
	int diagNumber = 1;					
	int sum        = diagNumber;	

	for (int width = 2; width < N; width += 2)	
		for (int j = 0; j < 4; ++j) {			
			diagNumber += width;				
			sum        += diagNumber;			
		}

	return sum;
}


int e28_1(int N = 1002) {
	int diagNumber = 1;					
	int sum        = diagNumber;	

	foreach (width; iota(2, N, 2))
		foreach (_; 0..4) {			
			diagNumber += width;				
			sum        += diagNumber;			
		}

	return sum;
}


void main()
{
  enum      N = 100_000;
  StopWatch sw;

   foreach (iFun, fun; TypeTuple!(e28_0, e28_1)) {
      int [N] results;
      long[N] timings;

      foreach (i; 0..N) {
         sw.reset;
         sw.start;
         results[i] = fun();
         sw.stop;
         timings[i] = sw.peek.nsecs;
      }

      writeln(results.reduce!min, "  ", timings.reduce!min, "ns  e28_", iFun);
  }
}


September 21, 2013
On Saturday, 21 September 2013 at 22:07:27 UTC, Paul Jurczak wrote:
> [...]

One more thing: I have multiple versions of gcc installed, but my ldc installation was from binaries provided on dlang.org. I'm not sure if there are any gcc dependencies, which could cause this problem.
September 21, 2013
Paul Jurczak:

> I'm running this test program with multiple compilers on Xubuntu and I'm getting impossibly short timings with LDC compared to DMD and GDC (timing details in comments below). I would appreciate any ideas about what is going on here?

Use -output-s on ldc and show us the asm of just the two functions.

Bye,
bearophile
September 21, 2013
On Sun, Sep 22, 2013 at 12:07 AM, Paul Jurczak <pauljurczak@yahoo.com> wrote:
> I'm running this test program with multiple compilers on Xubuntu and I'm
> getting impossibly short timings with LDC compared to DMD and GDC (timing
> details in comments below). I would appreciate any ideas about what is going
> on here?
> [...]

First, a remark unrelated to the actual issue at hand: Your program uses at least 2.4 MB of stack (for the result/timing variables), which is large enough to cause issues e.g. on Win32, where the default stack size is 1 MB.

That being said, the LDC-compiled version is so fast because the LLVM optimizer constant-folds the entire e28 functions away. Thus, you are timing the single store of the constant 669171001 to a memory location (or rather, the overhead involved in calling the StopWatch functions).

Hope this answers your question,
David
September 22, 2013
On Saturday, 21 September 2013 at 22:47:16 UTC, bearophile wrote:
> Paul Jurczak:
>
>> I'm running this test program with multiple compilers on Xubuntu and I'm getting impossibly short timings with LDC compared to DMD and GDC (timing details in comments below). I would appreciate any ideas about what is going on here?
>
> Use -output-s on ldc and show us the asm of just the two functions.
>
> Bye,
> bearophile

Here it is:

	.globl	_D4main5e28_0FiZi
	.align	16, 0x90
	.type	_D4main5e28_0FiZi,@function
_D4main5e28_0FiZi:
	.cfi_startproc
	movl	$1, %eax
	cmpl	$3, %edi
	jl	.LBB0_3
	movl	$2, %r8d
	movl	$20, %edx
	movl	$8, %esi
	movl	%eax, %ecx
	.align	16, 0x90
.LBB0_2:
	addl	%edx, %eax
	leal	(%rax,%rcx,4), %eax
	addl	%esi, %ecx
	addl	$8, %esi
	addl	$20, %edx
	addl	$2, %r8d
	cmpl	%edi, %r8d
	jl	.LBB0_2
.LBB0_3:
	ret
.Ltmp0:
	.size	_D4main5e28_0FiZi, .Ltmp0-_D4main5e28_0FiZi
	.cfi_endproc

	.globl	_D4main5e28_1FiZi
	.align	16, 0x90
	.type	_D4main5e28_1FiZi,@function
_D4main5e28_1FiZi:
	.cfi_startproc
	movl	$1, %eax
	cmpl	$3, %edi
	jl	.LBB1_4
	leal	-3(%rdi), %ecx
	shrl	$31, %ecx
	leal	-3(%rdi,%rcx), %ecx
	andl	$-2, %ecx
	negl	%ecx
	leal	-3(%rdi,%rcx), %ecx
	leal	1(%rdi), %edx
	subl	%ecx, %edx
	cmpl	$2, %edx
	je	.LBB1_4
	decl	%edi
	subl	%ecx, %edi
	movl	$1, %ecx
	movl	$20, %edx
	movl	$8, %esi
	movl	$1, %eax
	.align	16, 0x90
.LBB1_3:
	addl	%edx, %eax
	leal	(%rax,%rcx,4), %eax
	addl	%esi, %ecx
	addl	$8, %esi
	addl	$20, %edx
	addl	$-2, %edi
	jne	.LBB1_3
.LBB1_4:
	ret
.Ltmp1:
	.size	_D4main5e28_1FiZi, .Ltmp1-_D4main5e28_1FiZi
	.cfi_endproc


September 22, 2013
On Sun, Sep 22, 2013 at 2:27 AM, Paul Jurczak <pauljurczak@yahoo.com> wrote:
> On Saturday, 21 September 2013 at 22:47:16 UTC, bearophile wrote:
>>
>> Paul Jurczak:
>>
>>> I'm running this test program with multiple compilers on Xubuntu and I'm getting impossibly short timings with LDC compared to DMD and GDC (timing details in comments below). I would appreciate any ideas about what is going on here?
>>
>>
>> Use -output-s on ldc and show us the asm of just the two functions.
>>
>> Bye,
>> bearophile
>
>
> Here it is:
> [...]

As mentioned in my previous message, this is actually irrelevant for your benchmark, as the functions are not even executed once by the optimized test program.

David
September 22, 2013
On Saturday, 21 September 2013 at 23:15:05 UTC, David Nadlinger wrote:
>> [...]
>
> First, a remark unrelated to the actual issue at hand: Your program
> uses at least 2.4 MB of stack (for the result/timing variables), which
> is large enough to cause issues e.g. on Win32, where the default stack
> size is 1 MB.

Thanks, that explains the segfault when I was using N = 1_000_000.

> That being said, the LDC-compiled version is so fast because the LLVM
> optimizer constant-folds the entire e28 functions away. Thus, you are
> timing the single store of the constant 669171001 to a memory location
> (or rather, the overhead involved in calling the StopWatch functions).
>
> Hope this answers your question,
> David

It sounds convincing, but I tried it again with small Ns. For N = 1, function is called only once, so the minimum of all timings should be accurate, but the result is still unbelievably short:

N = 1
669171001  57ns  e28_0
669171001  58ns  e28_1

N = 2
669171001  42ns  e28_0
669171001  43ns  e28_1

N = 10
669171001  26ns  e28_0
669171001  25ns  e28_1


September 22, 2013
On Sunday, 22 September 2013 at 00:39:43 UTC, David Nadlinger wrote:
>> [...]
>
> As mentioned in my previous message, this is actually irrelevant for
> your benchmark, as the functions are not even executed once by the
> optimized test program.
>
> David

If it is not executed even once, how is the value computed? At compile time?
September 22, 2013
On Sunday, 22 September 2013 at 00:58:42 UTC, Paul Jurczak wrote:
> On Sunday, 22 September 2013 at 00:39:43 UTC, David Nadlinger wrote:
>>> [...]
>>
>> As mentioned in my previous message, this is actually irrelevant for
>> your benchmark, as the functions are not even executed once by the
>> optimized test program.
>>
>> David
>
> If it is not executed even once, how is the value computed? At compile time?

OK, you've wrote "optimizer constant-folds the entire e28 functions away", which means compile time. Very clever optimizer! To defeat this behavior I changed timing loop to:

      foreach (i; 0..N) {
         sw.reset;
         sw.start;
         results[i] = fun(1002 + i%2); // a trick to defeat LLVM optimizer
         sw.stop;
         timings[i] = sw.peek.nsecs;
      }

and now I'm getting sensible results:

GDC 4.8.1:     gdc-4.8 -m64 -march=native -fno-bounds-check -frename-registers -frelease -O3
669171001  830ns  e28_0
669171001  830ns  e28_1

DMD64 2.063.2: dmd -O -noboundscheck -inline -release
669171001  1115ns  e28_0
669171001  1958ns  e28_1

LDC 0.11.0: ldmd2 -m64 -O -noboundscheck -inline -release
669171001  454ns  e28_0
669171001  395ns  e28_1

LDC is smoking fast here!
September 22, 2013
On Sun, Sep 22, 2013 at 6:48 AM, Paul Jurczak <pauljurczak@yahoo.com> wrote:
> OK, you've wrote "optimizer constant-folds the entire e28 functions away", which means compile time. Very clever optimizer! To defeat this behavior I changed timing loop to:

By the way, there is an LDC-specific pragma to selectively disable
inlining for some functions that you could have used to the same
effect: http://wiki.dlang.org/LDC-specific_language_changes#LDC_never_inline.

David
« First   ‹ Prev
1 2