From a C++/JS benchmark
August 03, 2011
The benchmark info: http://chadaustin.me/2011/01/digging-into-javascript-performance/

The code, in C++, JS, Java, C#:
https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core.

D2 version translated from the C# version (the C++ version uses struct inheritance!): http://ideone.com/kf1tz

Bye,
bearophile
August 03, 2011
Compilers:
C++:  cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:10240000
Java: Oracle Java 1.6 with hm... Oracle default settings
C#:   Csc /optimize+
D2:   dmd -O -noboundscheck -inline -release

Type column: working scalar type
Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").

System: Windows XP, Core 2 Duo E6850

-----------------------------------------------------------
  Type  |    C++     |    Java    |     C#     |     D2
-----------------------------------------------------------
float   | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double  | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real    | 32_300_000 |   no real  |   no real  |    203_000
int     | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long    | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
-----------------------------------------------------------
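The table above reports throughput per working scalar type. As a rough illustration of how such a figure can be obtained, here is a minimal C++ sketch templated on the scalar type. It is not the benchmark's code: `Vec3`, `Mat3x4`, `transform` and `vertices_per_second` are hypothetical names, and the kernel is a plain affine transform rather than the full skinning loop.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the benchmark's data types;
// T plays the role of the "working scalar type" in the table.
template <typename T>
struct Vec3 { T x, y, z; };

// 3x4 affine matrix, row-major: each row is three rotation terms
// followed by one translation term.
template <typename T>
struct Mat3x4 { T m[12]; };

// Transform one vertex position by an affine matrix.
template <typename T>
Vec3<T> transform(const Mat3x4<T>& a, const Vec3<T>& v) {
    Vec3<T> r;
    r.x = a.m[0] * v.x + a.m[1] * v.y + a.m[2]  * v.z + a.m[3];
    r.y = a.m[4] * v.x + a.m[5] * v.y + a.m[6]  * v.z + a.m[7];
    r.z = a.m[8] * v.x + a.m[9] * v.y + a.m[10] * v.z + a.m[11];
    return r;
}

// Run `passes` passes over `in` and report throughput in vertices per second.
// If the work finishes within one clock tick the result is not meaningful,
// so use enough vertices and passes to get a measurable duration.
template <typename T>
double vertices_per_second(const Mat3x4<T>& a, const std::vector<Vec3<T>>& in,
                           std::vector<Vec3<T>>& out, int passes) {
    using clock = std::chrono::steady_clock;
    out.resize(in.size());
    const auto start = clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < in.size(); ++i)
            out[i] = transform(a, in[i]);
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return static_cast<double>(in.size()) * passes / elapsed.count();
}
```

Instantiating `vertices_per_second<float>`, `<double>`, `<int>` and so on gives one number per row. A serious benchmark would also have to make sure the optimizer does not drop the repeated passes, which this sketch does not attempt.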

The JavaScript vs C++ speed comparison is at the first link of bearophile's original post, and JS is about 10-20 times slower than C++.
Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.
August 03, 2011
I believe that "long" in this case is 32 bits in C++, and 64-bits in the remaining languages, hence the same result for int and long in C++. Try with "long long" maybe? :)
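To make the width difference concrete, here is a minimal C++ sketch that reports the width of `int`, `long` and `long long` for whatever compiler builds it. The helper names `bits_of` and `print_integer_widths` are made up for this illustration. With Microsoft's compilers `long` is 32 bits, while D, Java and C# define `long` as 64 bits, so `long long` is the comparable C++ type.

```cpp
#include <cassert>
#include <climits>
#include <cstdio>

// Width in bits of an integer type on the current platform.
template <typename T>
constexpr int bits_of() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// The C++ standard guarantees at least 64 bits for long long,
// but only at least 32 bits for long.
static_assert(bits_of<long long>() >= 64, "long long must be at least 64 bits");
static_assert(bits_of<long>() >= 32, "long must be at least 32 bits");

// Print the widths so the two "long" rows of a benchmark can be compared fairly.
void print_integer_widths() {
    // long is never narrower than int.
    assert(bits_of<int>() <= bits_of<long>());
    std::printf("int:       %d bits\n", bits_of<int>());
    std::printf("long:      %d bits\n", bits_of<long>());
    std::printf("long long: %d bits\n", bits_of<long long>());
}
```

Calling `print_integer_widths()` from `main` shows at a glance whether the C++ "long" row is really measuring 64-bit arithmetic.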


--
Ziad


August 03, 2011
03.08.2011 22:15, Ziad Hatahet:
> I believe that "long" in this case is 32 bits in C++, and 64-bits in the
> remaining languages, hence the same result for int and long in C++. Try
> with "long long" maybe? :)

Good! This is my first blunder (it's so easy to completely forget a language design that I find illogical). So, here is the corrected last row:

  Type  |    C++     |    Java    |     C#     |     D2
-----------------------------------------------------------
long    |  5_500_000 |  6_600_000 |  4_400_000 |  5_800_000


Java is the fastest "long" language :)
August 03, 2011
> System: Windows XP, Core 2 Duo E6850

Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
August 03, 2011
On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
>> System: Windows XP, Core 2 Duo E6850
>
> Is this Windows XP 32 bit or 64 bit? That will probably make
> a difference on the longs I'd expect.

It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

David
August 03, 2011
03.08.2011 22:48, Adam D. Ruppe wrote:
>> System: Windows XP, Core 2 Duo E6850
>
> Is this Windows XP 32 bit or 64 bit? That will probably make
> a difference on the longs I'd expect.

I meant 32-bit Windows XP (5.1, Build 2600: Service Pack 3), which is what Wikipedia calls plain "Windows XP".
August 03, 2011
Denis Shelomovskij:

> (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").

For a more realistic test I suggest you also time the C++ version that uses the intrinsics (float only).


> Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.

Languages aren't slow or fast, their implementations produce assembly that's more or less efficient.

A D1 version fit for LDC V1 with Tango: http://codepad.org/ewDy31UH

Vertices (millions), Linux 32 bit:
  C++ no simd:  29.5
  D:            27.6

LDC based on DMD v1.057 and llvm 2.6, ldc -O3 -release -inline

G++ V4.3.3, -s -O3 -mfpmath=sse -ffast-math -msse3

It's a bit slower than the C++ version, but for most people that's an acceptable difference (and porting the C++ code to D instead of the C# code, and using a more modern LLVM, may reduce that loss a bit).

Bye,
bearophile
August 04, 2011
> Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.

I'm afraid not. dmd's backend isn't good at floating point calculations.
August 04, 2011
Trass3r:

> I'm afraid not. dmd's backend isn't good at floating point calculations.

Studying the asm a bit, it's not hard to find the cause, because this benchmark is quite pure (it's synthetic, although I think it comes from real-world code).

This is what G++ generates from the C++ code without intrinsics (the version that uses SIMD intrinsics has a similar look but it's shorter):

movl  (%eax), %edx
movss  4(%eax), %xmm0
movl  8(%eax), %ecx
leal  (%edx,%edx,2), %edx
sall  $4, %edx
addl  %ebx, %edx
testl  %ecx, %ecx
movss  12(%edx), %xmm1
movss  20(%edx), %xmm7
movss  (%edx), %xmm5
mulss  %xmm0, %xmm1
mulss  %xmm0, %xmm7
movss  4(%edx), %xmm6
movss  8(%edx), %xmm4
movss  %xmm1, (%esp)
mulss  %xmm0, %xmm5
movss  28(%edx), %xmm1
movss  %xmm7, 4(%esp)
mulss  %xmm0, %xmm6
movss  32(%edx), %xmm7
mulss  %xmm0, %xmm1
movss  16(%edx), %xmm3
mulss  %xmm0, %xmm7
movss  24(%edx), %xmm2
movss  %xmm1, 16(%esp)
mulss  %xmm0, %xmm4
movss  36(%edx), %xmm1
movss  %xmm7, 8(%esp)
mulss  %xmm0, %xmm3
movss  40(%edx), %xmm7
mulss  %xmm0, %xmm2
mulss  %xmm0, %xmm1
mulss  %xmm0, %xmm7
mulss  44(%edx), %xmm0
leal  12(%eax), %edx
movss  %xmm7, 12(%esp)
movss  %xmm0, 20(%esp)


This is what DMD generates for the same (or quite similar) piece of code:

movsd
mov  EAX,068h[ESP]
imul  EDX,EAX,030h
add  EDX,018h[ESP]
fld  float ptr [EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 038h[ESP]
fld  float ptr 4[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 03Ch[ESP]
fld  float ptr 8[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 040h[ESP]
fld  float ptr 0Ch[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 044h[ESP]
fld  float ptr 010h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 048h[ESP]
fld  float ptr 014h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 04Ch[ESP]
fld  float ptr 018h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 050h[ESP]
fld  float ptr 01Ch[EDX]
mov  CL,070h[ESP]
xor  CL,1
fmul  float ptr 06Ch[ESP]
fstp  float ptr 054h[ESP]
fld  float ptr 020h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 058h[ESP]
fld  float ptr 024h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 05Ch[ESP]
fld  float ptr 028h[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 060h[ESP]
fld  float ptr 02Ch[EDX]
fmul  float ptr 06Ch[ESP]
fstp  float ptr 064h[ESP]

I think the DMD back-end already contains logic to use xmm registers as true registers (not as a floating point stack or as temporary holes in which to push and pull FP values), so I suspect it wouldn't take too much work to modify it to emit FP asm with a single optimization: just keep the values inside registers. In my uninformed opinion all other FP optimizations are almost insignificant compared to this one :-)
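For readers who don't want to decode the listings, both appear to implement the same small step: pick a 48-byte record by index (the `leal`/`sall $4` pair and the `imul EDX,EAX,030h` both compute index * 48), then multiply its twelve floats by one weight. A hedged C++ reconstruction follows. The names `Matrix3x4` and `scale_bone` are assumptions for illustration and are not taken from the benchmark source.

```cpp
#include <cassert>
#include <cstddef>

// Twelve floats = 48 bytes, matching the 030h stride seen in both listings.
// Hypothetical type; the real benchmark may lay its data out differently.
struct Matrix3x4 { float m[12]; };

// Scale every element of the selected matrix by `weight`.
// G++ keeps `weight` in xmm0 for all twelve multiplies, while the DMD
// listing re-reads it from the stack (06Ch[ESP]) for every fmul and
// routes each value through fld/fmul/fstp.
Matrix3x4 scale_bone(const Matrix3x4* bones, std::size_t bone_count,
                     std::size_t index, float weight) {
    assert(index < bone_count);
    const Matrix3x4& src = bones[index];
    Matrix3x4 result;
    for (int i = 0; i < 12; ++i)
        result.m[i] = src.m[i] * weight;
    return result;
}
```

Compiling a function like this with each compiler and comparing the output is a quick way to check whether a given back-end keeps the weight in a register.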

Bye,
bearophile