December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On 12/22/10, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> Without any empirical testing, I would guess this has something to do with the lack of inlining for algorithmic functions. This is due primarily to uses of enforce, which use lazy parameters, which are currently not inlinable (also, ensure you use -O -release -inline for the most optimized code).
>
I have just tried removing enforce usage from Phobos and recompiling the library, and compiling again with -O -release -inline. It doesn't appear to make a difference in the timing speed over multiple runs.
|
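For readers unfamiliar with the lazy parameters mentioned above: a lazy argument is evaluated only when the callee actually reads it, which is why enforce can take an expensive message expression cheaply, and also why such calls are harder to inline. The sketch below is only an illustration, not Phobos code; `check` is a hypothetical stand-in for enforce, and `expensiveMessage` and `evaluations` are made-up names.

```d
import std.stdio;

// Counts how often the lazy argument is actually evaluated (illustrative only).
int evaluations;

// Hypothetical helper that stands for a costly message expression.
string expensiveMessage()
{
    ++evaluations;
    return "condition failed";
}

// Hypothetical stand-in for enforce: `msg` is evaluated only on failure.
T check(T)(T value, lazy string msg)
{
    if (!value)
        throw new Exception(msg); // reading `msg` evaluates the argument here
    return value;
}

void main()
{
    check(1 > 0, expensiveMessage());
    writeln("evaluations after a passing check: ", evaluations);

    try
    {
        check(1 < 0, expensiveMessage());
    }
    catch (Exception e)
    {
        writeln("evaluations after a failing check: ", evaluations);
    }
}
```

The passing call never builds the message; only the failing call does.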
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gary Whatmore | On 12/22/2010 5:31 PM, Gary Whatmore wrote:
> 5) you were using old d runtime garbage collector. One fellow here made a precise state of the art GC which beats even Java's 20 year old GC and C#. Patch your dmd to use this instead.
Could you point me to more information? This sounds interesting.
|
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | bearophile wrote:
> Surely Lua looks like a far worse language regarding optimization
> opportunities. But people around here (like you) must start to realize that
> JIT compilation is not what it used to be. Today the JIT compilation done by
> the JavaVM is able to perform de-virtualization, dynamic loop unrolling,
The Java JIT did that 15 years ago. I think you forget that I wrote on a Java compiler way back in the day (the companion JIT was done by Steve Russell, yep, the Optlink guy).
> inlining across "compilation units",
dmd does cross-module inlining.
> and some other optimizations that
> despite are "not language issues" are not done or not done enough by static
> compilers like LDC, GCC, DMD. The result is that SciMark2 benchmark is about
> as fast in Java and C, and for some sub-benchmarks it is faster :-)
Inherent Java slowdowns are not in numerical code. The Java language isn't inherently worse at numerics than C, C++, D, etc. Where Java is inherently worse is in its excessive reliance on dynamic allocation (and that is rare in numeric code - you don't "new" a double).
|
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrej Mitrovic | Andrej Mitrovic wrote:
> I have just tried removing enforce usage from Phobos and recompiling
> the library, and compiling again with -O -release -inline. It doesn't
> appear to make a difference in the timing speed over multiple runs.
Try looking at the obj2asm dump of the inner loop.
|
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | > I think you forget that I wrote on a Java compiler way back in the day
I remember it :-)
> dmd does cross-module inlining.
I didn't know this, much...
Bye,
bearophile
|
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andreas Mayer | On 12/22/10 4:04 PM, Andreas Mayer wrote:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:
>
>> auto L = iota(0.0, 10000000.0);
>> auto L2 = map!"a / 2"(L);
>> auto L3 = map!"a + 2"(L2);
>> auto V = reduce!"a + b"(L3);
>
> It runs in 281 ms on my computer.
Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is:
auto L = iota(0.0, 10000000.0);
auto V = reduce!"a + b"(L3);
There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there.
The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative.
Another issue is the implementation of reduce. Reduce is fairly general which may mean that it generates mediocre code for that particular case. We can always optimize the general case and perhaps specialize for select cases.
Once we figure where the problem is, there are numerous possibilities to improve the code:
1. Have iota check in the constructor whether the limits allow ++ to be precise. If so, use that. Of course, that means an extra runtime test...
3. Give up on iota being a random access or bidirectional range. If it's a forward range, we don't need to worry about going backwards.
4. Improve reduce as described above.
> The same code in Lua (using LuaJIT) runs in 23 ms.
>
> That's about 10 times faster. I would have expected D to be faster. Did I do something wrong?
>
> The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast.
>
> The full D version is here: http://pastebin.com/R5AGHyPx
> The Lua version: http://pastebin.com/Sa7rp6uz
> Lua version that emulates ranges: http://pastebin.com/eAKMSWyr
>
> Could someone help me solving this mystery?
>
> Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Thanks very much for taking the time to measure and post results, this is very helpful. As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment. Nevertheless, we need to look into improving this particular microbenchmark.
Please don't forget to print the result of the computation in both languages, as there's always the possibility of some oversight.
Andrei
|
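The two ways of producing the value at position i that the post contrasts (start + i * step versus repeated increments) can be sketched as follows. This is a simplified illustration under my own assumptions, not the Phobos implementation of iota; `byIndex` and `byIncrement` are hypothetical names.

```d
import std.stdio;

// Value at position i computed from the index: one multiply and one add per
// element, but the result does not depend on how the position was reached.
double byIndex(double start, double step, size_t i)
{
    return start + i * step;
}

// Value at position i reached by repeated increments: one add per element,
// but rounding error can accumulate when `step` is not exactly representable.
double byIncrement(double start, double step, size_t i)
{
    double current = start;
    foreach (_; 0 .. i)
        current += step;
    return current;
}

void main()
{
    // With step 1.0 every intermediate value is exact, so both agree.
    assert(byIndex(0.0, 1.0, 1000) == byIncrement(0.0, 1.0, 1000));

    // With step 0.1 the two may differ in the last digits, depending on the
    // platform and on the precision used for intermediate results.
    writefln("%.17g", byIndex(0.0, 0.1, 10));
    writefln("%.17g", byIncrement(0.0, 0.1, 10));
}
```

With a step such as 1.0 the cheaper increment gives the same values, which is the case option 1 in the list above would detect in the constructor.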
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei:
> As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment.
From other tests I have seen that often FP-heavy code is faster with Lua-JIT than with D-DMD. But on average the speed difference is much less than 10 times, generally no more than 2 times.
One benchmark, Lua and D code (both OOP and C-style included, plus several manually optimized D versions):
http://tinyurl.com/yeo2g8j
Bye,
bearophile
|
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andreas Mayer | On 12/22/10 4:04 PM, Andreas Mayer wrote:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:
>
>> auto L = iota(0.0, 10000000.0);
>> auto L2 = map!"a / 2"(L);
>> auto L3 = map!"a + 2"(L2);
>> auto V = reduce!"a + b"(L3);
>
> It runs in 281 ms on my computer.
>
> The same code in Lua (using LuaJIT) runs in 23 ms.
>
> That's about 10 times faster. I would have expected D to be faster. Did I do something wrong?
>
> The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast.
>
> The full D version is here: http://pastebin.com/R5AGHyPx
> The Lua version: http://pastebin.com/Sa7rp6uz
> Lua version that emulates ranges: http://pastebin.com/eAKMSWyr
>
> Could someone help me solving this mystery?
>
> Or is D, unlike I thought, not suitable for high performance computing? What should I do?
I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms.
I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop.
Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks!
Andrei
// D version, with std.algorithm
// ~ 281 ms, using dmd 2.051 (dmd -O -release -inline)
import std.algorithm;
import std.stdio;
import std.range;
import std.traits;

// Simplified iota: a forward range that advances by incrementing.
struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S) {
    private N start, end, current;
    private S step;
    this(N start, N end, S step)
    {
        this.start = start;
        this.end = end;
        this.step = step;
        current = start;
    }
    /// Range primitives
    @property bool empty() const { return current >= end; }
    /// Ditto
    @property N front() { return current; }
    /// Ditto
    alias front moveFront;
    /// Ditto
    void popFront()
    {
        assert(!empty);
        current += step;
    }
    @property Iota2 save() { return this; }
}

auto iota2(B, E, S)(B begin, E end, S step)
if (is(typeof((E.init - B.init) + 1 * S.init)))
{
    return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step);
}

// Run with any command-line argument to time iota2; without one, the baseline loop.
void main(string[] args) {
    double result;
    auto limit = 10_000_000.0;
    if (args.length > 1) {
        writeln("iota");
        auto L = iota2(0.0, limit, 1.0);
        result = reduce!"a + b"(L);
    } else {
        writeln("baseline");
        result = 0.0;
        for (double i = 0; i != limit; ++i) {
            result += i;
        }
    }
    writefln("%f", result);
}
|
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | That's odd, I'm getting opposite results:
iota = 78ms
baseline = 187ms
Andreas' old code gives: 421ms
This is over multiple runs so I'm getting the average out of about 20 runs.
On 12/23/10, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms.
>
> I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop.
>
> Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks!
>
> Andrei
|
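Since the numbers in this thread are averages over repeated runs, here is one way such an average could be taken inside the program itself. This is a sketch under my own assumptions: it uses `std.datetime.stopwatch` from current Phobos (not available in dmd 2.051), and `sumLoop` and `meanMillis` are hypothetical names, not anything posted in the thread.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio;

// Handwritten baseline: sums 0, 1, ..., limit - 1.
double sumLoop(double limit)
{
    double result = 0.0;
    for (double i = 0; i < limit; ++i)
        result += i;
    return result;
}

// Calls fun(limit) `runs` times and returns the mean wall-clock time in
// milliseconds. The last result is stored in `sink` so the caller can print
// it and the work is not discarded.
double meanMillis(alias fun)(double limit, int runs, ref double sink)
{
    auto sw = StopWatch(AutoStart.no);
    foreach (_; 0 .. runs)
    {
        sw.start();
        sink = fun(limit);
        sw.stop(); // elapsed time accumulates across start/stop pairs
    }
    return sw.peek.total!"usecs" / 1000.0 / runs;
}

void main()
{
    double sink;
    immutable ms = meanMillis!sumLoop(10_000_000.0, 20, sink);
    writefln("baseline: %.1f ms on average (result %f)", ms, sink);
}
```

Printing the result alongside the timing follows the advice given earlier in the thread to always print the computed value.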
December 23, 2010 Re: Why is D slower than LuaJIT? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu Wrote:
> Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks!
Your version: 40 ms (iota and baseline give the same timings)
LuaJIT with map calls removed: 21 ms
Interesting results.
|