December 23, 2014
> That's very different to my results.
>
> I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%

I checked again today and the results are interesting: on my PC I see no difference between std.math and core.stdc.math with ldc. Here are the results with all the compilers.

- with std.math:
dmd: 4 secs, 878 ms
ldc: 5 secs, 650 ms
gdc: 9 secs, 161 ms

- with core.stdc.math:
dmd: 5 secs, 991 ms
ldc: 5 secs, 572 ms
gdc: 7 secs, 957 ms
December 23, 2014
On Tuesday, 23 December 2014 at 10:20:04 UTC, Iov Gherman wrote:
>> That's very different to my results.
>>
>> I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%
>
> I checked again today and the results are interesting: on my PC I see no difference between std.math and core.stdc.math with ldc. Here are the results with all the compilers.
>
> - with std.math:
> dmd: 4 secs, 878 ms
> ldc: 5 secs, 650 ms
> gdc: 9 secs, 161 ms
>
> - with core.stdc.math:
> dmd: 5 secs, 991 ms
> ldc: 5 secs, 572 ms
> gdc: 7 secs, 957 ms

These multi-threaded benchmarks can be very sensitive to their environment; you should try running them with nice -20 and do multiple passes to get a rough idea of the variability in the results. Also, it's important to minimise the number of other running processes.
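
Something along these lines would do for the multi-pass part (just a sketch; work() and the pass count are hypothetical stand-ins for the actual benchmark body), run under nice -20:

import std.datetime, std.stdio;

double work()
{
	// hypothetical stand-in for the real benchmark body
	double s = 0;
	foreach (i; 0 .. 10_000_000)
		s += i * 0.5;
	return s;
}

void main()
{
	enum passes = 5;
	Duration best;
	Duration total;

	foreach (pass; 0 .. passes)
	{
		auto t1 = Clock.currTime();
		auto check = work();
		auto elapsed = Clock.currTime() - t1;
		total += elapsed;
		if (pass == 0 || elapsed < best)
			best = elapsed;
		writeln("pass ", pass, ": ", elapsed, " (check ", check, ")");
	}

	writeln("best:    ", best);
	writeln("average: ", total.total!"msecs" / passes, " ms");
}

Printing the per-pass times makes it obvious when one run was disturbed by something else on the machine.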
December 23, 2014
> These multi-threaded benchmarks can be very sensitive to their environment; you should try running them with nice -20 and do multiple passes to get a rough idea of the variability in the results. Also, it's important to minimise the number of other running processes.

I did not use the nice parameter, but I always ran them multiple times and took the average time. My system has very few running processes (a minimal Arch Linux with Xfce4), so I don't think other processes are affecting my tests in any way.
December 23, 2014
On Tuesday, 23 December 2014 at 10:39:13 UTC, Iov Gherman wrote:
>
>> These multi-threaded benchmarks can be very sensitive to their environment; you should try running them with nice -20 and do multiple passes to get a rough idea of the variability in the results. Also, it's important to minimise the number of other running processes.
>
> I did not use the nice parameter, but I always ran them multiple times and took the average time. My system has very few running processes (a minimal Arch Linux with Xfce4), so I don't think other processes are affecting my tests in any way.

And what about the single-threaded version?

Btw, one reason why DMD is faster is that it uses the x87 fyl2x instruction.

Here is a version for the other compilers:

import std.math, std.stdio, std.datetime;

enum SIZE = 100_000_000;

version(GNU)
{
	// GDC path: call the x87 fyl2x instruction directly via extended asm.
	// fyl2x computes ST(1) * log2(ST(0)), so with y = LN2 the result is ln(x).
	real mylog(double x) pure nothrow
	{
		real result;
		double y = LN2;
		asm
		{
			"fldl   %2\n"
			"fldl   %1\n"
			"fyl2x"
			: "=t" (result) : "m" (x), "m" (y);
		}
		return result;
	}
}
else
{
	// DMD/LDC path: std.math.yl2x(x, LN2) computes the same LN2 * log2(x) = ln(x).
	real mylog(double x) pure nothrow
	{
		return yl2x(x, LN2);
	}
}

void main() {
	
	auto t1 = Clock.currTime();
	auto logs = new double[SIZE];
	
	foreach (i; 0 .. SIZE)
	{
		logs[i] = mylog(i + 1.0);
	}

	auto t2 = Clock.currTime();

	writeln("time: ", (t2 - t1));
}

But it is only faster on Intel CPUs; on one of my AMD machines it is slower than core.stdc.math's log.
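
If you want to check which variant wins on your own CPU, a quick side-by-side timing like this should do (only a sketch: mylog here is just the non-GDC yl2x branch from above, and N is smaller so it needs less memory):

import std.datetime, std.math, std.stdio;
import cmath = core.stdc.math;

// same computation as the non-GDC branch above: LN2 * log2(x) == ln(x)
real mylog(double x) pure nothrow { return yl2x(x, LN2); }

void main()
{
	enum N = 10_000_000;
	auto buf = new double[N];

	auto t1 = Clock.currTime();
	foreach (i; 0 .. N)
		buf[i] = mylog(i + 1.0);
	writeln("yl2x-based mylog:   ", Clock.currTime() - t1);

	auto t2 = Clock.currTime();
	foreach (i; 0 .. N)
		buf[i] = cmath.log(i + 1.0);
	writeln("core.stdc.math log: ", Clock.currTime() - t2);
}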
December 23, 2014
On Tuesday, 23 December 2014 at 10:20:04 UTC, Iov Gherman wrote:
>> That's very different to my results.
>>
>> I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%
>
> I checked again today and the results are interesting: on my PC I see no difference between std.math and core.stdc.math with ldc. Here are the results with all the compilers.
>
> - with std.math:
> dmd: 4 secs, 878 ms
> ldc: 5 secs, 650 ms
> gdc: 9 secs, 161 ms
>
> - with core.stdc.math:
> dmd: 5 secs, 991 ms
> ldc: 5 secs, 572 ms
> gdc: 7 secs, 957 ms

Btw, I just noticed a small issue with the D vs. Java comparison: you start measuring in D before the allocation, but in the Java case after the allocation.
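
To make the comparison apples-to-apples, both programs should agree on whether the allocation is inside the timed region. A small sketch of the two orderings, with a made-up smaller N:

import std.datetime, std.math, std.stdio;

void main()
{
	enum N = 1_000_000;

	// What the D benchmark does: the clock starts before the allocation,
	// so the cost of new double[N] is part of the measured time.
	auto t1 = Clock.currTime();
	auto a = new double[N];
	foreach (i; 0 .. N)
		a[i] = log(i + 1.0);
	writeln("allocation included: ", Clock.currTime() - t1);

	// What the Java benchmark originally did: allocate first,
	// start the clock only afterwards.
	auto b = new double[N];
	auto t2 = Clock.currTime();
	foreach (i; 0 .. N)
		b[i] = log(i + 1.0);
	writeln("allocation excluded: ", Clock.currTime() - t2);
}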
December 23, 2014
On Monday, 22 December 2014 at 17:16:49 UTC, Iov Gherman wrote:
> On Monday, 22 December 2014 at 17:16:05 UTC, bachmeier wrote:
>> On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
>>> Hi Guys,
>>>
>>> First of all, thank you all for responding so quickly, it is so nice to see D having such an active community.
>>>
>>> As I said in my first post, I used no other parameters to dmd when compiling because I don't know too much about dmd compilation flags. I can't wait to try the flags Daniel suggested with dmd (-O -release -inline -noboundscheck) and the other two compilers (ldc2 and gdc). Thank you guys for your suggestions.
>>>
>>> Meanwhile, I created a git repository on GitHub and put all my code there. If you find any errors, please let me know. Because I am keeping the results in a big array, the programs take approximately 8 GB of RAM. If you don't have enough RAM, feel free to decrease the size of the array. For the Java code you will also need to change 'compile-run.bsh' and use the right memory parameters.
>>>
>>>
>>> Thank you all for helping,
>>> Iov
>>
>> Link to your repo?
>
> Sorry, forgot about it:
> https://github.com/ghermaniov/benchmarks

For POSIX-style threads, a per-thread workload of 200 calls to log seems rather small. It would be interesting to see a graph of execution time as a function of workgroup size.

Traditionally one would use a workgroup size of (nElements / nCores) or similar, in order to get all the cores working while also minimising pressure on the scheduler, inter-thread communication and so on.
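
With std.parallelism, for example, the work unit size can be set explicitly. A rough sketch of the (nElements / nCores) approach, reusing the SIZE from the posted benchmark and plain std.math.log as the body:

import std.datetime, std.math, std.parallelism, std.range, std.stdio;

void main()
{
	enum SIZE = 100_000_000;

	auto t1 = Clock.currTime();
	auto logs = new double[SIZE];

	// One chunk per core: keeps every core busy while minimising
	// scheduler pressure and inter-thread communication.
	immutable workUnit = SIZE / totalCPUs;

	foreach (i; parallel(iota(SIZE), workUnit))
		logs[i] = log(i + 1.0);

	writeln("time: ", Clock.currTime() - t1);
}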
December 23, 2014
> And what about the single-threaded version?

I just ran the single-threaded examples after moving the time start to before the array allocation; thanks for that, good catch. Java still gives better results:

- java:
21 secs, 612 ms

- with std.math:
dmd: 23 secs, 994 ms
ldc: 31 secs, 668 ms
gdc: 52 secs, 576 ms

- with core.stdc.math:
dmd: 30 secs, 724 ms
ldc: 30 secs, 988 ms
gdc: 25 secs, 970 ms
December 23, 2014
> Btw, I just noticed a small issue with the D vs. Java comparison: you start measuring in D before the allocation, but in the Java case after the allocation.

Here is the Java result for parallel processing after moving the start time to the first line of main. It is still the best result:

4 secs, 50 ms average
December 23, 2014
Forgot to mention that I pushed my changes to GitHub.
December 23, 2014
On Tuesday, 23 December 2014 at 12:26:28 UTC, Iov Gherman wrote:
>> And what about the single-threaded version?
>
> I just ran the single-threaded examples after moving the time start to before the array allocation; thanks for that, good catch. Java still gives better results:
>
> - java:
> 21 secs, 612 ms
>
> - with std.math:
> dmd: 23 secs, 994 ms
> ldc: 31 secs, 668 ms
> gdc: 52 secs, 576 ms
>
> - with core.stdc.math:
> dmd: 30 secs, 724 ms
> ldc: 30 secs, 988 ms
> gdc: 25 secs, 970 ms

Note that log is computed in software on x86, with different levels of precision and differing ability to handle corner cases depending on the implementation. It is therefore a very poor benchmarking tool.
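
One way to see this is to print a few implementations side by side for some ordinary and extreme arguments (just a sketch; the values are arbitrary and the exact digits will depend on the CPU and the C library):

import std.math, std.stdio;
import cmath = core.stdc.math;

void main()
{
	// Compare Phobos log, the fyl2x route and the C runtime's log.
	foreach (x; [2.0, 10.0, 1e-300, 1e300, double.min_normal])
	{
		real a = std.math.log(x);
		real b = yl2x(x, LN2);          // what the fyl2x-based mylog computes
		double c = cmath.log(x);
		writefln("x=%g  std.math=%.20g  yl2x=%.20g  core.stdc=%.20g", x, a, b, c);
	}
}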