drastic slowdown for copies

May 28, 2015

Momo

Posted by Momo

I'm currently investigating the difference in speed between references and copies. And it seems that copies get an immense slowdown once they reach a size of >= 20 bytes.
In the code below you can see that if my struct has a size of < 20 bytes (e.g. 4 ints = 16 bytes), a copy is cheaper than a reference. But with 5 ints (= 20 bytes) it gets about 3 times slower. I got these results:

16 bytes:
by ref: 49
by copy: 34
by move: 32

20 bytes:
by ref: 51
by copy: 104
by move: 103

My question is: why?

My system is Win 8.1, 64 Bit and I'm using dmd 2.067.1 (32 bit)

Code:

import std.stdio;
import std.datetime;

struct S {
    int[4] values;
}

pragma(msg, S.sizeof);

void by_ref(ref const S s) {

}

void by_copy(const S s) {

}

enum size_t Loops = 10_000_000;

void main() {
    StopWatch sw;

    sw.start();
    for (size_t i = 0; i < Loops; i++) {
        S s = S();
        by_ref(s);
    }
    sw.stop();

    writeln("by ref: ", sw.peek().msecs);

    sw.reset();

    sw.start();
    for (size_t i = 0; i < Loops; i++) {
        S s = S();
        by_copy(s);
    }
    sw.stop();

    writeln("by copy: ", sw.peek().msecs);

    sw.reset();

    sw.start();
    for (size_t i = 0; i < Loops; i++) {
        by_copy(S());
    }
    sw.stop();

    writeln("by move: ", sw.peek().msecs);
}

16 bytes is 64 bit - the same size as a reference. So copying it is overall a bit less work - sending a 64 bit struct is as small as a 64 bit reference and you don't go through the pointer.

So up to then, it is a bit faster.

Add another byte and now the copy is too big to fit in a register, so it needs to spill over into somewhere else, which means a bunch more work for the CPU.
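The sizes under discussion can be checked directly. The following is a minimal sketch, not part of the original posts; the names S4 and S5 are hypothetical stand-ins for the 4-int and 5-int variants of S. Note that 16 bytes is 128 bits, and that how a struct of a given size is actually passed (registers or stack) depends on the target ABI, which this sketch does not show.

```d
import std.stdio : writeln;

// Hypothetical names for the two struct variants measured in the thread.
struct S4 { int[4] values; }
struct S5 { int[5] values; }

// An int is 4 bytes in D, so these sizes are known at compile time.
static assert(S4.sizeof == 16);
static assert(S5.sizeof == 20);

void main() {
    writeln("S4: ", S4.sizeof, " bytes = ", S4.sizeof * 8, " bits");
    writeln("S5: ", S5.sizeof, " bytes = ", S5.sizeof * 8, " bits");
    // Pointer size depends on the target: 4 bytes for a 32-bit build,
    // 8 bytes for a 64-bit build.
    writeln("pointer: ", (void*).sizeof, " bytes");
}
```

Building the same file for 32-bit and 64-bit targets shows the pointer size changing while the struct sizes stay the same.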

On Thursday, 28 May 2015 at 21:27:42 UTC, Adam D. Ruppe wrote:
> 16 bytes is 64 bit - the same size as a reference. So copying it is overall a bit less work - sending a 64 bit struct is as small as a 64 bit reference and you don't go through the pointer.
>
> So up to then, it is a bit faster.
>
> Add another byte and now the copy is too big to fit in a register, so it needs to spill over into somewhere else which means a bunch more work for the cpu.

But even in release mode (and with optimizations turned on) it is more than 3 times slower. Can I somehow enforce references, like in C++? I already tried in ref, const ref and immutable ref; nothing works.
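One option that is sometimes suggested for this kind of question is an `auto ref` parameter on a template function, which binds lvalues by reference and takes rvalues by value. The sketch below is an illustration rather than something proposed in the thread; the function name passedByRef is hypothetical, and whether this helps the timings above is not measured here.

```d
import std.stdio : writeln;

struct S { int[5] values; }

// Template function (note the empty template parameter list):
// `auto ref` only works on templates. Lvalue arguments are bound
// by reference, rvalue arguments are passed by value.
bool passedByRef()(auto ref const S s) {
    // __traits(isRef, s) reports how this instantiation received s.
    return __traits(isRef, s);
}

void main() {
    S s;
    writeln(passedByRef(s));    // lvalue argument
    writeln(passedByRef(S()));  // rvalue argument
}
```

A plain `ref const S` parameter, as in by_ref above, already forces a reference for lvalues; `auto ref` additionally lets the same function accept temporaries such as S().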

On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:
> I'm currently investigating the difference in speed between references and copies. And it seems that copies get an immense slowdown once they reach a size of >= 20 bytes.

This is processor-specific; on different CPU models you might get different results. Here's what I see running your program with 4 and 5 ints in the struct:

C:\prog\D>dmd copyref.d -ofcopyref.exe -release -O -inline
16u

C:\prog\D>copyref.exe
by ref: 18
by copy: 85
by move: 84

C:\prog\D>copyref.exe
by ref: 18
by copy: 72
by move: 72

C:\prog\D>copyref.exe
by ref: 16
by copy: 72
by move: 72

C:\prog\D>dmd copyref.d -ofcopyref.exe -release -O -inline
20u

C:\prog\D>copyref.exe
by ref: 23
by copy: 98
by move: 91

C:\prog\D>copyref.exe
by ref: 20
by copy: 91
by move: 102

C:\prog\D>copyref.exe
by ref: 23
by copy: 91
by move: 91

I see these numbers on an old Core 2 Quad and very similar ones on a Core i3. So your findings are not reproducible.

On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:

Ah, actually it's more complicated, as it depends on inlining a lot. Indeed, without -O and -inline I was able to get by_ref to be slightly slower than by_copy for a struct of 4 ints. But when inlining turns on, the numbers change in different directions. And for 5 ints the influence of inlining is quite different:

                        4 ints:         5 ints:
-release
                        by ref: 53      by ref: 53
                        by copy: 57     by copy: 137
                        by move: 54     by move: 137

-release -O
                        by ref: 38      by ref: 34
                        by copy: 54     by copy: 137
                        by move: 49     by move: 137

-release -O -inline
                        by ref: 15      by ref: 20
                        by copy: 72     by copy: 91
                        by move: 72     by move: 91

On Friday, 29 May 2015 at 07:51:31 UTC, thedeemon wrote:

Above was on Core 2 Quad, here's for Core i3:

                        4 ints:         5 ints:
-release
                        by ref: 67      by ref: 66
                        by copy: 44     by copy: 142
                        by move: 45     by move: 137

-release -O
                        by ref: 29      by ref: 29
                        by copy: 41     by copy: 141
                        by move: 40     by move: 142

-release -O -inline
                        by ref: 16      by ref: 20
                        by copy: 83     by copy: 104
                        by move: 83     by move: 104
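Since the numbers above shift so much with -inline, one way to take inlining out of the comparison is to mark the measured functions with pragma(inline, false). The following is a rough sketch under assumptions that differ from the thread's setup: it assumes a recent compiler where StopWatch lives in std.datetime.stopwatch (the original code for dmd 2.067.1 imports it from std.datetime), and the names S5, byRef and byCopy are hypothetical. No timings are claimed for it.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

struct S5 { int[5] values; }

// Ask the compiler not to inline these calls, so the parameter
// passing being measured is not optimized away by inlining.
pragma(inline, false)
void byRef(ref const S5 s) {}

pragma(inline, false)
void byCopy(const S5 s) {}

enum size_t loops = 10_000_000;

void main() {
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. loops) {
        S5 s;
        byRef(s);
    }
    writeln("by ref:  ", sw.peek.total!"msecs");

    // reset() zeroes the elapsed time; the stopwatch keeps running.
    sw.reset();
    foreach (i; 0 .. loops) {
        S5 s;
        byCopy(s);
    }
    writeln("by copy: ", sw.peek.total!"msecs");
}
```

Running it with and without -O should give figures that can be compared against the tables above, with the caveat that they will still vary by CPU and compiler version.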

On Friday, 29 May 2015 at 07:51:31 UTC, thedeemon wrote:
> On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:
>
> Ah, actually it's more complicated, as it depends on inlining a lot.

Yes. And real functions are more complex, and inlining is not a reliable option.

> Indeed, without -O and -inline I was able to get by_ref to be slightly slower than by_copy for struct of 4 ints. But when inlining turns on, the numbers change in different directions. And for 5 ints inlining influence is quite different:
>
>                         4 ints:         5 ints:
> -release
>                         by ref: 53      by ref: 53
>                         by copy: 57     by copy: 137
>                         by move: 54     by move: 137
>
> -release -O
>                         by ref: 38      by ref: 34
>                         by copy: 54     by copy: 137
>                         by move: 49     by move: 137
>
> -release -O -inline
>                         by ref: 15      by ref: 20
>                         by copy: 72     by copy: 91
>                         by move: 72     by move: 91

So as you can see, it is 2-3 times slower. Is there an alternative?

Perhaps you can give me another detailed answer.
I get a slowdown for all parts (ref, copy and move) if I use uninitialized floats. I got these results from the following code:

by ref: 2369
by copy: 2335
by move: 2341

Code:

struct vec2f {
    float x;
    float y;
}

But if I assign 0 to them I got these results:

by ref: 49
by copy: 22
by move: 25

Why?

On 05/29/2015 06:55 AM, Momo wrote:
> Perhaps you can give me another detailed answer.
> I get a slowdown for all parts (ref, copy and move) if I use
> uninitialized floats.

Floating point variables are initialized to .nan of their types (e.g. float.nan). Apparently, the CPU is slow when using those special values:

http://stackoverflow.com/questions/3606054/how-slow-is-nan-arithmetic-in-the-intel-x64-fpu

Ali
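The default initialization described above can be seen with a small sketch. The struct names Vec2fDefault and Vec2fZero are hypothetical; the first mirrors vec2f from the question, the second gives the fields explicit zero initializers, which is one way to read "assign 0 to them". It only demonstrates the initial values and does not measure speed.

```d
import std.math : isNaN;
import std.stdio : writeln;

// Fields without initializers: D default-initializes floats to float.nan.
struct Vec2fDefault {
    float x;
    float y;
}

// Fields with explicit initializers start at zero instead of NaN.
struct Vec2fZero {
    float x = 0;
    float y = 0;
}

void main() {
    Vec2fDefault a;
    Vec2fZero b;
    writeln(isNaN(a.x), " ", isNaN(a.y)); // prints "true true"
    writeln(b.x, " ", b.y);               // prints "0 0"
}
```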
