Thread overview
drastic slowdown for copies
May 28, 2015
Momo
May 28, 2015
Adam D. Ruppe
May 28, 2015
Timon Gehr
May 28, 2015
Momo
May 29, 2015
Momo
May 29, 2015
Ali Çehreli
May 29, 2015
thedeemon
May 29, 2015
thedeemon
May 29, 2015
thedeemon
May 29, 2015
Momo
May 28, 2015
I'm currently investigating the speed difference between references and copies. It seems that copies suffer an immense slowdown once they reach a size of >= 20 bytes.
In the code below you can see that if my struct has a size of < 20 bytes (e.g. 4 ints = 16 bytes), a copy is cheaper than a reference. But with 5 ints (= 20 bytes) the copy becomes ~3 times slower. I got these results:

16 bytes:
by ref: 49
by copy: 34
by move: 32

20 bytes:
by ref: 51
by copy: 104
by move: 103

My question is: why?

My system is Win 8.1, 64 Bit and I'm using dmd 2.067.1 (32 bit)

Code:

import std.stdio;
import std.datetime;

struct S {
    int[4] values;
}

pragma(msg, S.sizeof);

void by_ref(ref const S s) {

}

void by_copy(const S s) {

}

enum size_t Loops = 10_000_000;

void main() {
    StopWatch sw;

    sw.start();
    for (size_t i = 0; i < Loops; i++) {
        S s = S();
        by_ref(s);
    }
    sw.stop();

    writeln("by ref: ", sw.peek().msecs);

    sw.reset();

    sw.start();
    for (size_t i = 0; i < Loops; i++) {
        S s = S();
        by_copy(s);
    }
    sw.stop();

    writeln("by copy: ", sw.peek().msecs);

    sw.reset();

    sw.start();
    for (size_t i = 0; i < Loops; i++) {
        by_copy(S());
    }
    sw.stop();

    writeln("by move: ", sw.peek().msecs);
}
May 28, 2015
16 bytes is 64 bit - the same size as a reference. So copying it is overall a bit less work - sending a 64 bit struct is as small as a 64 bit reference and you don't go through the pointer.

So up to them, it is a bit faster.


Add another byte and now the copy is too big to fit in a register, so it needs to spill over into somewhere else which means a bunch more work for the cpu.
May 28, 2015
On 05/28/2015 11:27 PM, Adam D. Ruppe wrote:
> 16 bytes is 64 bit

It's actually 128 bits.
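
The size arithmetic can be checked at compile time. This is a minimal sketch; the names S4 and S5 are hypothetical stand-ins for the 4-int and 5-int variants of the struct from the first post. It only verifies the sizes and makes no claim about how dmd actually passes either struct.

```d
import std.stdio : writefln;

// Hypothetical names for the two struct variants discussed in the thread.
struct S4 { int[4] values; }
struct S5 { int[5] values; }

// An int is 4 bytes in D, so 4 ints are 16 bytes (128 bits)
// and 5 ints are 20 bytes (160 bits).
static assert(S4.sizeof == 16 && S4.sizeof * 8 == 128);
static assert(S5.sizeof == 20 && S5.sizeof * 8 == 160);

void main()
{
    writefln("S4: %s bytes = %s bits", S4.sizeof, S4.sizeof * 8);
    writefln("S5: %s bytes = %s bits", S5.sizeof, S5.sizeof * 8);
    // The pointer size depends on the target: 4 bytes for a 32-bit build,
    // 8 bytes for a 64-bit build.
    writefln("pointer: %s bytes", (void*).sizeof);
}
```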
May 28, 2015
On Thursday, 28 May 2015 at 21:27:42 UTC, Adam D. Ruppe wrote:
> 16 bytes is 64 bit - the same size as a reference. So copying it is overall a bit less work - sending a 64 bit struct is as small as a 64 bit reference and you don't go through the pointer.
>
> So up to them, it is a bit faster.
>
>
> Add another byte and now the copy is too big to fit in a register, so it needs to spill over into somewhere else which means a bunch more work for the cpu.

But even in release mode (and with optimizations turned on) it is > 3 times slower. Can I somehow enforce references, like in C++? I have already tried in ref, const ref and immutable ref; nothing works.
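
For readers unfamiliar with the parameter forms mentioned here, this is a rough sketch of two ways D expresses by-reference passing. The function names firstByRef and passedByRef are made up for illustration, and the sketch says nothing about which form is faster on a given compiler or CPU.

```d
import std.stdio : writeln;

struct S { int[5] values; }

// A plain ref parameter: the callee works on the caller's object.
// By default it binds only to lvalues, so firstByRef(S()) is rejected.
int firstByRef(ref const S s) { return s.values[0]; }

// auto ref is only available in templates: lvalues are bound by
// reference, rvalues are taken by value. __traits(isRef, s) reports
// which of the two instantiations was chosen.
bool passedByRef(T)(auto ref const T s) { return __traits(isRef, s); }

void main()
{
    S s;
    s.values[0] = 42;
    writeln(firstByRef(s));     // 42
    writeln(passedByRef(s));    // true: s is an lvalue
    writeln(passedByRef(S()));  // false: S() is an rvalue
}
```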
May 29, 2015
On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:
> I'm currently investigating the speed difference between references and copies. It seems that copies suffer an immense slowdown once they reach a size of >= 20 bytes.

This is processor-specific; on different CPU models you might get different results. Here's what I see running your program with 4 and 5 ints in the struct:

C:\prog\D>dmd copyref.d -ofcopyref.exe -release -O -inline
16u

C:\prog\D>copyref.exe
by ref: 18
by copy: 85
by move: 84

C:\prog\D>copyref.exe
by ref: 18
by copy: 72
by move: 72

C:\prog\D>copyref.exe
by ref: 16
by copy: 72
by move: 72

C:\prog\D>dmd copyref.d -ofcopyref.exe -release -O -inline
20u

C:\prog\D>copyref.exe
by ref: 23
by copy: 98
by move: 91

C:\prog\D>copyref.exe
by ref: 20
by copy: 91
by move: 102

C:\prog\D>copyref.exe
by ref: 23
by copy: 91
by move: 91

I see these numbers on an old Core 2 Quad and very similar ones on a Core i3. So your findings are not reproducible.
May 29, 2015
On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:

Ah, actually it's more complicated, as it depends on inlining a lot.
Indeed, without -O and -inline I was able to get by_ref to be slightly slower than by_copy for a struct of 4 ints. But when inlining is turned on, the numbers change in different directions. And for 5 ints the influence of inlining is quite different:

4 ints:             5 ints:
-release
by ref: 53          by ref: 53
by copy: 57         by copy: 137
by move: 54         by move: 137

-release -O
by ref: 38          by ref: 34
by copy: 54         by copy: 137
by move: 49         by move: 137

-release -O -inline
by ref: 15          by ref: 20
by copy: 72         by copy: 91
by move: 72         by move: 91
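
Since the numbers above swing with inlining, one way to take that variable out of the measurement is sketched below. It rests on assumptions that go beyond the thread: a compiler newer than the dmd 2.067.1 used in the first post (pragma(inline, false) and std.datetime.stopwatch were added later), and hypothetical functions byRef and byCopy that return a value so the calls cannot be dropped as dead code.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

struct S { int[5] values; }

// Ask the compiler not to inline these calls, so that both variants
// are measured as real function calls regardless of -inline.
pragma(inline, false)
int byRef(ref const S s) { return s.values[0]; }

pragma(inline, false)
int byCopy(const S s) { return s.values[0]; }

enum size_t loops = 10_000_000;

void main()
{
    long sink; // accumulates results so the calls have a visible effect
    S s;
    s.values[0] = 1;

    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. loops)
        sink += byRef(s);
    writeln("by ref:  ", sw.peek.total!"msecs");

    sw.reset(); // resets the elapsed time; the stopwatch keeps running
    foreach (i; 0 .. loops)
        sink += byCopy(s);
    writeln("by copy: ", sw.peek.total!"msecs");

    writeln("sink: ", sink);
}
```

The timings will still differ between CPUs and compiler versions, as the tables above show; the sketch only keeps inlining from being one of the differences.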

May 29, 2015
On Friday, 29 May 2015 at 07:51:31 UTC, thedeemon wrote:

Above was on Core 2 Quad,

here's for Core i3:

4 ints          5 ints
-release
by ref: 67      by ref: 66
by copy: 44     by copy: 142
by move: 45     by move: 137

-release -O
by ref: 29      by ref: 29
by copy: 41     by copy: 141
by move: 40     by move: 142

-release -O -inline
by ref: 16      by ref: 20
by copy: 83     by copy: 104
by move: 83     by move: 104
May 29, 2015
On Friday, 29 May 2015 at 07:51:31 UTC, thedeemon wrote:
> On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:
>
> Ah, actually it's more complicated, as it depends on inlining a lot.
Yes. And real functions are more complex, so inlining is not a reliable option.
> Indeed, without -O and -inline I was able to get by_ref to be slightly slower than by_copy for a struct of 4 ints. But when inlining is turned on, the numbers change in different directions. And for 5 ints the influence of inlining is quite different:
>
> 4 ints:             5 ints:
> -release
> by ref: 53          by ref: 53
> by copy: 57         by copy: 137
> by move: 54         by move: 137
>
> -release -O
> by ref: 38          by ref: 34
> by copy: 54         by copy: 137
> by move: 49         by move: 137
>
> -release -O -inline
> by ref: 15          by ref: 20
> by copy: 72         by copy: 91
> by move: 72         by move: 91

So as you can see, it is 2-3 times slower. Is there an alternative?
May 29, 2015
Perhaps you can give me another detailed answer.
I get a slowdown for all parts (ref, copy and move) if I use uninitialized floats. I got these results from the following code:

by ref:  2369
by copy: 2335
by move: 2341

Code:

struct vec2f {
    float x;
    float y;
}

But if I assign 0 to them I get these results:

by ref:  49
by copy: 22
by move: 25

Why?
May 29, 2015
On 05/29/2015 06:55 AM, Momo wrote:
> Perhaps you can give me another detailed answer.
> I get a slowdown for all parts (ref, copy and move) if I use
> uninitialized floats.

Floating point variables are initialized to .nan of their types (e.g. float.nan). Apparently, the CPU is slow when using those special values:


http://stackoverflow.com/questions/3606054/how-slow-is-nan-arithmetic-in-the-intel-x64-fpu

Ali
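
The default initialization can be observed directly. This is a small sketch; the struct names Vec2fDefault and Vec2fZero are hypothetical variants of the vec2f struct above. It shows the initial values only and does not measure the speed difference.

```d
import std.math : isNaN;
import std.stdio : writeln;

// Same layout as vec2f above: no initializers, so both fields
// start out as float.nan.
struct Vec2fDefault { float x; float y; }

// Explicit initializers replace the NaN default with 0.
struct Vec2fZero { float x = 0; float y = 0; }

void main()
{
    Vec2fDefault a;
    Vec2fZero b;

    assert(isNaN(a.x) && isNaN(a.y));
    assert(b.x == 0 && b.y == 0);

    writeln("default: ", a.x, ", zeroed: ", b.x);
}
```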