June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

In @safe code you just use assignment and array ops; the backend does the rest.
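
For reference, a minimal sketch of what that looks like (names are placeholders, not from the original post):

---
@safe void copyBytes(ubyte[] dst, const(ubyte)[] src)
{
    assert(dst.length == src.length);
    dst[] = src[];  // slice assignment; the backend turns this into an efficient block copy
}
---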

On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:
> But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.

If the compiler can't get it right then who can?
June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> I'm not experienced with this kind of programming, so I'm doubting these results.  Have I done something wrong?  Am I overlooking something?

You've just discovered that one can rarely be careful enough about what is actually being benchmarked, or about gathering enough statistics.

For example, check out the following output from running your program on macOS 10.12, compiled with LDC 1.8.0:

---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs
$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs
$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs
$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec
$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest, followed by the ASM implementation.

In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.
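
For context, a byte-wise loop of roughly this shape (my reconstruction, shown with explicit parameters rather than the globals used in the benchmark) is the pattern LLVM's loop-idiom recognition rewrites into a single memcpy call:

---
void memcpyNaive(ubyte* d, const(ubyte)* s, size_t len) pure nothrow @nogc
{
    foreach (i; 0 .. len)
        d[i] = s[i];  // LLVM recognizes this idiom and emits a call to memcpy instead
}
---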

 — David


June 10, 2018
Don't C implementations already do 90% of what you want? I thought most compilers already know about and optimize these functions based on context, and that they are *special* in the eyes of the compiler. I think you are fighting a battle against 40 years of tweaking...
June 10, 2018
On 6/10/2018 5:49 AM, Mike Franklin wrote:
> [...]

One source of entropy in the results is src and dst being global variables. Global variables in D are in TLS, and TLS access can be complex (many instructions) and is influenced by the -fPIC switch. Worse, global variable access is not optimized in dmd because of aliasing problems.

The solution is to pass src, dst, and length to the copy function as function parameters (and make sure function inlining is off).
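
A rough sketch of that suggestion (names are mine; pragma(inline, false) is one way to keep the compiler from inlining the call):

---
pragma(inline, false)
void copyBlock(ubyte* dst, const(ubyte)* src, size_t length) pure nothrow @nogc
{
    foreach (i; 0 .. length)
        dst[i] = src[i];
}

// In the benchmark loop, pass locals rather than TLS globals:
// copyBlock(localDst.ptr, localSrc.ptr, localSrc.length);
---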

In light of this, I want to BEAT THE DEAD HORSE once again and assert that if the assembler generated by a benchmark is not examined, the results can be severely misleading. I've seen this happen again and again. In this case, TLS access is likely being benchmarked, not memcpy.

BTW, the relative timing of rep movsb can be highly dependent on which CPU chip you're using.
June 10, 2018
On 6/10/2018 6:45 AM, Mike Franklin wrote:
> void memcpyD()
> {
>     dst = src.dup;
> }

Note that .dup is doing a GC memory allocation.
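
For illustration, assuming dst and src are equal-length slices:

---
dst = src.dup;   // allocates a fresh array on the GC heap, then copies into it
dst[] = src[];   // no allocation; copies into dst's existing buffer
---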
June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> void memcpyASM()
> {
>     auto s = src.ptr;
>     auto d = dst.ptr;
>     size_t len = length;
>     asm pure nothrow @nogc
>     {
>         mov RSI, s;
>         mov RDI, d;
>         cld;
>         mov RCX, len;
>         rep;
>         movsb;
>     }
> }
Protip: Use SSE or AVX for even faster copying.
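
For reference, a minimal DMD-style inline-asm sketch of an SSE version (my own illustration; it assumes len is a non-zero multiple of 16 and uses unaligned movdqu loads/stores):

---
void memcpySSE()
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        mov RCX, len;
        shr RCX, 4;          // 16 bytes per iteration
    copyLoop:
        movdqu XMM0, [RSI];  // unaligned 16-byte load
        movdqu [RDI], XMM0;  // unaligned 16-byte store
        add RSI, 16;
        add RDI, 16;
        dec RCX;
        jnz copyLoop;
    }
}
---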
June 10, 2018
On 6/10/2018 11:16 AM, David Nadlinger wrote:
> Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,

Probably because it does a memory allocation.


> followed by the ASM implementation.

The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.


> In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.

This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.
June 10, 2018
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
> On 6/10/2018 11:16 AM, David Nadlinger wrote:
>> Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,
>
> Probably because it does a memory allocation.
>
>
>> followed by the ASM implementation.
>
> The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.
>
>
>> In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.
>
> This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.

On some CPU architectures (for example Intel Atom), rep movsb is the fastest memcpy.
June 10, 2018
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
> On 6/10/2018 11:16 AM, David Nadlinger wrote:
>> Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,
>
> Probably because it does a memory allocation.

Of course; that was already pointed out earlier in the thread.

> The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.

That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines. I've never measured that myself, though.
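
For reference, a rough sketch of checking that bit directly; this is my own illustration (ERMS is reported in CPUID leaf 7, sub-leaf 0, EBX bit 9), not code from the thread:

---
bool hasERMS()
{
    uint features;
    asm nothrow @nogc
    {
        mov EAX, 7;           // structured extended feature flags leaf
        xor ECX, ECX;         // sub-leaf 0
        cpuid;
        mov features, EBX;
    }
    return (features & (1 << 9)) != 0;  // bit 9: Enhanced REP MOVSB/STOSB
}
---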

 — David
June 10, 2018
On 6/10/2018 4:39 PM, David Nadlinger wrote:
> That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines.

The drama of which instruction mix is faster on which CPU never abates!