June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike Franklin | I've modified the test based on the feedback so far, so here's what it looks like now: import std.datetime.stopwatch; import std.stdio; import core.stdc.string; import std.random; import std.algorithm; enum length = 4096 * 2; void init(ref ubyte[] a) { a.length = length; for(int i = 0; i < length; i++) { a[i] = uniform!ubyte; } } void verifyResults(ubyte[] a, ubyte[] b) { assert(memcmp(a.ptr, b.ptr, length) == 0); } void memcpyD(ubyte[] dst, ubyte[] src) { dst[] = src[]; } void memcpyDstdAlg(ubyte[] dst, ubyte[] src) { copy(src, dst); } void memcpyC(ubyte[] dst, ubyte[] src) { memcpy(dst.ptr, src.ptr, length); } void memcpyNaive(ubyte[] dst, ubyte[] src) { for(int i = 0; i < length; i++) { dst[i] = src[i]; } } void memcpyASM(ubyte[] dst, ubyte[] src) { auto s = src.ptr; auto d = dst.ptr; size_t len = length; asm pure nothrow @nogc { mov RSI, s; mov RDI, d; cld; mov RCX, len; rep; movsb; } } Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n) { Duration result; auto sw = StopWatch(AutoStart.yes); sw.reset(); foreach (_; 0 .. n) { f(dst, src); } result = sw.peek(); return result; } void main() { ubyte[] src; ubyte[] dst; // verify the integrity of the algorithm init(src); init(dst); memcpyD(dst, src); verifyResults(dst, src); init(src); init(dst); memcpyDstdAlg(dst, src); verifyResults(dst, src); init(src); init(dst); memcpyC(dst, src); verifyResults(dst, src); init(src); init(dst); memcpyNaive(dst, src); verifyResults(dst, src); init(src); init(dst); memcpyASM(dst, src); verifyResults(dst, src); // test the performance of the algorithm enum iterations = 1000; writeln("memcpyD: ", benchmark!memcpyD(dst, src, iterations)); writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, src, iterations)); writeln("memcpyC: ", benchmark!memcpyC(dst, src, iterations)); writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, iterations)); writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, iterations)); } The results on my Windows 10 machine (Intel Core i7-6700, 3.4GHz): memcpyD: 127 ╬╝s and 3 hnsecs memcpyDstdAlg: 195 ╬╝s and 9 hnsecs memcpyC: 126 ╬╝s and 7 hnsecs memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs memcpyASM: 122 ╬╝s and 8 hnsecs (Gotta love how windows displays μ) The results running on Arch Linux 64-bit in a VirtualBox on the same Windows 10 machine: memcpyD: 409 μs memcpyDstdAlg: 400 μs memcpyC: 404 μs and 4 hnsecs memcpyNaive: 17 ms, 251 μs, and 6 hnsecs memcpyASM: 162 μs and 8 hnsecs The results appear more sane now, but it seems the behavior is highly platform dependent. Still the ASM is doing well for my hardware. If I run the test multiple times, I do see a lot of noise in the results, but each test seems to be affected proportionally, so I'm gaining a little more confidence in the benchmark. I still need to analyze the assembly of C's memcpy (anyone know where I can find the source code?), test on more platforms, and test varying sizes, but I'm just collecting some initial data right now, to learn how to proceed. I'd be interested in those with other platforms reporting back their results for their hardware, and of course suggestions for how to meet or beat C's memcpy with a pure D implementation. Thanks for all the feedback so far. Mike |
June 10, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On 06/10/2018 08:01 PM, Walter Bright wrote:
> On 6/10/2018 4:39 PM, David Nadlinger wrote:
>> That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache
>
> The drama of which instruction mix is faster on which CPU never abates!
In many ways, I really miss 80's machine architecture ;) So simple.
|
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kagamin | On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote: > On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote: >> There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code. > > In safe code you just use assignment and array ops, backend does the rest. > > On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote: >> But one think I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us. > > If the compiler can't get it right then who can? The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved. Mike |
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike Franklin | On Monday, 11 June 2018 at 02:49:00 UTC, Mike Franklin wrote: > The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved. Also, understand that this lowering happens in the IR generation stage: https://github.com/dlang/dmd/blob/3a79629988efd51d4dda9edb38a6701cd097da89/src/dmd/e2ir.d#L2616 so semantic analysis is not performed on the runtime call. If it were, it would fail to compile. Mike |
June 10, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike Franklin | On 6/10/2018 7:49 PM, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
>> If the compiler can't get it right then who can?
> The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.
It's only faulty if there's a bug in it. It's essentially @trusted code.
|
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike Franklin | On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote: > I've modified the test based on the feedback so far, so here's what it looks like now: > > import std.datetime.stopwatch; > import std.stdio; > import core.stdc.string; > import std.random; > import std.algorithm; > > enum length = 4096 * 2; > > void init(ref ubyte[] a) > { > a.length = length; > > for(int i = 0; i < length; i++) > { > a[i] = uniform!ubyte; > } > } > > void verifyResults(ubyte[] a, ubyte[] b) > { > assert(memcmp(a.ptr, b.ptr, length) == 0); > } > > void memcpyD(ubyte[] dst, ubyte[] src) > { > dst[] = src[]; > } > > void memcpyDstdAlg(ubyte[] dst, ubyte[] src) > { > copy(src, dst); > } > > void memcpyC(ubyte[] dst, ubyte[] src) > { > memcpy(dst.ptr, src.ptr, length); > } > > void memcpyNaive(ubyte[] dst, ubyte[] src) > { > for(int i = 0; i < length; i++) > { > dst[i] = src[i]; > } > } > > void memcpyASM(ubyte[] dst, ubyte[] src) > { > auto s = src.ptr; > auto d = dst.ptr; > size_t len = length; > asm pure nothrow @nogc > { > mov RSI, s; > mov RDI, d; > cld; > mov RCX, len; > rep; > movsb; > } > } > > Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n) > { > Duration result; > auto sw = StopWatch(AutoStart.yes); > > sw.reset(); > foreach (_; 0 .. n) > { > f(dst, src); > } > result = sw.peek(); > > return result; > } > > void main() > { > ubyte[] src; > ubyte[] dst; > > // verify the integrity of the algorithm > init(src); > init(dst); > memcpyD(dst, src); > verifyResults(dst, src); > > init(src); > init(dst); > memcpyDstdAlg(dst, src); > verifyResults(dst, src); > > init(src); > init(dst); > memcpyC(dst, src); > verifyResults(dst, src); > > init(src); > init(dst); > memcpyNaive(dst, src); > verifyResults(dst, src); > > init(src); > init(dst); > memcpyASM(dst, src); > verifyResults(dst, src); > > // test the performance of the algorithm > enum iterations = 1000; > writeln("memcpyD: ", benchmark!memcpyD(dst, src, iterations)); > writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, src, iterations)); > writeln("memcpyC: ", benchmark!memcpyC(dst, src, iterations)); > writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, iterations)); > writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, iterations)); > } > > The results on my Windows 10 machine (Intel Core i7-6700, 3.4GHz): > memcpyD: 127 ╬╝s and 3 hnsecs > memcpyDstdAlg: 195 ╬╝s and 9 hnsecs > memcpyC: 126 ╬╝s and 7 hnsecs > memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs > memcpyASM: 122 ╬╝s and 8 hnsecs > (Gotta love how windows displays μ) > > The results running on Arch Linux 64-bit in a VirtualBox on the same Windows 10 machine: > memcpyD: 409 μs > memcpyDstdAlg: 400 μs > memcpyC: 404 μs and 4 hnsecs > memcpyNaive: 17 ms, 251 μs, and 6 hnsecs > memcpyASM: 162 μs and 8 hnsecs > > The results appear more sane now, but it seems the behavior is highly platform dependent. Still the ASM is doing well for my hardware. If I run the test multiple times, I do see a lot of noise in the results, but each test seems to be affected proportionally, so I'm gaining a little more confidence in the benchmark. > > I still need to analyze the assembly of C's memcpy (anyone know where I can find the source code?), - default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C - default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c - not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c |
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Monday, 11 June 2018 at 03:31:05 UTC, Walter Bright wrote:
> On 6/10/2018 7:49 PM, Mike Franklin wrote:
>> On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
>>> If the compiler can't get it right then who can?
>> The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.
>
> It's only faulty if there's a bug in it. It's essentially @trusted code.
That only addresses the @safe attribute, and that code is much too complex for anyone to audit it and certify it as safe.
Exceptions are also not all handled, so there is no way it can pass as nothrow.
The runtime call needs to be replaced with a template that can take advantage of D's compile-time type information instead of `TypeInfo`, but doing so will force the implementation through semantic processing, which, in it's current state, will fail to compile. It needs to be refactored, and creating @safe, nothrow, pure building blocks will help with that, hence the topic of this thread.
Mike
|
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike Franklin | On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote: > On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote: > memcpyD: 1 ms, 725 μs, and 1 hnsec > memcpyD2: 587 μs and 5 hnsecs > memcpyASM: 119 μs and 5 hnsecs > > Still, the ASM version is much faster. > rep movsd is very CPU dependend and needs some precondtions to be fast. For relative short memory blocks it sucks on many other CPU than the last Intel. See what Agner Fog has to say about it: 16.10 String instructions (all processors) String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors. REP MOVSD andREP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORDin 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In many cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors. See page 164 for details. Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI +WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode. On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time . This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met). Under these conditions, the speed is as high as you can obtain with vector register moves or even faster on some processors. While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases. If the above conditions for fast move are not met then there is a lot to gain by using other methods. See page 164 for alternatives to REP MOVS |
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Basile B. | On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote: > On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote: >> [...] > > - default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C > - default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c > - not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c D slice style memcpy cg: https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526 |
June 11, 2018 Re: Replacing C's memcpy with a D implementation | ||||
---|---|---|---|---|
| ||||
Posted in reply to Basile B. | On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote: > On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote: >> [...] > > - default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C > - default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c > - not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c D slice style memcpy cg: https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526 |
Copyright © 1999-2021 by the D Language Foundation