Replacing C's memcpy with a D implementation
June 10, 2018
I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

The prevailing wisdom has been that there is no way to improve on C's memcpy implementation given that it has been micro-optimized to death over several decades by many talented members of the human race.

So, I threw the following benchmark together to try to get a clue about what I was up against, and in a very short time, I beat the snot out of C's memcpy.  The benefit seems to disappear as the array sizes increase, but I believe the vast majority of calls to memcpy are probably quite small.

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;
ubyte[length] dst;
ubyte[length] src;
auto rnd = Random(42);
ubyte[] src2;
ubyte[] dst2;

void verifyResults()
{
    assert(memcmp(dst.ptr, src.ptr, length) == 0);
    assert(memcmp(dst2.ptr, src2.ptr, length) == 0);
}

void randomizeData()
{
    for(int i = 0; i < length; i++)
    {
        src[i] = uniform!ubyte;
        dst[i] = uniform!ubyte;
    }

    src2 = src;
    dst2 = dst;
}

void memcpyD()
{
    dst = src.dup;
}

void memcpyDstdAlg()
{
    copy(src2, dst2);
}

void memcpyC()
{
    memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive()
{
    for(int i = 0; i < length; i++)
    {
        dst[i] = src[i];
    }
}

void memcpyASM()
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}

void main()
{
    // verify the integrity of the algorithm
    randomizeData();
    memcpyD();
    verifyResults();

    randomizeData();
    memcpyDstdAlg();
    verifyResults();

    randomizeData();
    memcpyC();
    verifyResults();

    randomizeData();
    memcpyNaive();
    verifyResults();

    randomizeData();
    memcpyASM();
    verifyResults();

    // test the performance of the algorithm
    auto r = benchmark!(memcpyD, memcpyDstdAlg, memcpyC, memcpyNaive, memcpyASM)(1000);
    Duration memcpyDResult = r[0];
    Duration memcpyDstdAlgResult = r[1];
    Duration memcpyCResult = r[2];
    Duration memcpyNaiveResult = r[3];
    Duration memcpyASMResult = r[4];

    writeln("memcpyD: ", memcpyDResult);
    writeln("memcpyDstdAlg: ", memcpyDstdAlgResult);
    writeln("memcpyC: ", memcpyCResult);
    writeln("memcpyNaive: ", memcpyNaiveResult);
    writeln("memcpyASM: ", memcpyASMResult);
}


------ Output --------
memcpyD:         1 ms, 772 μs, and 4 hnsecs
memcpyDstdAlg: 531 μs and 8 hnsecs
memcpyC:       371 μs and 3 hnsecs
memcpyNaive:    21 ms, 572 μs, and 2 hnsecs
memcpyASM:     119 μs and 6 hnsecs



I'm not experienced with this kind of programming, so I'm doubting these results.  Have I done something wrong?  Am I overlooking something?

Thanks,
Mike

June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.
>
> [...]

what compiler? what flags?
June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

So keep in mind that memcpy is really a magical intrinsic anyway, and optimizers frequently don't actually call a function, but rather see the size and replace the instructions inline (like it might replace it with just a couple movs instead of something fancy).

And D already has it built in as well for @safe etc:

arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

so be sure to check against that too.


June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> I'm not experienced with this kind of programming, so I'm doubting these results.  Have I done something wrong?  Am I overlooking something?


Hi,

I've spent a lot of time optimizing memcpy. One of the results was that with Intel's ICC the compiler intrinsics were unbeatable.
Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM.

The reasoning is that small areas of memory (statically known) will be elided to a few byte moves (inlined), whereas the larger one will call the highly optimized C stdlib calls.
If you use ASM instead of IR, the optimization barriers and register spilling will probably make it less efficient.
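On LDC, for example, one could forward to the LLVM intrinsic directly. This is only a sketch: `llvm_memcpy` lives in `ldc.intrinsics`, and its exact signature has varied across LDC versions, so treat the parameter list as an assumption to check against your compiler.

```d
// LDC-only sketch: forwarding to LLVM's memcpy intrinsic lets the
// optimizer inline or elide the copy when the size is statically known.
version (LDC)
{
    import ldc.intrinsics : llvm_memcpy;

    void memcpyIntrinsic(void* dst, const(void)* src, size_t len)
    {
        // Assumed signature: (dst, src, len, isVolatile); check the
        // ldc.intrinsics module shipped with your LDC version.
        llvm_memcpy!size_t(dst, cast(void*) src, len, false);
    }
}
```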
June 10, 2018
On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:
> On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
>> I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.
>>
>> [...]
>
> what compiler? what flags?

dmd main.d
Arch Linux x86_64

DMD, with no flags.  I'm not compiling C's memcpy, and the significant D implementation is memcpyASM which is all inline assembly, so I don't see how the compiler flags will make any difference, anyway.

Mike
June 10, 2018
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

> And D already has it built in as well for @safe etc:
>
> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic
>
> so be sure to check against that too.

My intent is to use the D implementation in the druntime.  I'm trying to replace the runtime hooks with templates, so we don't have to rely so much on `TypeInfo` and we can begin to use more of D without linking in the runtime.

But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.  If I replace the runtime hook with a template, I need to be honest, and that means all the low-level code in druntime that the runtime hook depends on needs to be @safe, nothrow, and pure.  So I'm starting with the fundamental dependencies on the C library, trying to ensure they can be replaced with D implementations and used in D idiomatically.  For example, memcpy might look like this.

void memcpy(T)(T[] dest, T[] src);

We can extract size, alignment, etc. at compile-time.
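As a sketch of the idea (hypothetical names and strategy, not actual druntime code), such a template might dispatch on compile-time information like this:

```d
import core.stdc.string : memcpy;

// Hypothetical sketch, not actual druntime code: the template sees
// T.sizeof and T.alignof at compile time, so it can pick a copy
// strategy without any runtime dispatch.
void memcpyTemplate(T)(T[] dest, const(T)[] src) @trusted
{
    assert(dest.length == src.length);

    static if (T.sizeof <= size_t.sizeof)
    {
        // Small elements: a simple loop the optimizer can vectorize.
        foreach (i; 0 .. src.length)
            dest[i] = src[i];
    }
    else
    {
        // Larger elements: fall back to the C routine for now.
        memcpy(dest.ptr, src.ptr, T.sizeof * src.length);
    }
}
```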

So, my goal is not for user-facing code, but for improving the druntime implementations.

Does that make sense?

Mike
June 10, 2018
On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:

> Please make one that guarantee the usage of the corresponding backend intrinsic, for example on LLVM.

I tested with LDC and got similar results.  I thought the implementation in C forwarded to the backend intrinsic.  I think even LDC links with GCC, and the distribution of GCC is tuned for a given platform and architecture, so am I not already utilizing GCC's backend intrinsic?

Also, I'm not trying to find the holy grail of memcpy with this endeavor.  I just want to find something as good or better than what the druntime is currently using, but implemented in D.  Even if it's not optimal, if it's as good as what druntime is currently using, I want to replace those calls in the runtime with a D implementation.

Mike


June 10, 2018
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

void memcpyD()
{
    dst = src.dup;
}

void memcpyD2()
{
    dst[] = src[];
}

-----
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two?  If you can't tell, I'm actually not a very good D programmer.

Mike



June 11, 2018
On 11/06/2018 1:45 AM, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
> 
>> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic
> 
> void memcpyD()
> {
>      dst = src.dup;

malloc (for slice not static array)

> }
> 
> void memcpyD2()
> {
>      dst[] = src[];

memcpy

> }
> 
> -----
> memcpyD: 1 ms, 725 μs, and 1 hnsec
> memcpyD2: 587 μs and 5 hnsecs
> memcpyASM: 119 μs and 5 hnsecs
> 
> Still, the ASM version is much faster.
> 
> btw, what's the difference between the two?  If you can't tell, I'm actually not a very good D programmer.
> 
> Mike

June 10, 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
>
>> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic
>
> void memcpyD()
> {
>     dst = src.dup;
> }
>
> void memcpyD2()
> {
>     dst[] = src[];
> }
>
> -----
> memcpyD: 1 ms, 725 μs, and 1 hnsec
> memcpyD2: 587 μs and 5 hnsecs
> memcpyASM: 119 μs and 5 hnsecs
>
> Still, the ASM version is much faster.
>
> btw, what's the difference between the two?  If you can't tell, I'm actually not a very good D programmer.
>
> Mike

I would increase the test data size so that each run takes at least a few seconds.
Otherwise the benchmark won't tell you much, because it's subject to too much randomness.
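For instance, a standalone sketch of that suggestion (the iteration count of 1_000_000 is an arbitrary assumption, chosen only to push each measured total into the seconds range):

```d
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

enum length = 4096 * 2;
ubyte[length] dst, src;

void copySlice() { dst[] = src[]; }

void main()
{
    // With enough iterations, timer granularity and scheduler noise
    // become a small fraction of the measured total.
    auto r = benchmark!copySlice(1_000_000);
    writeln("copySlice: ", r[0]);
}
```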