Replacing C's memcpy with a D implementation
June 10, 2018
I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

The prevailing wisdom has been that there is no way to improve on C's memcpy implementation given that it has been micro-optimized to death over several decades by many talented members of the human race.

So, I threw the following benchmark together to try to get a clue about what I was up against, and in a very short time, I beat the snot out of C's memcpy.  The benefit seems to disappear as the array sizes increase, but I believe the vast majority of calls to memcpy are probably quite small.

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;
ubyte[length] dst;
ubyte[length] src;
auto rnd = Random(42);
ubyte[] src2;
ubyte[] dst2;

void verifyResults()
{
    assert(memcmp(dst.ptr, src.ptr, length) == 0);
    assert(memcmp(dst2.ptr, src2.ptr, length) == 0);
}

void randomizeData()
{
    for(int i = 0; i < length; i++)
    {
        src[i] = uniform!ubyte;
        dst[i] = uniform!ubyte;
    }

    src2 = src;
    dst2 = dst;
}

void memcpyD()
{
    dst = src.dup;
}

void memcpyDstdAlg()
{
    copy(src2, dst2);
}

void memcpyC()
{
    memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive()
{
    for(int i = 0; i < length; i++)
    {
        dst[i] = src[i];
    }
}

void memcpyASM()
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}

void main()
{
    // verify the integrity of the algorithm
    randomizeData();
    memcpyD();
    verifyResults();

    randomizeData();
    memcpyDstdAlg();
    verifyResults();

    randomizeData();
    memcpyC();
    verifyResults();

    randomizeData();
    memcpyNaive();
    verifyResults();

    randomizeData();
    memcpyASM();
    verifyResults();

    // test the performance of the algorithm
    auto r = benchmark!(memcpyD, memcpyDstdAlg, memcpyC, memcpyNaive, memcpyASM)(1000);
    Duration memcpyDResult = r[0];
    Duration memcpyDstdAlgResult = r[1];
    Duration memcpyCResult = r[2];
    Duration memcpyNaiveResult = r[3];
    Duration memcpyASMResult = r[4];

    writeln("memcpyD: ", memcpyDResult);
    writeln("memcpyDstdAlg: ", memcpyDstdAlgResult);
    writeln("memcpyC: ", memcpyCResult);
    writeln("memcpyNaive: ", memcpyNaiveResult);
    writeln("memcpyASM: ", memcpyASMResult);
}


------ Output --------
memcpyD:         1 ms, 772 μs, and 4 hnsecs
memcpyDstdAlg: 531 μs and 8 hnsecs
memcpyC:       371 μs and 3 hnsecs
memcpyNaive:    21 ms, 572 μs, and 2 hnsecs
memcpyASM:     119 μs and 6 hnsecs



I'm not experienced with this kind of programming, so I'm doubting these results.  Have I done something wrong?  Am I overlooking something?

Thanks,
Mike

June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.
>
> [...]

what compiler? what flags?
June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

So keep in mind that memcpy is really a magical intrinsic anyway, and optimizers frequently don't actually call a function, but rather see the size and replace the instructions inline (like it might replace it with just a couple movs instead of something fancy).

And D already has it built in as well for @safe etc:

arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

so be sure to check against that too.


June 10, 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> I'm not experienced with this kind of programming, so I'm doubting these results.  Have I done something wrong?  Am I overlooking something?


Hi,

I've spent a lot of time optimizing memcpy. One of the results was that with Intel's ICC the compiler intrinsics were unbeatable.
Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM.

The reasoning is that small areas of memory (statically known) will be elided to a few byte moves (inlined), whereas the larger one will call the highly optimized C stdlib calls.
If you use ASM instead of IR, the optimization barriers and register spilling will probably make it less efficient.
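On LDC, for example, one could forward to the LLVM intrinsic directly. This is only a sketch: `llvm_memcpy` lives in `ldc.intrinsics`, and its exact signature has varied across LDC versions, so treat the parameter list as an assumption to check against your compiler.

```d
// LDC-only sketch: forwarding to LLVM's memcpy intrinsic lets the
// optimizer inline or elide the copy when the size is statically known.
version (LDC)
{
    import ldc.intrinsics : llvm_memcpy;

    void memcpyIntrinsic(void* dst, const(void)* src, size_t len)
    {
        // Assumed signature: (dst, src, len, isVolatile); check the
        // ldc.intrinsics module shipped with your LDC version.
        llvm_memcpy!size_t(dst, cast(void*) src, len, false);
    }
}
```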
June 10, 2018
On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:
> On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
>> I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations.  There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.
>>
>> [...]
>
> what compiler? what flags?

dmd main.d
Arch Linux x86_64

DMD, with no flags.  I'm not compiling C's memcpy, and the significant D implementation is memcpyASM which is all inline assembly, so I don't see how the compiler flags will make any difference, anyway.

Mike
June 10, 2018
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

> And D already has it built in as well for @safe etc:
>
> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic
>
> so be sure to check against that too.

My intent is to use the D implementation in the druntime.  I'm trying to replace the runtime hooks with templates, so we don't have to rely so much on `TypeInfo` and we can begin to use more of D without linking in the runtime.

But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.  If I replace the runtime hook with a template, I need to be honest, and that means all the low-level code in druntime that the runtime hook depends on needs to be @safe, nothrow, and pure.  So I'm starting with the fundamental dependencies on the C library, trying to ensure they can be replaced with D implementations and used in D idiomatically.  For example, memcpy might look like this.

void memcpy(T)(T[] dest, T[] src);

We can extract size, alignment, etc. at compile-time.
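As a sketch of the idea (hypothetical names and strategy, not actual druntime code), such a template might dispatch on compile-time information like this:

```d
import core.stdc.string : memcpy;

// Hypothetical sketch, not actual druntime code: the template sees
// T.sizeof and T.alignof at compile time, so it can pick a copy
// strategy without any runtime dispatch.
void memcpyTemplate(T)(T[] dest, const(T)[] src) @trusted
{
    assert(dest.length == src.length);

    static if (T.sizeof <= size_t.sizeof)
    {
        // Small elements: a simple loop the optimizer can vectorize.
        foreach (i; 0 .. src.length)
            dest[i] = src[i];
    }
    else
    {
        // Larger elements: fall back to the C routine for now.
        memcpy(dest.ptr, src.ptr, T.sizeof * src.length);
    }
}
```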

So, my goal is not for user-facing code, but for improving the druntime implementations.

Does that make sense?

Mike
June 10, 2018
On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:

> Please make one that guarantee the usage of the corresponding backend intrinsic, for example on LLVM.

I tested with LDC and got similar results.  I thought the implementation in C forwarded to the backend intrinsic.  I think even LDC links with GCC, and the distribution of GCC is tuned for a given platform and architecture, so am I not already utilizing GCC's backend intrinsic?

Also, I'm not trying to find the holy grail of memcpy with this endeavor.  I just want to find something as good or better than what the druntime is currently using, but implemented in D.  Even if it's not optimal, if it's as good as what druntime is currently using, I want to replace those calls in the runtime with a D implementation.

Mike


June 10, 2018
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

void memcpyD()
{
    dst = src.dup;
}

void memcpyD2()
{
    dst[] = src[];
}

-----
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two?  If you can't tell, I'm actually not a very good D programmer.

Mike



June 11, 2018
On 11/06/2018 1:45 AM, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
> 
>> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic
> 
> void memcpyD()
> {
>      dst = src.dup;

malloc (for slice not static array)

> }
> 
> void memcpyD2()
> {
>      dst[] = src[];

memcpy

> }
> 
> -----
> memcpyD: 1 ms, 725 μs, and 1 hnsec
> memcpyD2: 587 μs and 5 hnsecs
> memcpyASM: 119 μs and 5 hnsecs
> 
> Still, the ASM version is much faster.
> 
> btw, what's the difference between the two?  If you can't tell, I'm actually not a very good D programmer.
> 
> Mike

June 10, 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
>
>> arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic
>
> void memcpyD()
> {
>     dst = src.dup;
> }
>
> void memcpyD2()
> {
>     dst[] = src[];
> }
>
> -----
> memcpyD: 1 ms, 725 μs, and 1 hnsec
> memcpyD2: 587 μs and 5 hnsecs
> memcpyASM: 119 μs and 5 hnsecs
>
> Still, the ASM version is much faster.
>
> btw, what's the difference between the two?  If you can't tell, I'm actually not a very good D programmer.
>
> Mike

I would increase the test data size so that each run takes at least a few seconds.
Otherwise the benchmark won't tell you much, because it's subject to too much randomness.
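For instance, a standalone sketch of that suggestion (the iteration count of 1_000_000 is an arbitrary assumption, chosen only to push each measured total into the seconds range):

```d
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

enum length = 4096 * 2;
ubyte[length] dst, src;

void copySlice() { dst[] = src[]; }

void main()
{
    // With enough iterations, timer granularity and scheduler noise
    // become a small fraction of the measured total.
    auto r = benchmark!copySlice(1_000_000);
    writeln("copySlice: ", r[0]);
}
```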