May 29, 2019
On Wednesday, 29 May 2019 at 19:35:36 UTC, Jonathan Marler wrote:
>
> Yes that would be an answer, I guess I got confused when you mentioned CTFE and introspection, I wasn't sure if "benchmarks" was referring to those features or to runtime benchmarks.  And looks like @Mike posted the benchmarks on that github link you sent.
>
Great. As you can see in the benchmarks, memcpyD is faster than libc memcpy
except for sizes larger than 32768. We hope to surpass those as well;
yesterday I did some simple inline SIMD experiments and got better
performance at 32768. The previous work is of course Mike's, and those
benchmark results are in part due to inlining.

>
> It's true that if you can assume pointers are aligned on a particular boundary, you can be faster than memcpy, which works with any alignment.  This must be what Mike is doing; I would, though, create only a few instances of memcpy that assume alignment on boundaries like 4, 8, 16.  And if you have a pointer or an array of a particular type, you can probably assume that pointer/array is aligned on that type's "alignof" property.
>

This is, as I said, the alignment guarantee. I hope to get other benefits
from types as well.
We also hope to do LDC / GDC specific things, for example leveraging their intrinsics.
I will post an update shortly, like the other students, explaining some of that, but I thought I'd reply since we already started the discussion. :p

> I think I will use this in my library.

Great! We hope that it will be useful and any feedback is appreciated!


May 29, 2019
On Wednesday, 29 May 2019 at 11:46:28 UTC, Stefanos Baziotis wrote:
> My initial pick was void memcpyD(T)(T* dst, const T* src), but it was proposed
> that `ref` instead of pointers might be better.

ref would only work when copying one instance at a time. Many times, you'll want to copy a contiguous array of a length only known at runtime (and definitely NOT invoke memcpy in a loop, so that the implementation can e.g. use SIMD streaming when copying gazillions of 32-bit pixels).

I'd suggest a structure similar to this, minimizing bloat:

// int a, b;            memcpyD(&a, &b);
// int[4] a, b;         memcpyD(&a, &b);
// int[16] a; int[4] b; memcpyD!4(&a[8], b.ptr);
void memcpyD(size_t length = 1, T)(T* dst, const T* src)
{
    pickBestImpl!(T.alignof, length * T.sizeof)(dst, src);
}

void memcpyD(T)(T* dst, const T* src, size_t length)
{
    pickBestImpl!(T.alignof)(dst, src, length * T.sizeof);
}

private:

/* These 2 will probably share most logic, the first one just exploiting a
 * static size. A common mixin might come in handy (e.g., switching from
 * runtime-if to static-if).
 */
void pickBestImpl(size_t alignment, size_t size)(void* dst, const void* src);
void pickBestImpl(size_t alignment)(void* dst, const void* src, size_t size);
May 29, 2019
On Wednesday, 29 May 2019 at 18:14:11 UTC, Jonathan Marler wrote:
> and then forward to the real implementation

With D you can forward to the best-suited implementation. What libc does is perform various runtime checks to figure out the best way of copying the provided input. With D, it should be possible to make some of those checks at compile time. Secondly, C's memcpy is one big function not because that is best for performance but for convenience. With D we can have many smaller functions, selected by template magic.
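
Roughly what I mean, as a toy sketch of my own (made-up names, not memcpyD's actual code):

// toy sketch: the size/alignment checks happen at compile time, so each
// instantiation compiles down to a small specialized copy
void copy(T)(T* dst, const T* src)
{
    static if (T.sizeof <= size_t.sizeof)
    {
        *dst = *src;                                 // tiny POD: one load/store
    }
    else static if (T.alignof >= size_t.alignof && T.sizeof % size_t.sizeof == 0)
    {
        // word-aligned, word-multiple size: copy in word-sized chunks
        // (a real implementation would use SIMD here)
        foreach (i; 0 .. T.sizeof / size_t.sizeof)
            (cast(size_t*) dst)[i] = (cast(const(size_t)*) src)[i];
    }
    else
    {
        foreach (i; 0 .. T.sizeof)                   // fallback: byte-by-byte
            (cast(ubyte*) dst)[i] = (cast(const(ubyte)*) src)[i];
    }
}

libc memcpy has to make those decisions with runtime branches on every call; here each decision disappears at instantiation time.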

May 29, 2019
On Wednesday, 29 May 2019 at 20:28:18 UTC, Stefanos Baziotis wrote:
> On Wednesday, 29 May 2019 at 19:35:36 UTC, Jonathan Marler wrote:
>>[...]
> Great. As you can see in the benchmarks, memcpyD is faster than libc memcpy
> except for sizes larger than 32768. We hope to surpass those as well;
> yesterday I did some simple inline SIMD experiments and got better
> performance at 32768. The previous work is of course Mike's, and those
> benchmark results are in part due to inlining.
>
>>[...]
>
> This is, as I said, the alignment guarantee. I hope to get other benefits
> from types as well.
> We also hope to do LDC / GDC specific things, for example leveraging their intrinsics.
> I will post an update shortly, like the other students, explaining some of that, but I thought I'd reply since we already started the discussion. :p
>
>> [...]
>
> Great! We hope that it will be useful and any feedback is appreciated!

I haven't benchmarked it yet, but here are the changes I've made to my standard library to also take advantage of the alignment guarantees from typed pointers and arrays.

https://github.com/dragon-lang/mar/commit/bb096d2d4f489d47177f6a678b1d9bab756e3dc7

May 30, 2019
On Wednesday, 29 May 2019 at 20:50:45 UTC, kinke wrote:
>
> ref would only work when copying one instance at a time. Many times, you'll want to copy a contiguous array of a length only known at runtime (and definitely NOT invoke memcpy in a loop, so that the implementation can e.g. use SIMD streaming when copying gazillions of 32-bit pixels).
>

The current state is that we think slices should be enough for this need;
that is, you don't need a third size parameter, and in that case ref is better.
On the other hand, in other cases I think pointers are more intuitive.
Of course, what _I_ think is of little importance here; that post was made
primarily so that you, the community, can give feedback on this.

Apart from that, I'm still sceptical about whether we should provide
a version that takes a size.
May 30, 2019
On Wednesday, 29 May 2019 at 23:27:35 UTC, Jonathan Marler wrote:
>
> I haven't benchmarked it yet, but here are the changes I've made to my standard library to also take advantage of the alignment guarantees from typed pointers and arrays.
>
> https://github.com/dragon-lang/mar/commit/bb096d2d4f489d47177f6a678b1d9bab756e3dc7
>

Good. This week I'm also working on alignment (more specifically, misalignment).
Since you already took the time to play with alignment, you might find
SIMD instructions useful.
Take a look at Mike's memcpyD. My toy SIMD from yesterday that surpassed
libc memcpy was as simple as:

static foreach(i; 0 .. T.sizeof/32) {
    // Assuming RDI is 'dst' and RSI is 'src' (System V calling convention)
    asm pure nothrow @nogc {
        vmovdqa YMM0, [RSI+i*32]; // load 32 aligned bytes from src
        vmovdqa [RDI+i*32], YMM0; // store them to dst
    }
}
/* instead of
static foreach(i; 0 .. T.sizeof/32)
{
    memcpyD((cast(S!32*)dst) + i, (cast(const S!32*)src) + i);
}
*/

Again, really simple and dumb, but effective. A couple of notes, so that you
don't have the headaches I had:
1) You can use `vmovdqu` (note the 'u' at the end) for unaligned memory and
skip note 2.
2) `vmovdqa` assumes 32-byte aligned memory. Now, `align()` is kind of
buggy, so if you have a normal buffer on the stack that you want to align, this:
align(32) ubyte[32768] buf;
won't work.
One solution is to allocate the memory on the heap and do a little pointer
arithmetic to align it (see the sketch below).
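
A rough sketch of that workaround (plain malloc plus rounding up; a real implementation would wrap this more carefully):

import core.stdc.stdlib : malloc, free;

void main()
{
    enum size = 32768;
    // over-allocate by 31 bytes, then round the pointer up to the next
    // 32-byte boundary
    ubyte* raw = cast(ubyte*) malloc(size + 31);
    assert(raw !is null);
    ubyte* buf = cast(ubyte*) ((cast(size_t) raw + 31) & ~(cast(size_t) 31));
    assert((cast(size_t) buf & 31) == 0);   // now safe for vmovdqa

    // ... use buf[0 .. size] ...

    free(raw);   // free the original pointer, not the aligned one
}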

Last-minute discovery:
Haha, the compiler flags I used were: -mcpu=avx -inline
With these flags, memcpyD is faster.
_Removing_ -inline resulted in faster code for libc memcpy. I'll have to look
closer tomorrow.
(Oh, and judging from the disassembly, libc memcpy achieves these results with SSE3, i.e. 128-bit instructions. At the very least, that's impressive.)
May 30, 2019
On Thursday, 30 May 2019 at 00:18:06 UTC, Stefanos Baziotis wrote:
> The current state is that we think slices should be enough for this need;
> that is, you don't need a third size parameter, and in that case ref is better.
> On the other hand, in other cases I think pointers are more intuitive.

In D, there's no ugly and unsafe need to pass slices to memcpy, as a simple `dst[] = src[]` can do the job much better, boiling down to a memcpy (with 3rd param) if T is a POD (and the two slices don't overlap, have the same length etc. if bounds checks are enabled).
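
E.g. (assuming a POD element type):

void main()
{
    int[] src = [1, 2, 3, 4];
    auto dst = new int[](4);

    dst[] = src[];                // copies the contents; lengths must match
    dst[0 .. 2] = src[2 .. 4];    // sub-slices work too (must not overlap)

    assert(dst == [3, 4, 3, 4]);
}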

Taking a slice by ref, if I understand you correctly, would firstly only work with slice lvalues (i.e., no `ptr[0..$-1]` rvalues), and secondly IMO be very confusing and bad for generic code, as I would expect the slice itself to be memcopied then, not its contents.
May 30, 2019
On Thursday, 30 May 2019 at 00:55:54 UTC, Stefanos Baziotis wrote:
> Now, `align()` is kind of buggy

It works fine with LDC, and I guess with GDC too.
May 30, 2019
On Thursday, 30 May 2019 at 01:19:54 UTC, kinke wrote:

> In D, there's no ugly and unsafe need to pass slices to memcpy, as a simple `dst[] = src[]` can do the job much better, boiling down to a memcpy (with 3rd param) if T is a POD (and the two slices don't overlap, have the same length etc. if bounds checks are enabled).

This is an important observation.  My vision for the GSoC project was targeted primarily at druntime. D memcpy would rarely, if ever, be invoked directly by most users.  Expressions like `dst[] = src[]` and other assignment expressions that require memcpy as part of their behavior would be lowered by the compiler to the runtime memcpy template.
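
Purely as an illustration (`memcpyDHook` below is a placeholder name, not an actual druntime symbol), the idea is roughly:

// hypothetical typed runtime template the compiler could target
void memcpyDHook(T)(T* dst, const(T)* src, size_t length)
{
    foreach (i; 0 .. length)
        dst[i] = src[i];    // stand-in body; the real one would dispatch on T
}

void main()
{
    int[256] a, b;
    b[] = a[];                                 // what the user writes
    memcpyDHook!int(b.ptr, a.ptr, b.length);   // roughly what it could lower to
}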

Mike

May 30, 2019
On Thursday, 30 May 2019 at 01:35:05 UTC, Mike Franklin wrote:
> On Thursday, 30 May 2019 at 01:19:54 UTC, kinke wrote:
>
>> In D, there's no ugly and unsafe need to pass slices to memcpy, as a simple `dst[] = src[]` can do the job much better, boiling down to a memcpy (with 3rd param) if T is a POD (and the two slices don't overlap, have the same length etc. if bounds checks are enabled).
>
> This is an important observation.  My vision for the GSoC project was targeted primarily at druntime. D memcpy would rarely, if ever, be invoked directly by most users.

If we don't really target users, then that makes this:

> Apart from that, I'm still sceptical about whether we should provide
> a version that takes a size.

not important. My thought was that a lot of users would have some pointers
a and b and want to do something like: memcpy(a, b, some_size);

What I'm thinking is that yes, we decouple D from libc _in druntime_.
But in general, users may still want that.