Replacing C's memcpy with a D implementation (page 5)

On 6/11/2018 6:00 AM, Steven Schveighoffer wrote:

> No, __doPostblit is necessary -- you are making a copy.
>
> example:
>
> File[] fs = new File[5];
>
> fs[0] = ...; // initialize fs
> auto fs2 = fs;
> fs.length = 100;
>
> At this point, fs points at a separate block from fs2. If you did not do postblit on this, then when one of those arrays is destroyed, the other will become invalid, as the File reference count would go to 0.

Yes, you're right. This should probably go as a comment in the code in case it comes up again.
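The point about the postblit can be illustrated with a minimal sketch. `Handle` and `rawCopyThenPostblit` are hypothetical names invented here, with `Handle` standing in for a reference-counted type such as `File`. The direct call to the compiler-generated `__xpostblit` is only for illustration; it approximates what the runtime does per element after copying a block, and is not how user code would normally copy a struct.

```d
import core.stdc.string : memcpy;

// Hypothetical stand-in for a reference-counted type such as File.
struct Handle
{
    int* refs;                        // shared owner count (may be null)
    this(this) { if (refs) ++*refs; } // postblit: every copy registers a new owner
}

// Returns the owner count after a raw bitwise copy, and again after the
// postblit has been run on that copy.
int[2] rawCopyThenPostblit()
{
    int count = 1;
    auto original = Handle(&count);   // first owner, count stays at 1

    Handle[1] copy = void;            // uninitialized storage for the copy
    memcpy(copy.ptr, &original, Handle.sizeof);
    int afterMemcpy = count;          // two owners exist, but the count still says 1

    copy[0].__xpostblit();            // illustration only: run the postblit by hand
    return [afterMemcpy, count];
}

unittest
{
    // Without the postblit the count is one too low; with it, both owners are counted.
    assert(rawCopyThenPostblit() == [1, 2]);
}
```

The sketch can be tried with `dmd -unittest -main -run` on a file containing it. If the count were left at 1, destroying either owner would drop it to 0 while the other is still in use, which is the failure described in the quoted post.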

BTW the way memcpy is (was?) implemented in the C runtime that came with the Intel C++ compiler was really enlightening about the sheer difficulty of such a task.

First of all, there isn't one loop but many, depending on the source and destination alignment.

- If both are aligned on 16-byte boundaries, source and destination operands would be accessed with MOVAPS/MOVDQA, nothing special.
- If only the source or the destination was misaligned, the function would dispatch to a variant whose core loop loads 16-byte aligned and writes 16-byte unaligned, using the PALIGNR instruction. However, since PALIGNR takes its shift count as an immediate and not as a runtime value, this variant was _replicated 16 times_.
- I don't remember what happened when both source and destination were misaligned, but you can reduce that case to the one above.

Each of these loops had a complicated prelude that does the first iteration, and those are very hard to write by hand. It was also the only piece of assembly I've seen that (apparently) successfully used the "prefetch" instructions.

This was just the SSE version; AVX was different. I don't know if someone really wrote this code, or if it was all from intrinsics.
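The overall shape described above (a prelude, then a core loop chosen by alignment, then a tail) can be sketched in portable D. This is a simplified illustration and not the Intel code: it swaps 16-byte SSE registers for machine words, and it replaces the PALIGNR variants with a plain byte loop when the source stays misaligned. The name `sketchCopy` is hypothetical, and the buffers are assumed not to overlap.

```d
// Hypothetical sketch of an alignment-dispatching copy (non-overlapping buffers).
void sketchCopy(void* dst, const(void)* src, size_t n) nothrow @nogc
{
    enum W = size_t.sizeof;            // word size stands in for the 16-byte SSE width
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;

    // Prelude: copy single bytes until the destination is word-aligned.
    while (n > 0 && (cast(size_t) d) % W != 0)
    {
        *d++ = *s++;
        --n;
    }

    // Core loop: only taken when the source ended up word-aligned as well.
    if ((cast(size_t) s) % W == 0)
    {
        while (n >= W)
        {
            *cast(size_t*) d = *cast(const(size_t)*) s;
            d += W;
            s += W;
            n -= W;
        }
    }

    // Tail, and fallback for a misaligned source: byte by byte.
    while (n > 0)
    {
        *d++ = *s++;
        --n;
    }
}

unittest
{
    ubyte[64] src;
    foreach (i, ref b; src)
        b = cast(ubyte) i;

    // Exercise every combination of source and destination misalignment.
    foreach (sOff; 0 .. 8)
        foreach (dOff; 0 .. 8)
        {
            ubyte[64] dst;
            sketchCopy(dst.ptr + dOff, src.ptr + sOff, 40);
            assert(dst[dOff .. dOff + 40] == src[sOff .. sOff + 40]);
        }
}
```

Even this reduced version needs three loops and two alignment checks, which gives some idea of why the real SSE implementation, with one core loop per possible shift, was so large.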

On Mon, 11 Jun 2018 10:54:23 +0000, Mike Franklin wrote:

> On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:
>> On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:
>>
>>>> I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.
>>>
>>> Just making it a template doesn't automatically enable any of this.
>>
>> I think it does, because I can then generate specific code based on the type information at compile-time.
>
> Also, before you do any more nay-saying, you might want to revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface.
>
> Mike

I guess for most D runtime hooks, using templates is a good idea to enable inlining and further optimizations.

I understand that you actually need to reimplement memcpy, as in your microcontroller usecase you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on.

However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/system/nogc is simple* either way. For the D implementation, the compiler will verify this for you; for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.

On a related note, I agree that the compiler sometimes cheats by ignoring attributes, especially when calling TypeInfo related functions, and this is a huge problem. Runtime TypeInfo is not descriptive enough to fully represent the types, and whenever the compiler casts without properly checking first, there's the possibility of a problem.

-- Johannes

On 6/11/2018 11:17 AM, Guillaume Piolat wrote:

> I don't know if someone really wrote this code, or if it was all from intrinsics.

memcpy is so critical to success that it was likely written by Intel itself to ensure every drop of perf is wrung out of the CPU. If I was Intel CEO, I'd direct the CPU hardware guys to do this and give it away.

June 12, 2018

Re: Replacing C's memcpy with a D implementation

Posted by Mike Franklin
in reply to Johannes Pfau

On Monday, 11 June 2018 at 18:34:58 UTC, Johannes Pfau wrote:

> I understand that you actually need to reimplement memcpy, as in your microcontroller usecase you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on.
>
> However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/system/nogc is simple* either way. For the D implementation, the compiler will verify this for you, for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.

My plans go beyond microcontrollers.  Mostly, I'd like to be able to use more features of D without having to link in a pre-built runtime.  This is especially convenient for cross-compiling scenarios.  By replacing the runtime hooks with templates, and the software building blocks with D implementations, we'd no longer need to obtain a C toolchain and a pre-compiled druntime library just to get a build for our target.  We'd just need a cross-compiler like LDC or GDC.  After adding druntime to the import path, we'd have everything we need for our target.  Try, today, to create an LDC cross-compiler for a Windows host and an ARM target like the Raspberry Pi.  You'll quickly realize what a hassle it all is.

Another issue with memcpy being generated by compilers is that no one knows what it's actually doing without building and looking at the assembly.  And the assembly will change based on what's being built.  Have you tried to look at the GCC implementation?  I have, and I never want to again.  If the implementation were templated in D, we'd be able to see exactly what code would be generated, and when, by simply viewing the D runtime source code.  It'd be easier to understand, predict, port, and enhance.

There isn't one single game-changing benefit this endeavor would bring, and I guess that's why it's difficult to comprehend, but collectively, all the little things it would enable would make it quite compelling.

Mike
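As an illustration of the argument that a template can pick code from type information at compile time, here is a minimal sketch. `typedCopy` and `Vec3` are hypothetical names chosen for this example and are not existing druntime symbols; the size and alignment thresholds are assumptions made for illustration, not a tuned design.

```d
// Hypothetical templated copy hook: the branch is selected at compile time
// from T.sizeof and T.alignof, so each instantiation contains only one path.
void typedCopy(T)(ref T dst, ref const T src) @trusted pure nothrow @nogc
{
    static if (T.sizeof == 4 && T.alignof >= uint.alignof)
    {
        // 4-byte, suitably aligned type: a single 32-bit load and store.
        *cast(uint*) &dst = *cast(const(uint)*) &src;
    }
    else static if (T.sizeof == 8 && T.alignof >= ulong.alignof)
    {
        // 8-byte, suitably aligned type: a single 64-bit load and store.
        *cast(ulong*) &dst = *cast(const(ulong)*) &src;
    }
    else
    {
        // Any other type: generic byte loop with a compile-time trip count.
        auto d = cast(ubyte*) &dst;
        auto s = cast(const(ubyte)*) &src;
        foreach (i; 0 .. T.sizeof)
            d[i] = s[i];
    }
}

struct Vec3 { int x, y, z; } // 12 bytes, so it takes the generic branch

unittest
{
    float a = 1.5f, b = 0;
    typedCopy(b, a);
    assert(b == 1.5f);

    double c = 2.25, d = 0;
    typedCopy(d, c);
    assert(d == 2.25);

    Vec3 p = Vec3(1, 2, 3), q;
    typedCopy(q, p);
    assert(q == p);
}
```

Because the selection happens in `static if`, reading the source shows which path a given type takes, and the compiler verifies the `pure nothrow @nogc` attributes against the body. Whether this beats the compiler's built-in handling of a C `memcpy` call is exactly what the thread is debating and is not something the sketch settles.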

On Monday, 11 June 2018 at 08:02:42 UTC, Walter Bright wrote:

> On 6/10/2018 9:44 PM, Patrick Schluter wrote:
>> See what Agner Fog has to say about it:
>
> Thanks. Agner Fog gets the last word on this topic!

Well, Agner is rarely wrong indeed, but there is a limit to how much material a single person can keep up to date. On newer uarchs, `rep movsb` isn't slower than `rep movsd`, and often performs similarly to the best SSE2 implementation (using NT stores). See "BeeOnRope"'s answer to this StackOverflow question for an in-depth discussion about this: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy

AVX2 seems to offer extra performance beyond that, though, if it is available (for example if runtime feature detection is used). I believe I read a comment by Agner somewhere to that effect as well; a search engine will certainly be able to turn up more.

-- David

On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:

> - default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c

To see what is executed when you call memcpy() on a regular GNU/Linux distro, you'd want to have a look at glibc instead. For example, the AVX2 and AVX512 implementations are in sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S and sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S, as forwarded to by memmove-avx2-unaligned-erms.S and memmove-avx512-unaligned-erms.S.

(Pop quiz: Why might a separate "no-vzeroupper" variant be a good idea?)

-- David
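The runtime feature detection mentioned in these posts can be sketched in D with `core.cpuid`. The names `CopyFn`, `copyChunked` and `selectCopy` are hypothetical, and the variants are placeholders: each one is a plain unrolled byte loop whose chunk size merely stands in for a vector width, not a real AVX2 or SSE2 implementation. Only the selection mechanism is the point here.

```d
import core.cpuid : avx2, sse2;

// Hypothetical signature shared by all copy variants.
alias CopyFn = void function(void* dst, const(void)* src, size_t n) nothrow @nogc;

// Placeholder variant: copies N bytes per iteration (unrolled at compile
// time), then finishes the remainder one byte at a time.
void copyChunked(size_t N)(void* dst, const(void)* src, size_t n) nothrow @nogc
{
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;
    while (n >= N)
    {
        static foreach (i; 0 .. N)
        {
            d[i] = s[i];
        }
        d += N;
        s += N;
        n -= N;
    }
    while (n > 0)
    {
        *d++ = *s++;
        --n;
    }
}

// Picks a variant from the CPU features reported at run time.
CopyFn selectCopy() nothrow @nogc
{
    if (avx2) return &copyChunked!32; // stands in for a 32-byte-vector variant
    if (sse2) return &copyChunked!16; // stands in for a 16-byte-vector variant
    return &copyChunked!1;            // plain byte loop
}

unittest
{
    auto fn = selectCopy();
    ubyte[100] a, b;
    foreach (i, ref x; a)
        x = cast(ubyte) i;
    fn(b.ptr, a.ptr, a.length);
    assert(a == b);
}
```

A real implementation would presumably make the selection once and cache the function pointer, which is roughly the role the forwarding files named above play in glibc.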
