May 11, 2019
On Saturday, 11 May 2019 at 00:09:08 UTC, Mike Franklin wrote:
> On Friday, 10 May 2019 at 23:51:56 UTC, H. S. Teoh wrote:
>
>> I'm not 100% sure it's a good idea to implement memcpy in D just to prove that it can be done / just to say that we're independent of libc. Libc implementations of fundamental operations, esp. memcpy, are usually optimized to next week and back for the target architecture, taking advantage of the target arch's quirks to maximize performance. Not to mention that advanced compiler backends recognize calls to memcpy and can optimize it in ways they can't optimize a generic D function they fail to recognize as being equivalent to memcpy. I highly doubt a generic D implementation could hope to beat that, and it's a little unrealistic, given our current manpower situation, for us to be able to optimize it for each target arch ourselves.
>
> I understand that point of view.  Indeed we have to demonstrate benefit.  One benefit is to not have to obtain a C toolchain when building D programs.  That is actually quite an inconvenient barrier to entry when cross-compiling (e.g. for developing microcontroller firmware on a PC).
>
> I'm also hoping that a D implementation would be easier to comprehend than something like this:  https://github.com/opalmirror/glibc/blob/c38d0272b7c621078d84593c191e5ee656dbf27c/sysdeps/arm/memcpy.S  The D implementation still has to handle all of those corner cases, but I'd rather read D code with inline assembly sprinkled here and there than read the entire thing in assembly.  The goal with the D implementation would be to minimize the assembly.
>
> For compilers that already do something special with memcpy and don't require a C standard library, there's no reason to do anything.  My initial exploration into this has shown that DMD is not one of those compilers.

Also, take a look at this data:  https://forum.dlang.org/post/jdfiqpronazgglrkmwfq@forum.dlang.org  Why is DMD making 48,000 runtime calls to memcpy to copy 8 bytes of data?  Many of those calls should be inlined.  I see opportunity for improvement there.

Mike
May 10, 2019
On Sat, May 11, 2019 at 12:23:31AM +0000, Mike Franklin via Digitalmars-d-announce wrote: [...]
> Also, take a look at this data: https://forum.dlang.org/post/jdfiqpronazgglrkmwfq@forum.dlang.org  Why is DMD making 48,000 runtime calls to memcpy to copy 8 bytes of data? Many of those calls should be inlined.  I see opportunity for improvement there.
[...]

When it comes to performance, I've essentially given up looking at DMD output. DMD's inliner gives up far too easily, leading to a lot of calls that aren't inlined when they really should be, and DMD's optimizer does not have loop unrolling, which excludes a LOT of subsequent optimizations that could have been applied.  I wouldn't base any performance decisions on DMD output. If LDC or GDC produces non-optimal code, then we have cause to do something. Otherwise, IMO we're just uglifying D code and making it unmaintainable for no good reason.


T

-- 
Recently, our IT department hired a bug-fix engineer. He used to work for Volkswagen.
May 11, 2019
On Saturday, 11 May 2019 at 00:32:54 UTC, H. S. Teoh wrote:

> When it comes to performance, I've essentially given up looking at DMD output. DMD's inliner gives up far too easily, leading to a lot of calls that aren't inlined when they really should be, and DMD's optimizer does not have loop unrolling, which excludes a LOT of subsequent optimizations that could have been applied.  I wouldn't base any performance decisions on DMD output. If LDC or GDC produces non-optimal code, then we have cause to do something. Otherwise, IMO we're just uglifying D code and making it unmaintainable for no good reason.

I think this thread is beginning to lose sight of the larger picture.  What I'm trying to achieve is the opt-in continuum that Andrei mentioned elsewhere on this forum.  We can't do that with the way the compiler and runtime currently interact.  So, the first task, which I'm trying to get around to, is to convert runtime hooks to templates.  Using the compile-time type information will allow us to avoid `TypeInfo`, therefore classes, therefore the entire D runtime.  We're now much closer to the opt-in continuum Andrei mentioned previously on this forum.  Now let's assume that's done...
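As a rough illustration of the hook-to-template idea (a sketch, not the actual druntime hook; the name `arrayEquals` is hypothetical): when the element type is a template parameter, the comparison is resolved at compile time and no `TypeInfo` object has to be consulted at run time.

```d
// Hypothetical templated hook. The element type T is known at compile
// time, so the comparison below is ordinary typed code rather than a
// virtual call through a runtime TypeInfo object.
bool arrayEquals(T)(scope const T[] lhs, scope const T[] rhs)
{
    if (lhs.length != rhs.length)
        return false;

    // Element-wise comparison using T's own equality.
    foreach (i; 0 .. lhs.length)
    {
        if (lhs[i] != rhs[i])
            return false;
    }
    return true;
}
```

Attributes such as `@safe`, `nothrow`, and `@nogc` are deliberately left to template attribute inference in this sketch, so the hook is only as restrictive as `T` itself requires.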

Those new templates will eventually call a very few functions from the C standard library, memcpy being one of them.  Because the runtime hooks are now templates, we have type information that we can use in the call to memcpy.  Therefore, I want to explore implementing `void memcpy(T)(ref T dst, const ref T src) @safe nothrow pure @nogc` rather than `void* memcpy(void*, const void*, size_t)`.  There are some issues here such as template bloat and compile times, but I want to explore it anyway.  I'm trying to imagine what memcpy in D would look like if we didn't have a C implementation clouding and narrowing our imagination.  I don't know how that will turn out, but I want to explore it.

For LDC we can just do something like this...

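void memcpy(T)(ref T dst, const ref T src) @safe nothrow @nogc pure
{
    version (LDC)
    {
        // Cast dst and src to byte arrays; the casts go in @trusted blocks.
        auto dstArray = () @trusted { return (cast(ubyte*) &dst)[0 .. T.sizeof]; }();
        auto srcArray = () @trusted { return (cast(const(ubyte)*) &src)[0 .. T.sizeof]; }();

        // Plain byte-by-byte loop over the T.sizeof bytes of the value.
        for (size_t i = 0; i < T.sizeof; i++)
            dstArray[i] = srcArray[i];
    }
}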

LDC is able to see that as memcpy and do the right thing.  Also if the LDC developers want to do their own thing altogether, more power to them.  I don't see anything ugly about it.

However, DMD won't do the right thing.  I guess others are thinking that we'd just re-implement `void* memcpy(void*, const void*, size_t)` in D and we'd throw in a runtime call to `memcpy(&dstArray[0], &srcArray[0], T.sizeof)`.  That's ridiculous.  What I want to do is use the type information to generate an optimal implementation (considering size and alignment) that DMD will be forced to inline with `pragma(inline, true)`.  That implementation can also take into consideration target features such as SIMD.  I don't believe the code will be complex, and I expect it to perform at least as well as the C implementation.  My initial tests show that it will actually outperform the C implementation, but that could be a problem with my tests.  I'm still researching it.
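A minimal sketch of what such a type-informed copy might look like, under stated assumptions: the names `WordOf`, `fitsInRegister`, and `typedCopy` are hypothetical, the size cut-offs are illustrative rather than measured, and whether a given compiler actually honors the inline request has not been verified here.

```d
// Maps a byte count to an unsigned integer type of that size,
// or void when no such type exists. (Hypothetical helper.)
template WordOf(size_t n)
{
    static if (n == 1) alias WordOf = ubyte;
    else static if (n == 2) alias WordOf = ushort;
    else static if (n == 4) alias WordOf = uint;
    else static if (n == 8) alias WordOf = ulong;
    else alias WordOf = void;
}

// True when T can be moved with one naturally aligned integer load/store.
enum bool fitsInRegister(T) =
    !is(WordOf!(T.sizeof) == void) && T.alignof >= T.sizeof;

// Small, suitably aligned types: a single load and store, requested inline.
void typedCopy(T)(ref T dst, ref const T src) @trusted nothrow @nogc pure
if (fitsInRegister!T)
{
    pragma(inline, true);
    *cast(WordOf!(T.sizeof)*) &dst = *cast(const(WordOf!(T.sizeof))*) &src;
}

// Everything else: a plain byte loop that the backend is free to optimize.
void typedCopy(T)(ref T dst, ref const T src) @trusted nothrow @nogc pure
if (!fitsInRegister!T)
{
    auto d = cast(ubyte*) &dst;
    auto s = cast(const(ubyte)*) &src;
    foreach (i; 0 .. T.sizeof)
        d[i] = s[i];
}
```

The strategy is selected entirely by the template constraints, so each instantiation contains only the code path for its own `T`; a SIMD path for large types would be one more constrained overload.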

Now assuming that's done, we now have language runtime implementations that are isolated from heavier runtime features (like the `TypeInfo` classes) that can easily be used in -betterC builds, bare-metal systems programming, etc. simply by importing them as a header-only library; it doesn't require first compiling (or cross-compiling) a runtime for linking with your program; you just import and go.  We're now much closer to the opt-in continuum.

Now what about development of druntime itself?  Well, wouldn't it be nice if we could utilize things like `std.traits`, `std.meta`, `std.conv`, and a bunch of other stuff from Phobos?  Wouldn't it also be nice if we could use that stuff in DMD itself without importing Phobos?  So let's take that stuff in Phobos that doesn't need druntime and put it in a library that doesn't require druntime (i.e. utiliD).  Now druntime can import utiliD and have more idiomatic-D implementations.

But the benefits don't stop there: bare-metal developers, microcontroller developers, kernel driver developers, OS developers, etc. can all use the runtime-less library to bootstrap their own implementations without having to re-invent or copy code out of Phobos and druntime.

I'm probably not articulating this vision well.  I'm sorry.  Maybe we'll just have to hope I can find the time and energy to do it myself and then others will finally see from the results.  Or maybe I'll go have a nice helping of crow.

Mike


May 10, 2019
On Sat, May 11, 2019 at 01:45:08AM +0000, Mike Franklin via Digitalmars-d-announce wrote: [...]
> I think this thread is beginning to lose sight of the larger picture. What I'm trying to achieve is the opt-in continuum that Andrei mentioned elsewhere on this forum.  We can't do that with the way the compiler and runtime currently interact.  So, the first task, which I'm trying to get around to, is to convert runtime hooks to templates. Using the compile-time type information will allow us to avoid `TypeInfo`, therefore classes, therefore the entire D runtime.  We're now much closer to the opt-in continuum Andrei mentioned previously on this forum.  Now let's assume that's done...

Yes, that's definitely a direction we want to head in.  I think it will be very beneficial.


> Those new templates will eventually call a very few functions from the C standard library, memcpy being one of them.  Because the runtime hooks are now templates, we have type information that we can use in the call to memcpy.  Therefore, I want to explore implementing `void memcpy(T)(ref T dst, const ref T src) @safe nothrow pure @nogc` rather than `void* memcpy(void*, const void*, size_t)`.  There are some issues here such as template bloat and compile times, but I want to explore it anyway.  I'm trying to imagine what memcpy in D would look like if we didn't have a C implementation clouding and narrowing our imagination.  I don't know how that will turn out, but I want to explore it.

Put this way, I think that's a legitimate area to explore. But copying a block of memory from one place to another is simply just that, copying a block of memory from one place to another.  It just boils down to how to copy N bytes from A to B in the fastest way possible. For that, you just reduce it to moving K words (the size of which depends only on the target machine, not the incoming type) of memory from A to B, plus or minus a few bytes at the end for non-aligned data. The type T only matters if you need to do type-specific operations like call default ctors / dtors, but at the memcpy level that should already have been taken care of by higher-level code, and it isn't memcpy's concern what ctors/dtors to invoke.

The one thing knowledge of T can provide is whether or not T[] can be unaligned. If T.sizeof < machine word size, then you need extra code to take care of the start/end of the block; otherwise, you can just go straight to the main loop of copying K words from A to B. So that's one small thing we can take advantage of. It could save a few cycles by avoiding a branch hazard at the start/end of the copy, and making the code smaller for inlining.

Anything else you optimize on copying K words from A to B would be target-specific, like using vector ops, specialized CPU instructions, and the like. But once you start getting into that, you start getting into the realm of whether all the complex setup needed for, e.g., a vector op is worth the trouble if T.sizeof is small. Perhaps here's another area where knowledge of T can help (if T is small, just use a naïve for-loop; if T is sufficiently large, it could be worth incurring the overhead of setting up vector copy registers, etc., because it makes copying the large body of T faster).

So potentially a D-based memcpy could have multiple concrete implementations (copying strategies) that are statically chosen based on the properties of T, like alignment and size.
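To make the idea of statically chosen strategies concrete, here is a small sketch for copying a slice of `T`. The name `copyElements` is made up for illustration, and the unaligned path is simply a byte loop rather than the head/tail handling a tuned implementation would use.

```d
// Copies src into dst, picking the loop at compile time from T's
// size and alignment. (Hypothetical sketch, not a tuned implementation.)
void copyElements(T)(T[] dst, const(T)[] src) @trusted nothrow @nogc pure
{
    assert(dst.length == src.length, "length mismatch");

    // If every element starts on a word boundary and spans whole words,
    // the block needs no special handling at its start or end.
    enum wordAligned =
        T.alignof >= size_t.alignof && T.sizeof % size_t.sizeof == 0;

    immutable bytes = dst.length * T.sizeof;

    static if (wordAligned)
    {
        // Straight to the main loop: copy K machine words.
        auto d = cast(size_t*) dst.ptr;
        auto s = cast(const(size_t)*) src.ptr;
        foreach (i; 0 .. bytes / size_t.sizeof)
            d[i] = s[i];
    }
    else
    {
        // Conservative fallback for small or loosely aligned element types.
        auto d = cast(ubyte*) dst.ptr;
        auto s = cast(const(ubyte)*) src.ptr;
        foreach (i; 0 .. bytes)
            d[i] = s[i];
    }
}
```

Because the branch is a `static if`, an instantiation such as `copyElements!long` contains no start/end handling at all, which is the cycle and code-size saving described above.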


[...]
> However, DMD won't do the right thing.

Honestly, at this point I don't even care.


> I guess others are thinking that we'd just re-implement `void*
> memcpy(void*, const void*, size_t)` in D and we'd throw in a runtime
> call to `memcpy(&dstArray[0], &srcArray[0], T.sizeof)`.  That's
> ridiculous.  What I want to do is use the type information to generate
> an optimal implementation (considering size and alignment) that DMD
> will be forced to inline with `pragma(inline, true)`.

It could be possible to select multiple different memcpy implementations by statically examining the properties of T.  I think that might be one advantage D could have over just calling libc's memcpy.  But you have to be very careful not to outdo the compiler's optimizer so that it doesn't recognize it as memcpy and fails to apply what would otherwise be a routine optimization pass.


> That implementation can also take into consideration target features such as SIMD.  I don't believe the code will be complex, and I expect it to perform at least as well as the C implementation.  My initial tests show that it will actually outperform the C implementation, but that could be a problem with my tests.  I'm still researching it.

Actually, if you want to compete with the C implementation, you might find that things could get quite hairy. Maybe not with memcpy, but other functions like memchr have very clever hacks to speed it up that you probably wouldn't think of without reading C library source code. There may also be subtle differences that change depending on the target; it used to be that `rep movsd` was faster in spite of requiring more overhead setting up; but last I read, newer CPUs seem to have `rep movsd` perform rather poorly whereas a plain ole for-loop actually outperforms `rep movsd`.  At a certain point, this just begs the question "should I just let the compiler's backend do its job by telling it plainly that I mean memcpy, or should I engage in asm-hackery because I'm confident I can outdo the compiler's codegen?".

One thing that might be worth considering is for the *compiler* to expose a memcpy intrinsic, and then let the compiler decide how best to implement it (using its intimate knowledge of the target machine arch), rather than trying to do it manually in library code.


> Now assuming that's done, we now have language runtime implementations that are isolated from heavier runtime features (like the `TypeInfo` classes) that can easily be used in -betterC builds, bare-metal systems programming, etc. simply by importing them as a header-only library; it doesn't require first compiling (or cross-compiling) a runtime for linking with your program; you just import and go.  We're now much closer to the opt-in continuum.
> 
> Now what about development of druntime itself?  Well, wouldn't it be nice if we could utilize things like `std.traits`, `std.meta`, `std.conv`, and a bunch of other stuff from Phobos?

Based on what Andrei has voiced, the way to go would be to merge Phobos and druntime into one, by making Phobos completely opt-in so that you don't pay for what you don't use from the heavier / higher-level parts of Phobos.  At a certain point it becomes clear that the division between Phobos and druntime is artificial, the result of historical accident, and not a logical necessity that we have to keep. If Phobos is made completely pay-as-you-go, the distinction becomes completely irrelevant and the two might as well be merged into one.


> Wouldn't it also be nice if we could use that stuff in DMD itself without importing Phobos?  So let's take that stuff in Phobos that doesn't need druntime and put it in a library that doesn't require druntime (i.e. utiliD).  Now druntime can import utiliD and have more idiomatic-D implementations.

See, this trouble is caused by the artificial boundary between Phobos and druntime.  We should look into breaking down this barrier, not enforcing it.


> But the benefits don't stop there: bare-metal developers, microcontroller developers, kernel driver developers, OS developers, etc. can all use the runtime-less library to bootstrap their own implementations without having to re-invent or copy code out of Phobos and druntime.
[...]

I think the logical goal is to make Phobos completely pay-as-you-go. IOW, an actual *library*, as opposed to a tangled hairball of dependencies that always comes with strings attached (can't import one small thing without pulling in the rest of the hairball). A library is supposed to be a set of resources which you can draw from as needed. Pulling out one book (module) should not require pulling out half the library along with it.


T

-- 
Once the bikeshed is up for painting, the rainbow won't suffice. -- Andrei Alexandrescu
May 11, 2019
On Friday, 10 May 2019 at 23:58:37 UTC, Mike Franklin wrote:
>
> I don't know how a proper assembly implementation would not be performant.  Perhaps you could elaborate.

Inline assembly prevents a lot of optimizations that give large performance gains, such as constant propagation. Say you implement a memcpy with a different signature than C's memcpy (because of slices instead of pointers); then the optimizer does not know what the semantics of that function are and will need the function to be transparent (not assembly) to do such optimizations.

But I'm sure you know all that, so that's not your question. :)

In the case of reimplementing memcpy/mem* in a function with the same signature as libc, that is not supposed to be inlined (like the current libc functions), then I also think the use of inline asm will not give a perf penalty. Be careful to recreate the exact same semantics as those libc functions because the optimizer is going to _assume_ it knows _exactly_ what those functions are doing.
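One way to see the transparency point in plain D is compile-time function evaluation. CTFE is a front-end feature, not the optimizer pass described above, so treat this only as an analogy: a copy routine written as ordinary D code can be evaluated and folded to a constant by the compiler, whereas a body made of inline assembly cannot. The names `copySlice` and `sumOfCopy` are hypothetical.

```d
// A copy routine with a slice-based signature, written as ordinary D.
// Nothing in it is opaque to the compiler.
void copySlice(T)(T[] dst, const(T)[] src)
{
    foreach (i; 0 .. src.length)
        dst[i] = src[i];
}

// Copies a known array and sums the result.
int sumOfCopy()
{
    int[3] src = [1, 2, 3];
    int[3] dst;
    copySlice(dst[], src[]);
    return dst[0] + dst[1] + dst[2];
}

// The whole call chain is evaluated at compile time; this would not be
// possible if copySlice were implemented with an asm block.
static assert(sumOfCopy() == 6);
```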

cheers,
  Johan

May 11, 2019
On Saturday, 11 May 2019 at 05:39:12 UTC, H. S. Teoh wrote:

> So potentially a D-based memcpy could have multiple concrete implementations (copying strategies) that are statically chosen based on the properties of T, like alignment and size.

Exactly.

> [...]
>> However, DMD won't do the right thing.
>
> Honestly, at this point I don't even care.

Personally I'd be fine with just killing off DMD's backend and just investing in LDC and GDC, but I don't think that's going to happen, and because of that, we have to care.  DMD is where policy and precedent are set for D.  To influence the direction of D, it must be done through DMD.

> It could be possible to select multiple different memcpy implementations by statically examining the properties of T.  I think that might be one advantage D could have over just calling libc's memcpy.  But you have to be very careful not to outdo the compiler's optimizer so that it doesn't recognize it as memcpy and fails to apply what would otherwise be a routine optimization pass.

I understand.  That's why I'm calling it an "exploration" at this time.  I want to see what can and can't be done.

> At a certain point, this just begs the question "should I just let the compiler's backend do its job by telling it plainly that I mean memcpy, or should I engage in asm-hackery because I'm confident I can outdo the compiler's codegen?".

I get that, but DMD is not the kind of backend that does that stuff.  If I could rely on DMD's, LDC's, and GDC's backend to just insert an optimized compiler intrinsic, without the C standard library, I would just leverage that. But that doesn't seem to be the world we're currently in.

> One thing that might be worth considering is for the *compiler* to expose a memcpy intrinsic, and then let the compiler decide how best to implement it (using its intimate knowledge of the target machine arch), rather than trying to do it manually in library code.

I would love for the backends to just know how to copy memory efficiently for all of their targets without me having to do anything, and without linking in the C standard library, but that's not what I'm seeing from the compilers right now.

> Based on what Andrei has voiced, the way to go would be to merge Phobos and druntime into one, by making Phobos completely opt-in so that you don't pay for what you don't use from the heavier / higher-level parts of Phobos.  At a certain point it becomes clear that the division between Phobos and druntime is artificial, the result of historical accident, and not a logical necessity that we have to keep. If Phobos is made completely pay-as-you-go, the distinction becomes completely irrelevant and the two might as well be merged into one.

Yes, but is making Phobos pay-as-you-go a real possibility?  I don't see it that way because all of Phobos has been developed under the assumption that all language features are implemented and available.  utiliD would be usable in an environment where only a subset of D's language features are available.

Also, Phobos has been developed under the assumption that any module in Phobos or druntime can be utilized as a dependency in any other module.  That has created a dependency mess in Phobos and I don't see how that can be disentangled without breaking everyone's code.  Furthermore, there is no clear hierarchy in Phobos where it is clear at the API level what language features are required for each module/function/whatever.  With utiliD, it is much clearer where the line is drawn in the hierarchy of language features.  Phobos will never be pay-as-you-go if you can't see what you're paying for as you go.

> See, this trouble is caused by the artificial boundary between Phobos and druntime.  We should look into breaking down this barrier, not enforcing it.

I agree.  We could actually merge druntime and Phobos into a single library today.  I also find the divide between Phobos and druntime artificial, but my goal with utiliD is different. I'm trying to create a library that does not require runtime language features.  I'm not proposing another artificial division like the one that currently exists.  I'm trying to build something equivalent to a stack, where you start at a very low level (utiliD) and add layers of increasing capability.  That's not what we have with Phobos and druntime today.

> I think the logical goal is to make Phobos completely pay-as-you-go. IOW, an actual *library*, as opposed to a tangled hairball of dependencies that always comes with strings attached (can't import one small thing without pulling in the rest of the hairball). A library is supposed to be a set of resources which you can draw from as needed. Pulling out one book (module) should not require pulling out half the library along with it.

I agree, but that hairball is exactly what Phobos is right now. I don't see any way to start from that mess and achieve the pay-as-you-go opt-in continuum.  In a way, I'm starting over with utiliD, but I believe there is still value in druntime and Phobos that can be salvaged to start building an opt-in, pay-as-you-go stack of increasing features, sophistication, and capability in D, where you know, by what you're importing, what you're getting and what it costs.

Mike


May 20, 2019
On Friday, 10 May 2019 at 23:51:56 UTC, H. S. Teoh wrote:
> Libc implementations of fundamental operations, esp. memcpy, are usually optimized to next week and back for the target architecture, taking advantage of the target arch's quirks to maximize performance.
Yeah about that...
Level1 Diagnostic: Fixing our Memcpy Troubles (for Looking Glass)
https://www.youtube.com/watch?v=idauoNVwWYE