January 08, 2019
On 1/8/19 4:23 PM, kinke wrote:
> On Tuesday, 8 January 2019 at 01:44:08 UTC, Mike Franklin wrote:
>> Anyway, my suggestion is to create a new library separate from druntime and phobos that has no dependencies whatsoever (no libc, no libstdc++, no OS dependencies, no druntime dependency, etc.).  I mean it; **no dependencies**.  Not even object.d. The only thing it should require is a D compiler.
>>
>> That library can then be imported by druntime, phobos, betterC builds, or even the compiler itself. It will take strict enforcement of the "no dependency" rule and good judgment to keep the scope from ballooning, but it may be a good place for things like `traits`, `meta` and others.
> 
> I also feel the need for at least one other base library. My focus is on the fundamental compiler support functions, like initializing/comparing/copying arrays and general associative array support, as they are fundamental to the language and its compilers (not talking about TypeInfos, ModuleInfos, Object etc.).

This is self-contradictory, as AA's require TypeInfo.

Though I agree with the goal. It's just not a "now" goal, we first need to fix these components so they DON'T depend on such things as TypeInfo.

-Steve
January 09, 2019
On Tuesday, 8 January 2019 at 21:26:51 UTC, Steven Schveighoffer wrote:

>> I also feel the need for at least one other base library. My focus is on the fundamental compiler support functions, like initializing/comparing/copying arrays and general associative array support, as they are fundamental to the language and its compilers (not talking about TypeInfos, ModuleInfos, Object etc.).
>
> This is self-contradictory, as AA's require TypeInfo.
>
> Though I agree with the goal. It's just not a "now" goal, we first need to fix these components so they DON'T depend on such things as TypeInfo.

Steven is right (as usual) here.  There has to be a serious effort to remove the dependency on runtime information that is available at compile-time.  I tried quite hard on that in 2017~2018, but I ran into all sorts of problems.
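
To make the difference concrete, here is a minimal sketch of the same array comparison done twice: once through `TypeInfo`, where the element type is only known at run time, and once as a template, where it is known at compile time. Both function names are made up for illustration and are not druntime's actual hooks.

```d
// Runtime-typed variant: the element type is erased to void[] and all
// element behaviour is looked up through TypeInfo at run time.
// (equalsViaTypeInfo is a hypothetical name, not a druntime symbol.)
bool equalsViaTypeInfo(const void[] a, const void[] b, const TypeInfo elem)
{
    if (a.length != b.length) // lengths are in bytes here
        return false;
    const size = elem.tsize;
    foreach (i; 0 .. a.length / size)
        if (!elem.equals(a.ptr + i * size, b.ptr + i * size))
            return false;
    return true;
}

// Compile-time-typed variant: no TypeInfo needed, and the compiler can
// infer attributes from the element type's own comparison.
// (equalsTemplated is likewise a hypothetical name.)
bool equalsTemplated(T)(const T[] a, const T[] b)
{
    if (a.length != b.length)
        return false;
    foreach (i; 0 .. a.length)
        if (a[i] != b[i])
            return false;
    return true;
}

// For int, the template instance is inferred @safe pure nothrow @nogc.
static assert(__traits(compiles,
    (const int[] x, const int[] y) @safe pure nothrow @nogc => equalsTemplated(x, y)));
```

The template version also works without `object.d` providing `TypeInfo` for the element type, which is the property the "no dependency" goal needs.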

Exhibit A:
We can set an array's length in `@safe`, `nothrow`, `pure` code. But it gets lowered to a runtime hook that is neither `@safe`, `nothrow`, nor `pure` (https://github.com/dlang/druntime/blob/e47a00bff935c3f079bb567a6ec97663ba384487/src/rt/lifetime.d#L1265).  In other words, the compiler-runtime interface is a lie.  So, if you try to rewrite that hook as a template to remove the dependency on `TypeInfo`, the template runs through the semantic phase of the compiler, now you have to be honest, and it doesn't compile. To prevent breakage, you then have to make all of the code that `_d_arraysetlengthT` calls `@safe`, `nothrow`, and `pure`, and you'll find that none of it compiles because the "turtles at the bottom" (i.e. `memcpy`, `malloc`, etc.) aren't `pure` or whatever attribute constraint you're trying to apply.
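
A small sketch of the mismatch, assuming a hypothetical templated replacement called `setLengthSketch` (not the real hook, and deliberately simplified: it always allocates a fresh block with `malloc` and never frees it). The first function compiles because the lowering is not attribute-checked; the template cannot be called from `pure` code because `malloc` is not `pure`.

```d
import core.stdc.stdlib : malloc;
import core.stdc.string : memcpy;

// Accepted by the compiler: `arr.length = n` is lowered to a runtime
// hook, and the call site is treated as @safe pure nothrow.
@safe pure nothrow void grow(ref int[] arr, size_t n)
{
    arr.length = n;
}

// Hypothetical template version of the hook. Attributes are inferred
// from the body, so the impure malloc call makes the instance impure.
T[] setLengthSketch(T)(T[] arr, size_t n)
{
    auto p = cast(T*) malloc(n * T.sizeof);
    if (p is null)
        return null;
    const keep = arr.length < n ? arr.length : n;
    if (keep)
        memcpy(p, arr.ptr, keep * T.sizeof);
    foreach (i; keep .. n)
        p[i] = T.init; // default-initialise the new tail
    return p[0 .. n];
}

// Callable from ordinary code...
static assert(__traits(compiles, (int[] a) => setLengthSketch(a, 4)));
// ...but not from pure code, which is exactly the breakage described.
static assert(!__traits(compiles, (int[] a) pure => setLengthSketch(a, 4)));
```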

Exhibit B:
I tried to convert `_d_arraycast` to a template in https://github.com/dlang/druntime/pull/2268 and ran into similar problems.  Some tried to help with a `pureMalloc` implementation in https://github.com/dlang/druntime/pull/2276, but that didn't go well either.  Walter responded with "Since realloc() free's memory, it cannot ever be considered pure."  Well, what the hell are we supposed to do then? IMO, having dynamic stack allocation for arrays and strings will help (https://issues.dlang.org/show_bug.cgi?id=18788).  GDC and LDC already provide this, but DMD's implementation is in druntime (https://github.com/dlang/druntime/blob/9a8edfb48e4842180c706ee26ebd8edb10be53f4/src/rt/alloca.d), so it requires linking in druntime, and now we're in a catch-22.  I asked Walter for help with this, as it is beyond my current skills, but he said he didn't have time.

Here's what I think will help:
1.  Get `alloca` or dynamic stack array allocation working.  This will help a lot because we won't have to reach for `malloc` and friends for simple allocations like generating dynamic assert messages
2.  Convert `memcpy`, `memset`, and `memcmp` to strongly-typed D templates so they can be used in the implementations when converting runtime hooks to templates.  I did some exploration on that and published my results at https://github.com/JinShil/memcpyD.  Unfortunately, DMD is missing an AVX512 implementation so I couldn't continue.
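
As a rough illustration of item 2, a strongly-typed copy might look like the sketch below. The name `memcpyTyped` and its signature are assumptions made for this example and are not the API of the memcpyD repository; there is no SIMD here, only the idea that size and alignment are known at compile time, so the strategy can be picked with `static if` and the function can honestly be `pure`.

```d
// Hypothetical typed copy: T.sizeof and T.alignof are compile-time
// constants, so the copy strategy is selected per instantiation.
void memcpyTyped(T)(ref T dst, ref const T src) @trusted pure nothrow @nogc
{
    auto d = cast(ubyte*) &dst;
    auto s = cast(const(ubyte)*) &src;
    static if (T.sizeof % size_t.sizeof == 0 && T.alignof >= size_t.alignof)
    {
        // Word-at-a-time copy; the loop bounds are compile-time constants.
        foreach (i; 0 .. T.sizeof / size_t.sizeof)
            (cast(size_t*) d)[i] = (cast(const(size_t)*) s)[i];
    }
    else
    {
        // Fallback for small or oddly sized types: byte-at-a-time copy.
        foreach (i; 0 .. T.sizeof)
            d[i] = s[i];
    }
}

// No libc call is involved, so the copy is usable from attribute-strict code.
static assert(__traits(compiles,
    () @safe pure nothrow @nogc { int a, b; memcpyTyped(a, b); }));
```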

Lots of obstacles here and I don't see it happening without Walter and Andrei making it a priority.

Mike


January 09, 2019
On Wed, 09 Jan 2019 02:32:50 +0000, Mike Franklin wrote:
> I tried to convert `_d_arraycast` to a template in https://github.com/dlang/druntime/pull/2268 and ran into similar problems.  Some tried to help with a `pureMalloc` implementation in https://github.com/dlang/druntime/pull/2276, but that didn't go well either.  Walter responded with "Since realloc() free's memory, it cannot ever be considered pure."  Well, what the hell are we supposed to do then?

The specific thing that he replied to was having a public symbol for realloc that was considered pure. Perhaps a private fakePureRealloc() would be more palatable?
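
Something along these lines is what I have in mind; a sketch only, using the usual function-pointer cast to assert purity. `fakePureRealloc` is a hypothetical module-private helper, and the caller takes on the responsibility that the impurity never becomes observable.

```d
import core.stdc.stdlib : free, realloc;

// Module-private wrapper that lies about purity in exactly one place.
// The cast changes only the function type's attributes; the call still
// goes to the C library's realloc.
private void* fakePureRealloc(void* p, size_t n) pure nothrow @nogc @system
{
    alias PureRealloc = extern (C) void* function(void*, size_t) pure nothrow @nogc;
    return (cast(PureRealloc) &realloc)(p, n);
}

// The wrapper is callable from pure code, unlike realloc itself.
static assert(__traits(compiles, (void* p) pure => fakePureRealloc(p, 16)));
static assert(!__traits(compiles, (void* p) pure => realloc(p, 16)));
```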
January 09, 2019
On Wednesday, 9 January 2019 at 03:32:17 UTC, Neia Neutuladh wrote:

> The specific thing that he replied to was having a public symbol for realloc that was considered pure. Perhaps a private fakePureRealloc() would be more palatable?

Perhaps; I'm not sure.  The `pureMalloc` implementation is a lot of clever hackery anyway, so I think it would be best to just implement stack-allocated dynamic arrays (i.e. https://issues.dlang.org/show_bug.cgi?id=18788) and avoid the games.  That would have solved the immediate need I had for converting runtime hooks to templates, and would help some of that work move forward.

Mike

January 09, 2019
On 2019-01-08 06:37, Mike Franklin wrote:

> I spent some time trying to think through some of the issues with druntime, and came up with this:
> 
> Right now, druntime is somewhat of a monolith trying to be too many things.
>    * utilities (traits, string utilities, type conversion utilities, etc...)
>    * compiler lowerings
>    * C standard library bindings
>    * C++ standard library bindings
>    * Operating system bindings
>    * OS abstractions (thread, fibers, context switching, etc...)
>    * DWARF implementation
>    * TLS implementation
>    * GC
>    * (probably more)
> 
> So, I suggest something like this:
> ----------------------------------
> * core.util - a.k.a. utiliD - Just utility implementations written in D (e.g. `std.traits`, `std.meta`, etc.). No dependencies whatsoever. No operating system or platform abstractions. No high-level language features (e.g. exceptions).
>      * public imports: (none)
>      * private imports: (none)
> 
> * core.stdc - C standard library bindings - libc functions verbatim; no convenience or utility implementations
>      * public imports: (none)
>      * private imports: core.util
> 
> * core.stdcpp - C++ standard library bindings - libstdc++ data structures verbatim; no convenience or utility implementations
>      * public imports: (none)
>      * private imports: core.util
> 
> * sys - OS/Platform bindings - operating system implementations verbatim; no convenience or utility implementations
>      * public imports: (none)
>      * private imports: core.util
> 
> * core.pal - Platform/OS abstractions - threads, fibers, context switching, etc.
>      * public imports: (none)
>      * private imports: core.util, sys, core.stdc
> 
> * core.d - compiler support (compiler lowerings, runtime initialization, TLS implementation, DWARF implementation, GC, etc...)
>      * public imports : core.util
>      * private imports : core.pal
> 
> * druntime - Just a top-level package containing public imports, aliases, and compiler support. No other implementations
>      * public imports: core.pal, core.d
>      * private imports: core.util
> 
> * std - phobos
>      * public imports: (none)
>      * private imports: druntime
> 
> There are likely other suitable ways to organize it, but that's just what I could come up with after thinking through it a little.
> 
> I would prefer if each of those were in their own repository and even move some of them to Deimos or dub, but that would probably irritate a lot of people.  I'd also prefer to have each of those in their own packages, but D is probably too deep in technical debt for that.  (See also https://issues.dlang.org/show_bug.cgi?id=11666)
> 
> So, to make it more palatable, I suggest:
> -----------------------------------------
>    * `core.util` gets own repository so it can be independently added to other repositories as a self-contained/freestanding dependency
> 
>    * `core.stdc`, `core.stdcpp`, `sys`, `core.pal`, and `core.d` all go into the druntime monolith like it is today.
> 
>    * phobos remains much like it is today.
> 
> In the context of the discussion at hand, `std.traits`, `std.meta`, and other utilities can be moved to `core.util`. `core.util` can then be added as a dependency to dmd, druntime, and phobos.  The rest will probably have to wait for D3 :/

I like this approach.

-- 
/Jacob Carlborg
January 09, 2019
On 2019-01-09 03:32, Mike Franklin wrote:

> Here's what I think will help:
> 1.  Get `alloca` or dynamic stack array allocation working.  This will help a lot because we won't have to reach for `malloc` and friends for simple allocations like generating dynamic assert messages

What's the problem with "alloca"?

> 2.  Convert `memcpy`, `memset`, and `memcmp` to strongly-typed D templates so they can be used in the implementations when converting runtime hooks to templates.  I did some exploration on that and published my results at https://github.com/JinShil/memcpyD.  Unfortunately, DMD is missing an AVX512 implementation so I couldn't continue.

What do you mean "couldn't continue"? It's possible to implement "memcpy" without AVX512. Am I missing something?

-- 
/Jacob Carlborg
January 09, 2019
On Wednesday, 9 January 2019 at 11:01:46 UTC, Jacob Carlborg wrote:
> On 2019-01-09 03:32, Mike Franklin wrote:
>
>> Here's what I think will help:
>> 1.  Get `alloca` or dynamic stack array allocation working.  This will help a lot because we won't have to reach for `malloc` and friends for simple allocations like generating dynamic assert messages
>
> What's the problem with "alloca"?

In DMD you can't use it without linking in the runtime, but in LDC and GDC, you can.  One of the goals of implementing these runtime hooks as templates is to make more features available in -betterC builds, or for pay-as-you-go runtime implementations.  If you need to link in druntime to get `alloca`, you can't implement the runtime hooks as templates and have them work in -betterC.
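
For reference, this is roughly the kind of use I mean: building a bounds-error message on the stack instead of reaching for `malloc`. It is only a sketch; `withBoundsMessage` and `MessageSink` are names invented for the example, and it goes through `core.stdc.stdlib.alloca`, which is the call that currently drags druntime in when building with DMD.

```d
import core.stdc.stdio : snprintf;
import core.stdc.stdlib : alloca;

// Hypothetical callback type that consumes the message before it goes away.
alias MessageSink = void delegate(const(char)[]) nothrow @nogc;

// Formats the message into stack memory and hands it to `sink`.
// The buffer lives in this function's frame, so the sink must not keep
// the slice after it returns.
void withBoundsMessage(size_t index, size_t length, scope MessageSink sink) nothrow @nogc
{
    enum fmt = "index %zu out of bounds for length %zu";
    // First pass: ask snprintf how many characters are needed.
    const needed = snprintf(null, 0, fmt, index, length);
    if (needed < 0)
        return;
    // Dynamic stack allocation: no heap, no GC, nothing to free.
    auto buf = cast(char*) alloca(needed + 1);
    snprintf(buf, needed + 1, fmt, index, length);
    sink(buf[0 .. needed]);
}
```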

>> 2.  Convert `memcpy`, `memset`, and `memcmp` to strongly-typed D templates so they can be used in the implementations when converting runtime hooks to templates.  I did some exploration on that and published my results at https://github.com/JinShil/memcpyD.  Unfortunately, DMD is missing an AVX512 implementation so I couldn't continue.
>
> What do you mean "couldn't continue"? It's possible to implement "memcpy" without AVX512. Am I missing something?

Yes, it's possible, but I don't think it will ever be accepted if it doesn't perform at least as well as the optimized versions in C or assembly that use AVX512 or other SIMD features.  It needs to be at least as good as what libc provides, so we need to be able to leverage these unique hardware features to get the best performance.

Mike
January 09, 2019
On Wednesday, 9 January 2019 at 11:49:40 UTC, Mike Franklin wrote:
> On Wednesday, 9 January 2019 at 11:01:46 UTC, Jacob Carlborg wrote:
>> On 2019-01-09 03:32, Mike Franklin wrote:
>>
>>> Here's what I think will help:
>>> 1.  Get `alloca` or dynamic stack array allocation working.  This will help a lot because we won't have to reach for `malloc` and friends for simple allocations like generating dynamic assert messages
>>
>> What's the problem with "alloca"?
>
> In DMD you can't use it without linking in the runtime, but in LDC and GDC, you can.  One of the goals of implementing these runtime hooks as templates is to make more features available in -betterC builds, or for pay-as-you-go runtime implementations.  If you need to link in druntime to get `alloca`, you can't implement the runtime hooks as templates and have them work in -betterC.
>
>>> 2.  Convert `memcpy`, `memset`, and `memcmp` to strongly-typed D templates so they can be used in the implementations when converting runtime hooks to templates.  I did some exploration on that and published my results at https://github.com/JinShil/memcpyD.  Unfortunately, DMD is missing an AVX512 implementation so I couldn't continue.
>>
>> What do you mean "couldn't continue"? It's possible to implement "memcpy" without AVX512. Am I missing something?
>
> Yes, it's possible, but I don't think it will ever be accepted if it doesn't perform at least as well as the optimized versions in C or assembly that use AVX512 or other SIMD features.  It needs to be at least as good as what libc provides, so we need to be able to leverage these unique hardware features to get the best performance.

AVX512 concerns only a very small part of the processors on the market (Skylake, Cannon Lake and Cascade Lake). AMD will never implement it, and the number of people upgrading to one of the Lake CPUs from some recent chip is also not that great.
I don't see why not having it implemented yet is blocking anything. People who really need AVX512 performance will have implemented memcpy themselves already and for the others, they will have to wait a little bit. It's not as if it couldn't be added later. I really don't understand the problem.
That said, another issue with memcpy that very often gets lost is that, because of the fancy benchmarking, its system performance cost is often wrongly assessed, and a lot of heroic effort is put into optimizing big block transfers, while in reality it's mostly called on small (postblit) to medium blocks. Linus Torvalds once had a rant on that subject on realworldtech.
https://www.realworldtech.com/forum/?threadid=168200&curpostid=168589



January 09, 2019
On Wednesday, 9 January 2019 at 12:31:13 UTC, Patrick Schluter wrote:
> On Wednesday, 9 January 2019 at 11:49:40 UTC, Mike Franklin wrote:
>> [...]
>
> AVX512 concerns only a very small part of the processors on the market (Skylake, Cannon Lake and Cascade Lake). AMD will never implement it, and the number of people upgrading to one of the Lake CPUs from some recent chip is also not that great.
> I don't see why not having it implemented yet is blocking anything. People who really need AVX512 performance will have implemented memcpy themselves already and for the others, they will have to wait a little bit. It's not as if it couldn't be added later. I really don't understand the problem.
> That said, another issue with memcpy that very often gets lost is that, because of the fancy benchmarking, its system performance cost is often wrongly assessed, and a lot of heroic effort is put into optimizing big block transfers, while in reality it's mostly called on small (postblit) to medium blocks. Linus Torvalds once had a rant on that subject on realworldtech.
> https://www.realworldtech.com/forum/?threadid=168200&curpostid=168589

From a quick read of these articles:
- https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/
- https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

it seems that using AVX512 can be worthwhile if you pin a thread to a core so that the core processes only AVX512 instructions.
January 09, 2019
On Wed, Jan 09, 2019 at 12:31:13PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
> This said, another issue with memcpy that very often gets lost is that, because of the fancy benchmarking, its system performance cost is often wrongly assessed, and a lot of heroic efforts are put in optimizing big block transfers, while in reality it's mostly called on small (postblit) to medium blocks.

EXACTLY!!!

Some time ago I took an interest in implementing the equivalent of strchr in the most optimized way possible. For that, I wrote several of my own algorithms and also perused the glibc implementation.

Eventually, I realized that the glibc implementation, which uses fancy 64-bit-word scanning with a lot of setup overhead and messy starting/trailing cases, is optimizing for very large scans, i.e., when the byte being sought occurs only rarely in a very large haystack.  In those cases it's at the top of benchmarks.  However, in the arguably more common case where the byte being sought occurs relatively frequently in small- to medium-sized haystacks, repeatedly searching the haystack incurs a ton of overhead setting up all that fancy machinery, branch hazards, and what-not, where a plain ole `while (*ptr++ != needle) {}` works much better.
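
To show what I mean by "setup overhead", here is a stripped-down sketch of both styles in D. This is not glibc's code; the function names are made up, and the word-wise version uses the well-known "has a zero byte" bit trick on 8-byte words. Note how much head/tail handling the second one needs before it gets to do any fast work.

```d
// Plain loop: no setup at all, one compare per byte.
// Returns hay.length when the needle is absent.
size_t indexOfSimple(const(ubyte)[] hay, ubyte needle) @safe pure nothrow @nogc
{
    foreach (i, b; hay)
        if (b == needle)
            return i;
    return hay.length;
}

// Word-at-a-time scan (hypothetical sketch of the "fancy" approach).
size_t indexOfWordwise(const(ubyte)[] hay, ubyte needle) @trusted pure nothrow @nogc
{
    enum ulong ones = 0x0101010101010101;
    enum ulong highs = 0x8080808080808080;
    size_t i = 0;
    // Head: step byte by byte until the address is 8-byte aligned.
    while (i < hay.length && ((cast(size_t)(hay.ptr + i)) & 7) != 0)
    {
        if (hay[i] == needle)
            return i;
        ++i;
    }
    // Body: XOR with the repeated needle turns matches into zero bytes;
    // (w - ones) & ~w & highs is nonzero iff some byte of w is zero.
    const pattern = ones * needle;
    for (; i + 8 <= hay.length; i += 8)
    {
        const w = *cast(const(ulong)*)(hay.ptr + i) ^ pattern;
        if (((w - ones) & ~w & highs) != 0)
            break;
    }
    // Tail: locate the exact byte in the matching word, or finish the
    // last few bytes that did not fill a whole word.
    for (; i < hay.length; ++i)
        if (hay[i] == needle)
            return i;
    return hay.length;
}
```

For a haystack of a few dozen bytes with frequent hits, the word-wise version may well spend most of its time in the head and tail loops, which are just the simple loop with extra branches.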

I suspect many of the C library functions of this sort (incl. memcpy + friends) have a tendency to suffer from this sort of premature optimization.

Not to mention that often overly-specialized benchmarks of this sort fail to account for bias caused by the CPU's branch predictor learning the benchmark and the cache hierarchy amortizing the cost of repeatedly searching the same haystack -- things you rarely do in real-life applications.  There's a big risk of your "super-optimized" algorithm ending up optimizing for an unrealistic use-case, but having only mediocre or sometimes even poor performance in real-world computations.


> Linus Torvalds once had a rant on that subject on realworldtech. https://www.realworldtech.com/forum/?threadid=168200&curpostid=168589

Nice.


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher