June 04, 2019
On 6/3/2019 3:45 PM, Andrei Alexandrescu wrote:
> (2) is quite specious and really needs some evidence. Is cruft in memcpy really an issue? I looked at memcpy() implementations a while ago but didn't save bookmarks. Did a google search just now and found https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, which is very far from cruft-ridden. I do remember elaborate implementations of memcpy, but so are (somewhat ironically) the 512 lines of the proposed implementation. I found one here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

And here:

https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
June 05, 2019
On Tuesday, 4 June 2019 at 22:18:20 UTC, Walter Bright wrote:
>
> And here:
>
> https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM

Please consider again that this is not the version we're trying to compete with. Mike posted this link: https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=14ec2285c0f82b570bf872c5b9ff0a7f25724dfd;hb=HEAD

This looks like the one that is called. Consider also the other link Mike posted:
https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h

Although I did not have time to benchmark, my guess is that this one, which comes from Intel, is still not enough to beat libc.
June 05, 2019
On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis wrote:

> This looks like the one that is called. Consider also the other link Mike posted:
> https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
>
> Although I did not have time to benchmark, my guess is that this one, which comes from Intel, is still not enough to beat libc.

I benchmarked the older rte_memcpy here (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h) on a Linux virtual machine, and rte_memcpy was quite a bit faster than libc.  It's worth a deeper look.

Mike
June 05, 2019
On Wednesday, 5 June 2019 at 01:14:26 UTC, Mike Franklin wrote:
> On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis
>
> I benchmarked the older rte_memcpy here (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h) on a Linux virtual machine, and rte_memcpy was quite a bit faster than libc.  It's worth a deeper look.
>
> Mike

Our Dmemcpy is faster than libc on a Linux virtual machine too. :p

But yes, again, take what I said with a grain of salt; it was just an assumption. It certainly deserves deeper analysis.

June 05, 2019
On Wednesday, 5 June 2019 at 01:21:20 UTC, Stefanos Baziotis wrote:
> On Wednesday, 5 June 2019 at 01:14:26 UTC, Mike Franklin wrote:
>> On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis
>>
>> I benchmarked the older rte_memcpy here (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h) on a Linux virtual machine, and rte_memcpy was quite a bit faster than libc.  It's worth a deeper look.
>>
>> Mike
>
> Our Dmemcpy is faster than libc on a Linux virtual machine too. :p
>
> But yes, again, take what I said with a grain of salt; it was just an assumption. It certainly deserves deeper analysis.

How did you compile the code? GCC and Clang both target baseline x86-64 by default; to use features like AVX2 you have to enable them explicitly. That of course means not all CPUs will be able to run the code, though it will run faster on those that do.

I'd say this should include ARM as well, but there's one D compiler that doesn't support it so...
June 05, 2019
On Wednesday, 5 June 2019 at 03:00:13 UTC, Exil wrote:

> How did you compile the code? GCC and Clang both target baseline x86-64 by default; to use features like AVX2 you have to enable them explicitly. That of course means not all CPUs will be able to run the code, though it will run faster on those that do.

If you're referring to the rte_memcpy file, I compiled it with -march=native.

June 05, 2019
On Tuesday, 4 June 2019 at 08:31:54 UTC, KnightMare wrote:
> TL;DR
> Should we attend to WASM, where there are no system things (mmap, allocators) and memory is an array of ints?

If any of the work from this project gets merged into druntime (which, it appears, will be an uphill battle), it should be easier to port druntime to new platforms like WASM.  That is one of the motivations:  to reduce the dependency on libc to a platform implementation detail that any platform can override/reimplement/supplement as needed without impacting any other platform.

Mike
June 28, 2019
On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
> The goal of this project is to remove the dependency of the D Runtime from the C Standard Library.

An update regarding the project. There was a lot of turbulence in this project, so I'm sorry I did not post earlier.

Previous month
==============

This month the goals were replacements for memcpy(), memmove() and memset(), named
Dmemcpy, Dmemmove and Dmemset. Dmemcpy and Dmemmove are merged into one repo [1],
and Dmemset is in another [2].

The goal was to create fast versions of those, targeted at x86_64 and DMD.
Because of that, and because of the blockers (more on those later), there is some inline ASM in those implementations.
There has been an effort to minimize it (currently only in Dmemcpy),
because I was informed that pure D should be the first priority.

In the last week there was an effort to create a test suite and a benchmark suite
for these repos. Quoting Mike and Johannes:

# Make sure the implementation works for all kinds of D types (basic types, structs, classes, static arrays, and dynamic arrays)
  * Add naive implementations for now to fill the gaps.

// NOTE(stefanos): Meaning, when x86 is not available, or in any case where my code
// cannot be compiled for the target, there should be a minimal pure D fallback implementation.
// NOTE(stefanos): Classes are not tested, more on that on the Blockers.

# Separate benchmarks from tests
Anyone visiting the repository should be able to clone it and do something like `run tests` and `run benchmarks`.
  2.  Create a `run.d` file, a `tests.d` file and a `benchmarks.d` file
  3.  When the user executes `rdmd run.d tests` it should compile the `tests.d` file and execute it producing a test report.
  4.  When the user executes `rdmd run.d benchmarks` it should compile `benchmarks.d`, execute it producing a benchmark report.

// NOTE(stefanos): I'm relatively satisfied with Dmemset. Dmemmove got better over the last 3
// days, but it probably still needs review / more work.

#  Use the `tests.d` file to implement a thorough test suite for each repository including edge cases.
  * It should test each kind of type (basic types, structs, classes, static arrays, and dynamic arrays). // NOTE(stefanos): Again, for the classes refer to the Blockers.
  * Where relevant it should include a test of all interesting sizes.
  * Where relevant, it should test all variations of alignment up to 32.  This includes aligned-src & aligned-dst, unaligned-src & unaligned-dst, aligned-src & unaligned-dst, and unaligned-src & aligned-dst.  A nested foreach loop (e.g. `foreach (srcOffset; alignments) { foreach (dstOffset; alignments) { ... } }`) should cover it.

// NOTE(stefanos): This is not done exactly as proposed here. I had my own variation
// for alignment testing, and this alternative was still to be considered. Both my own
// and this one still need review.

  * For memmove it should test all variations of overlap:  no overlap, exact overlap, source leading destination, destination leading source, etc...
  * Make sure each repository passes the test suite
  * Make sure the tests are easily comprehensible.  Keep them simple so any visitor to the repository can easily verify that the test suite is thorough.
  * Be sure the tests cover all implementations.

#  Use the `benchmarks.d` file to create a benchmark suite for each repository
  * Benchmark all sizes from at least 0~512 (preferably up to 1024).  After 1024 exponentially increasing sizes up to at least 65536.  They do not need to be powers of 2; consider even powers of 10 so it is easy to graph on a logarithmic scale.  An average of alignments is good for an overview, but the user should also be able to pick a single size and see how it performs for all variations of alignments.

// NOTE(stefanos): I don't test that many sizes in the experimental branch since the compile
// time explodes, to the point that it freezes Visual Studio.
// But I should have added a logarithmic scale; that was an oversight.

  * Be sure the benchmark is thorough enough to cover all implementations.

There is of course a lot to be said about the actual implementations and the decisions
taken, but I guess the post would get very long, so I decided to focus on the final goals
and on the blockers. Please feel free to ask more specific questions about the implementations.

[1] https://github.com/baziotis/Dmemmove/tree/experimental - experimental branch
[2] https://github.com/baziotis/Dmemset
June 28, 2019
On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
> // NOTE(stefanos): Classes are not tested, more on that on the Blockers.
>

=== Blockers ===

-- Blocker 1 - DMD --

The main blocker was that the project was targeted at DMD. The main problems are:
- The optimizer is limited.
- The generated code is often unpredictable (at least to me),
both in terms of performance and comprehensibility.
- Inline ASM can't be interleaved with pure D.

I want to stress that when writing such performance-sensitive utilities, the language
is used as a tool to generate the ASM more efficiently (and with fewer errors) than
writing it yourself. This is a subjective opinion, but I guess most people
who have worked on such utilities will agree.
This is why these utilities are written either in ASM or in a language that is low-level
enough, with a good enough optimizer, to let the author stay in that higher-level language.

Now, I picked inline ASM as my preference because with pure D and DMD there was:
- Poor debuggability. When the ASM is not hand-written, it is not as easily comprehensible.
To justify that sacrifice, the ASM generated by the compiler has to be predictable, which for me it wasn't.

- Poor tuning. One should not fight the optimizer. If I expect an optimization to be done
and it's not, then that's a problem.

- Poor scalability. If someone comes after me and tries to optimize further, I might have created more problems with pure D than I solved. For example, if I were that person, compiled the code, and found an unexpected load inside a loop that I couldn't get around by transforming the code, that would be a problem.
Basically, if we go the pure-whatever-language-we-choose route, we must never, in the future, say "It would have been better to write it in ASM from the start". And my prediction was that that would be the case.

I can be a lot more specific about the reasons behind my choice of inline ASM, so feel free to ask.

Don't get me wrong, DMD is pretty good, but at least I could not get it to the level
of hand-written ASM.
I should say that the inline ASM I'm talking about is being minimized / removed and replaced with pure D, for various reasons.

-- Blocker 2 - Test suite --

This month I was working with a test suite that I had not examined carefully.
That was certainly my biggest mistake so far, because that test suite was not good.
When I was advised to make a new test suite, the new one revealed serious bugs in the code. That was both good and bad. The good part was that I now had the chance to think
hard about the test suite, and of course that the bugs were revealed.
The bad part was that Dmemcpy and Dmemmove had to be almost completely remade in 3 days.
It was done, but it was a serious blocker.

In that time, problems with Windows were revealed (specifically, with the calling convention),
which were also solved, but that cost a lot of time as well.

-- Blocker 3 - Classes --

The problem with classes is that the compiler is documented as being free to change the layout
of the fields in a class / struct. Even if the two hidden fields
(vptr and monitor) always remain at the start, it still seems hacky to take the class
pointer, move forward 16 bytes, and start the operations there (and the 16 bytes is not a constant, because the pointer size varies across platforms). So we decided
to leave it for now.
My guess is that classes will probably never be used directly in such low-level code.

-- Blocker 4 - SIMD intrinsics --

When I started writing Dmemset, I decided to go pure-D first. In that effort, there
were 2 ASM instructions that I tried to get working for about 4 hours. The ASM
instructions are:
        movd    XMM0, ESI;
        pshufd  XMM0, XMM0, 0;

I don't know if more details on what I tried would help, but if anyone has an idea, please let me know.
June 28, 2019
On Friday, 28 June 2019 at 12:14:13 UTC, Stefanos Baziotis wrote:
> On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
>> // NOTE(stefanos): Classes are not tested, more on that on the Blockers.
>>
>
> === Blockers ===
>
> -- Blocker 1 - DMD --
>
> The main blocker was that the project was targeted at DMD. The main problems are:
> - The optimizer is limited.
> - The generated code is often unpredictable (at least to me),
> both in terms of performance and comprehensibility.
> - Inline ASM can't be interleaved with pure D.
>
> I want to stress that when writing such performance-sensitive utilities, the language
> is used as a tool to generate the ASM more efficiently (and with fewer errors) than
> writing it yourself. This is a subjective opinion, but I guess most people
> who have worked on such utilities will agree.
> This is why these utilities are written either in ASM or in a language that is low-level
> enough, with a good enough optimizer, to let the author stay in that higher-level language.
>
> Now, I picked inline ASM as my preference because with pure D and DMD there was:
> - Poor debuggability. When the ASM is not hand-written, it is not as easily comprehensible.
> To justify that sacrifice, the ASM generated by the compiler has to be predictable, which for me it wasn't.
>
> - Poor tuning. One should not fight the optimizer. If I expect an optimization to be done
> and it's not, then that's a problem.
>
> - Poor scalability. If someone comes after me and tries to optimize further, I might have created more problems with pure D than I solved. For example, if I were that person, compiled the code, and found an unexpected load inside a loop that I couldn't get around by transforming the code, that would be a problem.
> Basically, if we go the pure-whatever-language-we-choose route, we must never, in the future, say "It would have been better to write it in ASM from the start". And my prediction was that that would be the case.
>
> I can be a lot more specific about the reasons behind my choice of inline ASM, so feel free to ask.
>
> Don't get me wrong, DMD is pretty good, but at least I could not get it to the level
> of hand-written ASM.
> I should say that the inline ASM I'm talking about is being minimized / removed and replaced with pure D, for various reasons.

Inline asm is generally very bad for the optimiser because it can have arbitrary side effects and is completely opaque. It is possible to generate the asm with string mixins; see e.g. the BigInt routines in Phobos.

You should test your work with LDC at some point, since it has an optimiser worth using, but note the point above about opaque inline ASM hurting performance.

> -- Blocker 2 - Test suite --
>
> This month I was working with a test suite that I had not examined carefully.
> That was certainly my biggest mistake so far, because that test suite was not good.
> When I was advised to make a new test suite, the new one revealed serious bugs in the code. That was both good and bad. The good part was that I now had the chance to think
> hard about the test suite, and of course that the bugs were revealed.
> The bad part was that Dmemcpy and Dmemmove had to be almost completely remade in 3 days.
> It was done, but it was a serious blocker.
>
> In that time, problems with Windows were revealed (specifically, with the calling convention),
> which were also solved, but that cost a lot of time as well.
>
> -- Blocker 3 - Classes --
>
> The problem with classes is that the compiler is documented as being free to change the layout
> of the fields in a class / struct. Even if the two hidden fields
> (vptr and monitor) always remain at the start, it still seems hacky to take the class
> pointer, move forward 16 bytes, and start the operations there (and the 16 bytes is not a constant, because the pointer size varies across platforms). So we decided
> to leave it for now.
> My guess is that classes will probably never be used directly in such low-level code.

You should be able to get the offset of the first member with

int foo()
{
    static class A { int a; }
    return A.init.a.offsetof;
}

which will apply to any other non-nested class.

>
> -- Blocker 4 - SIMD intrinsics --
>
> When I started writing Dmemset, I decided to go pure-D first. In that effort, there
> were 2 ASM instructions that I tried to get working for about 4 hours. The ASM
> instructions are:
>         movd    XMM0, ESI;
>         pshufd  XMM0, XMM0, 0;
>
> I don't know if more details on what I tried would help, but if anyone has an idea, please let me know.

Take a look at https://github.com/AuburnSounds/intel-intrinsics

Keep up the good work!