January 09, 2019
On Wednesday, 9 January 2019 at 17:40:38 UTC, H. S. Teoh wrote:
> [snip]
>
> EXACTLY!!!
>
> Some time ago I took an interest in implementing the equivalent of strchr in the most optimized way possible. For that, I wrote several of my own algorithms and also perused the glibc implementation.
>
> Eventually, I realized that the glibc implementation, which uses fancy 64-bit-word scanning with a lot of setup overhead and messy starting/trailing cases, is optimizing for very large scans, i.e., when the byte being sought occurs only rarely in a very large haystack.  In those cases it's at the top of benchmarks.  However, in the arguably more common case where the byte being sought occurs relatively frequently in small- to medium-sized haystacks, repeatedly searching the haystack incurs a ton of overhead setting up all that fancy machinery, branch hazards, and what-not, where a plain ole `while (*ptr++ != needle) {}` works much better.
>
> I suspect many of the C library functions of this sort (incl. memcpy + friends) have a tendency to suffer from this sort of premature optimization.
>
> Not to mention that often overly-specialized benchmarks of this sort fail to account for bias caused by the CPU's branch predictor learning the benchmark and the cache hierarchy amortizing the cost of repeatedly searching the same haystack -- things you rarely do in real-life applications.  There's a big risk of your "super-optimized" algorithm ending up optimizing for an unrealistic use-case, but having only mediocre or sometimes even poor performance in real-world computations.
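Spelled out with the terminating-NUL check that strchr also needs, that plain loop is roughly (hypothetical name, a C++ sketch, not glibc's code):

```cpp
// Hypothetical naive strchr: scan one byte at a time with no setup
// cost, no alignment fixups, and no word-at-a-time machinery.
const char* naive_strchr(const char* haystack, char needle) {
    for (;; ++haystack) {
        if (*haystack == needle)
            return haystack;   // found (also matches needle == '\0')
        if (*haystack == '\0')
            return nullptr;    // hit the terminator without a match
    }
}
```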


One thing I like about libmir's sum function
http://docs.algorithm.dlang.io/latest/mir_math_sum.html
was that the algorithm you use to return the sum can be chosen with an enum on the template. So it's really a collection of different sum algorithms all in one. Set the default as something reasonable and then let the user decide if they want something else.
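Sketched in C++ (hypothetical names; D's enum template parameters work much the same way as C++ non-type template parameters), the pattern is roughly:

```cpp
#include <cstddef>

// Hypothetical sketch of mir-style algorithm selection: the summation
// strategy is a template enum parameter with a reasonable default.
enum class SumAlgo { naive, kahan };

template <SumAlgo algo = SumAlgo::naive>
double sum(const double* p, std::size_t n) {
    if constexpr (algo == SumAlgo::kahan) {
        double s = 0.0, c = 0.0;  // Kahan compensated summation
        for (std::size_t i = 0; i < n; ++i) {
            double y = p[i] - c;
            double t = s + y;
            c = (t - s) - y;
            s = t;
        }
        return s;
    } else {
        double s = 0.0;           // plain left-to-right sum
        for (std::size_t i = 0; i < n; ++i)
            s += p[i];
        return s;
    }
}
```

Callers who don't care get the default; callers who do write `sum<SumAlgo::kahan>(...)` and nothing else changes.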

January 09, 2019
On 2019-01-09 12:49, Mike Franklin wrote:

> In DMD you can't use it without linking in the runtime, but in LDC and GDC, you can.  One of the goals of implementing these runtime hooks as templates is to make more features available in -betterC builds, or for pay-as-you-go runtime implementations. If you need to link in druntime to get `alloca`, you can't implement the runtime hooks as templates and have them work in -betterC.

Ah, I see.
> Yes, it's possible, but I don't think it will ever be accepted if it doesn't perform at least as well as the optimized versions in C or assembly that use AVX512 or other SIMD features.  It needs to be at least as good as what libc provides, so we need to be able to leverage these unique hardware features to get the best performance.

Perhaps it could be considered as a fallback when a "memcpy" isn't available.

-- 
/Jacob Carlborg
January 09, 2019
On Wed, Jan 09, 2019 at 06:55:30PM +0000, jmh530 via Digitalmars-d wrote: [...]
> One thing I like about libmir's sum function
> http://docs.algorithm.dlang.io/latest/mir_math_sum.html
> was that the algorithm you use to return the sum can be chosen with an
> enum on the template. So it's really a collection of different sum
> algorithms all in one. Set the default as something reasonable and
> then let the user decide if they want something else.

That's an excellent idea.  Have a generic default algorithm that performs reasonably well in typical use cases, but also give the user the power to choose a different algorithm if he knows that it would work better with his particular use case.

Empowering the user -- over time I've come to learn that this is always the best approach to API design.  It's one that has the best chance of standing the test of time.  Fancy APIs that don't pay enough attention to this principle tend to eventually fade into obscurity.


T

-- 
I am Ohm of Borg. Resistance is voltage over current.
January 10, 2019
On Wednesday, 9 January 2019 at 12:31:13 UTC, Patrick Schluter wrote:

> AVX512 concerns only a very small part of processors on the market (Skylake, Cannon Lake, and Cascade Lake). AMD will never implement it, and the number of people upgrading to one of the Lake CPUs from some recent chip is also not that great.

Yes, I agree, and even the newer chips have "Enhanced REP MOVSB and STOSB operation (ERMSB)" which can compensate.  See section 3.7.6 of https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

> I don't see why not having it implemented yet is blocking anything. People who really need AVX512 performance will have implemented memcpy themselves already and for the others, they will have to wait a little bit. It's not as if it couldn't be added later. I really don't understand the problem.

I remember analyzing other implementations of `memcpy` and they were all using AVX512.  I had faith in the authors of those implementations (e.g. Agner Fog) that they knew more than me, so that was what I should be using. Perhaps I should revisit it and just do the best that DMD can do.

But also keep in mind that there's a strategy to getting things accepted in DMD and elsewhere.  You are often battling perception.  The single most challenging aspect of implementing `memcpy` in D is overcoming bias and justifying it to the obstructionists that see it as a complete waste of time.  If I can't implement it in AVX512 simply for the purpose of measurement and comparison, it will be more difficult to justify.

> This said, another issue with memcpy that very often gets lost is that, because of the fancy benchmarking, its system performance cost is often wrongly assessed, and a lot of heroic efforts are put in optimizing big block transfers, while in reality it's mostly called on small (postblit) to medium blocks. Linus Torvalds had once a rant on that subject on realworldtech.
> https://www.realworldtech.com/forum/?threadid=168200&curpostid=168589

I understand.  I also encountered a lot of difficulty getting consistent measurements in my exploration.  Doing proper measurement and analysis for this kind of thing is a skill in and of itself.

You're right about the small copies being the norm.  As part of my exploration, I wrote a logging `memcpy` wrapper to see what kind of copies DMD was doing when it compiled itself, and it was as you describe.
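The wrapper itself needs nothing fancy; a hypothetical sketch (my actual one differed) that buckets copy sizes might look like:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical logging wrapper: forward to the real memcpy while
// bucketing copy sizes, to see what a workload actually does.
static std::size_t g_copy_buckets[4];  // <16, <256, <4096, larger

void* logging_memcpy(void* dst, const void* src, std::size_t n) {
    std::size_t b = n < 16 ? 0 : n < 256 ? 1 : n < 4096 ? 2 : 3;
    ++g_copy_buckets[b];
    return std::memcpy(dst, src, n);
}
```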

Perhaps I'll give it another go at a later time, but we need to get dynamic stack allocation working first because many of the runtime hook implementations that will utilize `memcpy` do some error checking and assertions, and we need to be able to generate dynamic error messages for those assertions when the caller is `pure`.  We need a solution to this (https://issues.dlang.org/show_bug.cgi?id=18788) first.

Mike


January 10, 2019
On Wednesday, 9 January 2019 at 19:24:28 UTC, Jacob Carlborg wrote:

>> Yes, it's possible, but I don't think it will ever be accepted if it doesn't perform at least as well as the optimized versions in C or assembly that use AVX512 or other SIMD features.  It needs to be at least as good as what libc provides, so we need to be able to leverage these unique hardware features to get the best performance.
>
> Perhaps it could be considered as a fallback when a "memcpy" isn't available.

I'm not sure what you mean. DMD currently links in libc, so `memcpy` is always available.

Also, it's difficult for me to articulate, but we don't want `void* memcpy(void* destination, const void* source, size_t num)` rewritten in D.  We need `void memcpy(T)(T* destination, const T* source)` or some other strongly typed template like that. And as an aside, due to https://github.com/dlang/dmd/pull/8504 we now have to be careful about the order of arguments.
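For illustration only, a hypothetical C++ analogue of such a strongly typed template (the D version would look analogous):

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical strongly typed copy: the element size is implied by T,
// so there is no runtime size argument to pass (or to accidentally
// swap with a pointer), and the type system can reject bad copies.
template <typename T>
void typed_copy(T* destination, const T* source) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "bitwise copy requires a trivially copyable type");
    std::memcpy(destination, source, sizeof(T));
}
```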

Anyway, I'm not sure there's much point in hashing this out right now.  We need dynamic stack allocation first before any of this can happen because the runtime hooks need to be able to generate dynamic assertion messages in -betterC, and there's only one person I know of that can do that (Walter), and I don't think it's a priority for him right now.

Mike
January 10, 2019
On Wednesday, 9 January 2019 at 19:25:35 UTC, H. S. Teoh wrote:

> That's an excellent idea.  Have a generic default algorithm that performs reasonably well in typical use cases, but also give the user the power to choose a different algorithm if he knows that it would work better with his particular use case.
>
> Empowering the user -- over time I've come to learn that this is always the best approach to API design.  It's one that has the best chance of standing the test of time.  Fancy APIs that don't pay enough attention to this principle tend to eventually fade into obscurity.

Yes, this is one of the benefits of making `memcpy(T)(T* dest, T* src)` instead of `memcpy(void* dest, void* src, size_t num)`.  One can generate a `memcpy` at compile-time that is optimized to the machine that the program is being compiled on (or for).  druntime could expose "memcpy configuration settings" for users to tune at compile-time.

But, then you have to deal with distribution of binaries.  If you are compiling a binary that you want to be able to run on all Intel 64-bit PCs, for example, you can't do that tuning at compile-time; it has to be done at runtime.  Assuming my understanding is correct, Agner Fog's implementation sets a function pointer to the most optimized implementation for the machine the program is running on, based on an inspection of the CPU's capabilities at the first invocation of `memcpy`.
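That first-call dispatch can be sketched like this (hypothetical names, assuming GCC/Clang's `__builtin_cpu_supports`; both targets here are stand-ins that defer to the libc memcpy, where a real implementation would have different bodies):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical first-call dispatch: the initial call inspects the CPU
// and repoints the function pointer, so later calls pay only an
// indirect call.
using memcpy_fn = void* (*)(void*, const void*, std::size_t);

static void* memcpy_baseline(void* d, const void* s, std::size_t n) {
    return std::memcpy(d, s, n);  // stand-in for the portable version
}

static void* memcpy_avx2(void* d, const void* s, std::size_t n) {
    return std::memcpy(d, s, n);  // stand-in for the AVX2 version
}

static void* memcpy_resolve(void* d, const void* s, std::size_t n);
static memcpy_fn g_memcpy = memcpy_resolve;

static void* memcpy_resolve(void* d, const void* s, std::size_t n) {
    // GCC/Clang builtin; a portable version would use CPUID directly.
    g_memcpy = __builtin_cpu_supports("avx2") ? memcpy_avx2
                                              : memcpy_baseline;
    return g_memcpy(d, s, n);
}
```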

There's a lot of things like this to consider in order to create a professional `memcpy` implementation.

Personally, I'd just like to put the infrastructure in place so those more talented than I can tune it. But as I said before, that first PR that puts said infrastructure in place needs to be justified, and I predict it will be difficult to overcome bias and perception.  Reading the comments in this thread fills me with a little more optimism that I'm not the only one who thinks it's a good idea. But, we still need dynamic stack allocation first before any of this can happen.

Mike

January 10, 2019
On Thursday, 10 January 2019 at 00:10:18 UTC, Mike Franklin wrote:
> I remember analyzing other implementations of `memcpy` and they were all using AVX512.  I had faith in the authors of those implementations (e.g. Agner Fog) that they knew more than me, so that was what I should be using. Perhaps I should revisit it and just do the best that DMD can do.

AVX512 is a superset of AVX2, which is a superset of AVX, which is a superset of SSE. I expect the implementations you were looking at are actually implemented in SSE, where SSE2 is a baseline expectation for x64 processors.

I've done some AVX2 code recently with 256-bit values. The performance is significantly slower on AMD processors; I assume their pipeline is internally still 128 bits wide as a result. And while my 256-bit code can run faster on Intel, it also needs to run on AMD, so I've dropped to 128-bit instructions at most - effectively keeping my code SSE4.1 compatible.

I've done a memset_pattern4[1] implementation in SSE previously. The important instruction group is _mm_stream. Which, you will note, was an instruction group first introduced in SSE1 and hasn't had additional writing stream functions added since SSE 4.1[2].

[1] https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/memset_pattern4.3.html
[2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5119,5452,5443,5910,5288,5119,5249,5231&text=_mm_stream
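For illustration, a hypothetical SSE2 sketch of that kind of pattern fill using non-temporal stores (assumes x86, a 16-byte-aligned destination, and a length that is a multiple of 16; not the implementation described above):

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2: _mm_stream_si128 (also pulls in _mm_sfence)

// Hypothetical memset_pattern4-style fill: broadcast a 4-byte pattern
// into a 128-bit register and write it with streaming stores.
void memset_pattern4_stream(void* dst, const void* pattern4, std::size_t len) {
    std::uint32_t p;
    std::memcpy(&p, pattern4, 4);                     // load the 4-byte pattern
    __m128i v = _mm_set1_epi32(static_cast<int>(p));  // broadcast to 128 bits
    __m128i* out = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(out + i, v);                 // non-temporal store
    _mm_sfence();                                     // order the streaming stores
}
```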
January 10, 2019
On Thursday, 10 January 2019 at 10:13:57 UTC, Ethan wrote:
> I've done a memset_pattern4[1] implementation in SSE previously. The important instruction group is _mm_stream. Which, you will note, was an instruction group first introduced in SSE1 and hasn't had additional writing stream functions added since SSE 4.1[2].

Where's the edit button? The last writing stream function was added in SSE2; a streaming load was added in SSE 4.1. I believe I used that load when optimising string compares.


January 10, 2019
On Thursday, 10 January 2019 at 10:13:57 UTC, Ethan wrote:
> On Thursday, 10 January 2019 at 00:10:18 UTC, Mike Franklin wrote:
>> I remember analyzing other implementations of `memcpy` and they were all using AVX512.  I had faith in the authors of those implementations (e.g. Agner Fog) that they knew more than me, so that was what I should be using. Perhaps I should revisit it and just do the best that DMD can do.
>
> AVX512 is a superset of AVX2, which is a superset of AVX, which is a superset of SSE. I expect the implementations you were looking at are actually implemented in SSE, where SSE2 is a baseline expectation for x64 processors.
>
> I've done some AVX2 code recently with 256-bit values. The performance is significantly slower on AMD processors. I assume their pipeline internally is still 128 bit as a result,
> and while my 256-bit code can run faster on Intel it needs to run on AMD so I've dropped to 128-bit instructions at most - effectively keeping my code SSE4.1 compatible.
>
> I've done a memset_pattern4[1] implementation in SSE previously. The important instruction group is _mm_stream. Which, you will note, was an instruction group first introduced in SSE1 and hasn't had additional writing stream functions added since SSE 4.1[2].
>
> [1] https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/memset_pattern4.3.html
> [2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5119,5452,5443,5910,5288,5119,5249,5231&text=_mm_stream

That's disappointing to learn. Ryzen has four 128-bit AVX units: two of them can only do addition and the other two can only do multiplication. I'm not sure how data is shared between the units, but if it isn't, it would need a copy to do an addition followed by a multiplication.
January 11, 2019
On Thursday, 10 January 2019 at 21:01:09 UTC, luckoverthere wrote:
> That's disappointing to learn. Ryzen has four 128-bit AVX units: two of them can only do addition and the other two can only do multiplication. I'm not sure how data is shared between the units, but if it isn't, it would need a copy to do an addition followed by a multiplication.

The good news though is that Ryzen's 128-bit pipeline outperforms my Skylake i7 with this code. So you could say they've optimised for the majority use case.

It's reaaaaaally beneficial to do 256-bit logic for my particular use case here since I'm sampling and operating on 8 32-bit values at a time to produce a 32-bit output. But eh, I've gotta write for the build farm hardware.