June 28, 2019
On Friday, 28 June 2019 at 12:33:16 UTC, Nicholas Wilson wrote:
> inline asm is generally very bad for the optimiser because it can have any side-effects and is completely opaque.

Exactly, that's the primary reason I mentioned that inline asm can't
be interleaved with D: performance. The compiler has
to be very conservative (more than one would expect), which means
that the only way to go is either pure D or full ASM, and in fact `naked`
ASM.

> It is possible to generate the asm with string mixins, see e.g. the BigInt routines in phobos.
>

I suppose you mean this: https://github.com/dlang/phobos/blob/master/std/bigint.d
From a quick look I'm not sure I fully understand the reason for the string mixins.
My understanding is that they are there for convenience (i.e. to construct the ASM programmatically instead of writing a million different versions) and not for performance reasons.
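
For what it's worth, here is a minimal sketch of the idea (not the Phobos code; the helper name and the toy operation are made up for illustration): a CTFE-able function builds the asm block as a string and `mixin` pastes it in, so one helper can cover many operand variants. This is DMD-style x86 inline asm, so it only compiles with DMD on x86/x86_64:

string genAdd(string x, string y, string dst)
{
    // Build a DMD-style inline asm block as a string at compile time.
    return "asm {
        mov EAX, " ~ x ~ ";
        add EAX, " ~ y ~ ";
        mov " ~ dst ~ ", EAX;
    }";
}

int addViaAsm(int a, int b)
{
    int result;
    mixin(genAdd("a", "b", "result")); // expands into a real asm block
    return result;
}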

> You should test your work with LDC at some point which has an optimiser worth using, but note the bit about opaque inline ASM hurting performance.
>

It is tested with LDC, but LDC was not a target for this project. Yes, inline ASM
is risky, as is pure D, for the reasons I said above. Maybe I should note explicitly
the risk of using only ASM as well, since I did so for pure D.
It's a matter of compromise.

>
> You should be able to get the offset of the first member with
>
> int foo()
> {
>     static class A { int a; }
>     return A.init.a.offsetof;
> }
>
> which will apply to any other non-nested class.
>

Thanks, I had not considered that. I think I should make a separate post
asking the opinion of the community on whether they would like
support for classes and, if so, in what form.


>
> Take a look at https://github.com/AuburnSounds/intel-intrinsics
>

Just a little bit more detail: from my research, the two instructions I needed were supposed to correspond somehow to these two `core.simd` calls:
    simd_stof!(XMM.LODD, void16)(v, XMM0);
    simd!(XMM.PSHUFD, 0, void16, void16)(XMM0, XMM0);

But I could not get them to work for the life of me.

I had not considered the intel intrinsics, which is silly considering
that I had watched a whole talk on this topic.
It is this one: https://www.youtube.com/watch?v=cmswsx1_BUQ
for anyone interested.

> Keep up the good work!

Thank you!

- Stefanos


July 05, 2019
On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
>
> An update regarding the project. There was a lot of turbulence in this project, so I'm sorry I did not post earlier.
>

I'm now moving to weekly updates. Before going over what I did, let me update
you on the state of the project.
The focus of the project has changed in the following ways:
- No assembly
- Generic and portable code
- Focus on LDC and GDC
- PRs to core.experimental

This week
==========
- Because of the above, this week I started replacing all the ASM
with SSE intrinsics and providing a simple implementation for when SIMD is not available
(a sketch of this split follows right after this list).
The goal was not only the replacement but also optimization for LDC.
Eventually (either as part of this summer or of future work), the simple implementation
should not be so "simple", but one that helps LDC and GDC optimize it without
the need to be explicitly inside `version (D_SIMD)`.

- I moved the functions in a common repository: https://github.com/baziotis/Dmemutils

- I made a draft PR in the D runtime: https://github.com/dlang/druntime/pull/2662
(Thanks to lesderid and wilzbach for their help).
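
As mentioned in the first point above, a rough sketch of the split I mean (the names are mine, not the actual Dmemutils code; the SIMD path uses the storeUnaligned helper of core.simd as available with DMD, while LDC / GDC would use their own equivalents):

version (D_SIMD)
{
    import core.simd : ubyte16, storeUnaligned;

    // SSE path: broadcast the value and do one unaligned 16-byte store.
    void set16(ubyte* dst, ubyte v)
    {
        ubyte16 x = v; // the scalar is broadcast into all 16 lanes
        storeUnaligned(cast(ubyte16*) dst, x);
    }
}
else
{
    // "Simple" fallback: a plain loop that GDC / LDC can still vectorise on their own.
    void set16(ubyte* dst, ubyte v)
    {
        foreach (i; 0 .. 16)
            dst[i] = v;
    }
}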


* A note on intel-intrinsics:
I first tried the intel-intrinsics library.
That worked great with LDC (I think it's focused on LDC),
not so well with DMD, and not at all with GDC.
With DMD, it didn't work in the sense that it generated "wrong" code. The problem is
that doing a load/store with intel-intrinsics and doing a load/store with load/storeUnaligned of core.simd does not generate the same code.
This matters for Dmemmove because it is very
sensitive to the order of instructions due to the overlap (e.g. here [1]).
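
To illustrate why the ordering matters, here is a hedged sketch (not the actual Dmemmove code) of the pattern the generated code has to preserve for overlapping buffers: every source chunk is loaded into registers before anything is stored.

import core.simd : void16, loadUnaligned, storeUnaligned;

// Copy 32 possibly overlapping bytes: read both chunks first, then write.
// If these were reordered into load/store, load/store, an overlapping
// destination could clobber source bytes that are still needed.
void copy32(void* d, const(void)* s)
{
    auto src = cast(const(void16)*) s;
    auto dst = cast(void16*) d;
    void16 lo = loadUnaligned(src);
    void16 hi = loadUnaligned(src + 1); // the next 16 bytes
    storeUnaligned(dst, lo);
    storeUnaligned(dst + 1, hi);
}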

So, I made my own intrinsics that differ depending on whether DMD or LDC is used.

Regarding GDC, I just couldn't get it to compile.

My purpose is not to disparage intel-intrinsics; it's a great library. This was just
my experience, and maybe I did something wrong. I have also tried to contact
the author, because maybe he has some insight.

* A note on GDC intrinsics:
GDC currently compiles to the naive version, because I don't know of
load/storeUnaligned equivalents for GDC.
Iain told me that I could use the i386 intrinsics
(which as far as I know are these [2]), but I could not get them to work in GDC.

Blockers
========

Only what I said above regarding GDC intrinsics.


[1] https://github.com/baziotis/Dmemutils/blob/master/Dmemmove/Dmemmove.d#L267
[2] https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-Built-in-Functions.html



Next week
==========
Sadly, I don't know. According to my schedule, the work on the allocator should have
started by now. But there were a couple of problems in the project, which changed its focus, and so there were things that had to be done that were not initially planned.
That means that the allocator work, which should have started by now, hasn't.

Other than that, the plans for the allocator changed when the project started, to things
that I'm not fully experienced with (from malloc(), free() etc. to using std.experimental.allocator).

So, how the project will continue is currently an open discussion.

If std.experimental.allocator is interesting to the community, I'm happy to discuss it
and learn how to continue.

If we fall back to classic malloc() / free() implementations, that is something
that can't be fully done in the time available. To make a complete replacement
of malloc() et al., one has to make a serious attempt at multi-threading and optimization.

_HOWEVER_, one possible alternative is to provide minimalistic versions of those functions
for "bare-metal" systems, meaning embedded systems or WASM. I think this
is interesting: not having a dependency on libc there, and having minimal
(regarding resources and code) implementations.
July 05, 2019
On Friday, 5 July 2019 at 11:02:00 UTC, Stefanos Baziotis wrote:
> - Because of the above, this week I started replacing all the ASM
> with SSE intrinsics and providing a simple implementation for when SIMD is not available.
> The goal was not only the replacement but also optimization for LDC.
> Eventually (either as part of this summer or of future work), the simple implementation
> should not be so "simple", but one that helps LDC and GDC optimize it without
> the need to be explicitly inside `version (D_SIMD)`.
>

Something important I forgot to mention is that GDC and LDC optimize the simple version of Dmemset
for my AMD Ryzen in such a way that it reaches total parity with libc memset.
The difference is amazing when working with the LLVM / GCC back-ends.

Unfortunately, I don't have an Intel to test. It would be really good to have benchmarks from Intel users.
July 05, 2019
On Friday, 5 July 2019 at 15:42:48 UTC, Stefanos Baziotis wrote:
> Unfortunately, I don't have an Intel to test. It would be really good to have benchmarks from Intel users.

Hi Stefanos,

This is great work. I hope Phobos will move away from clib some day.

As for the benchmarks.
I think you can post your results somewhere. Or you did. Unfortunately I cannot find them.

I tested Dmemset with dmd (ldc and gdc didn't compile) on i3-3220@3.30GHz (Ubuntu).

The strange thing is I get different results when I change the following line in benchmarks.d:
//note the upper bound
static foreach(i; 1..256)
 to
static foreach(i; 1..257)

(1..256)
127 24.1439 20.7726
128 24.333 20.8421
129 24.3768 20.9648

(1..257)
127 24.4276 25.8072
128 24.679 26.2316
129 24.8052 26.0236

So the D version becomes better. Maybe this is related to a different binary after compilation.


Some other results for "(1..257)" variant:

size(bytes) Cmemmove(GB/s) Dmemmove(GB/s)
1 0.269991 0.180151
2 0.438143 0.386652
3 0.657527 0.543067
4 1.00408 0.767028
5 1.26435 0.96617
6 1.51675 1.09579
7 1.76942 1.2771
8 2.02263 1.54563
9 2.27596 1.6421
10 2.52917 1.82534
11 2.78175 2.00729
12 3.03507 2.1897
13 3.28674 2.37267
14 3.53581 2.54155
15 3.79338 2.59328
16 5.25561 2.91728
17 5.58319 5.07972
18 5.91207 5.37934
19 6.24159 5.67784
20 6.56863 5.97583
21 6.84187 6.26141
22 7.22644 6.57598
23 7.55238 6.81922
24 7.88487 7.17182
...
39 9.85228 9.48541
40 10.1054 9.72436
41 10.3587 10.0661
42 10.5787 10.3286
43 10.862 10.661
44 11.1155 10.9688
45 11.3691 11.2042
46 11.6228 11.5771
47 11.8245 11.6284
48 12.1258 12.1853
49 12.3849 12.4931
...
59 14.7853 15.7441
60 15.165 16.1076
61 15.4095 16.4647
62 15.6639 16.803
63 15.9273 17.0932
64 16.1733 17.4991
65 11.862 17.671
66 12.0373 17.8678
67 12.2148 17.8533
68 12.4066 18.2475
69 12.5497 18.2762
...
124 23.6536 25.3192
125 23.9933 25.5515
126 24.2049 26.0169
127 24.4276 25.8072
128 24.679 26.2316
129 24.8052 26.0236
130 25.0353 26.446
131 24.8123 26.2339
132 25.2592 26.176
133 25.3562 26.6108
134 25.8571 26.8894
...
252 33.7209 33.9282
253 33.7367 34.1942
254 33.8958 34.59
255 33.412 33.6378
256 33.6542 34.661
500 39.5868 39.6527
700 43.7852 43.3711
3434 34.2489 45.8683
7128 35.2755 49.4049
13908 35.5447 51.2273
16343 35.0748 51.4501
27897 35.5615 51.0826
32344 35.1398 48.1469
46830 32.8887 34.9705
64349 33.2305 34.9398

Are they meaningful for you?

If you want I can run additional benchmarks for you. For details, maybe we can continue on GitHub. On the forum we can discuss some fundamental points.

Cheers,
Piotrek


July 06, 2019
On Friday, 5 July 2019 at 20:22:30 UTC, Piotrek wrote:
>
> Hi Stefanos,
>
> This is great work. I hope Phobos will move away from clib some day.
>

Hello, thank you! Yes, I hope so too. If the D runtime moves away, that
will make it easier for the rest of D.

> As for the benchmarks.
> I think you can post your results somewhere. Or you did. Unfortunately I cannot find them.

You're right, my mistake, there are no recent benchmarks. I'll try to post
today. They're similar to yours.

> I tested Dmemset with dmd (ldc and gdc didn't compile) on i3-3220@3.30GHz (Ubuntu).

That's weird. Could you give some more info on how you compiled?
Did you use the procedure described in the README?
Meaning, `rdmd run benchmarks gdc` and `rdmd run benchmarks ldc`.
I just checked and there was a regression, which is now fixed. But even with this
regression, I could compile the benchmarks for gdc but not for ldc or dmd.

> The strange thing is I get different results when I change the following line in benchmarks.d:
> So the D version becomes better. Maybe this is related to a different binary after compilation.

That is indeed strange but not too unexpected. A compiler (more likely in
the DMD back-end) might decide to do strange things for reasons I don't know.
I'll try to reproduce similar behavior on my machine.

>
> Some other results for "(1..257)" variant:
>
> Are they meaningful for you?
>

They are, thank you! The benchmarks are good.

Just some more info for anyone interested:
Regarding sizes 1-16: with GDC / LDC, in my benchmarks
(and, by reading the ASM, I assume in all benchmarks), it reaches parity
with libc (note that for sizes 1-16 the naive version is used,
meaning a simple for loop). Now, for such small sizes, the standard way to go
is a fall-through switch (I can give more info on that if someone is interested;
a sketch is shown right below). The problem with that is that it's difficult for the compiler
to optimize it along with the rest of the code. Or at least, I didn't find a way
to do it. And so, I use the naive version, which is only slightly slower but
doesn't affect bigger sizes.
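
A sketch of the fall-through switch technique (an illustration, not the Dmemset code); each case sets one byte and `goto case;` falls through to the next smaller size:

// n is assumed to be <= 8 here.
void setSmall(ubyte* dst, ubyte v, size_t n)
{
    switch (n)
    {
        case 8: dst[7] = v; goto case;
        case 7: dst[6] = v; goto case;
        case 6: dst[5] = v; goto case;
        case 5: dst[4] = v; goto case;
        case 4: dst[3] = v; goto case;
        case 3: dst[2] = v; goto case;
        case 2: dst[1] = v; goto case;
        case 1: dst[0] = v; goto case;
        case 0: break;
        default: assert(0);
    }
}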

Another important thing is that differences of +/- 1 GB/s should not be considered significant. The reason
is that at some point I benchmarked libc memset() against libc memset() itself and
there were +/- 1 GB/s differences.

> If you want I can run additional benchmarks for you.

Thanks, I don't want to pressure you. If you have time, I'm interested in
some feedback on GDC / LDC (whether they compile and / or benchmarks).
My guess is that especially with GDC / LDC (and DMD, but I'm not yet sure
for DMD across different hardware), Dmemset can actually replace libc memset().

For Dmemmove / Dmemcpy it is harder to have a clear winner.

> For details, maybe we can continue on GitHub. On the forum we can discuss some fundamental points.

I'm available to you or anyone else to give additional info / explanations etc.
on every line of code, decision, alternative implementation, possible
improvement etc. You can post here, or contact me on Slack or by email.
Some of these things will be added to the READMEs in the end, but we can
go into more detail.

Best regards,
Stefanos
July 06, 2019
On Friday, 5 July 2019 at 15:42:48 UTC, Stefanos Baziotis wrote:
>
> Unfortunately, I don't have an Intel to test. It would be really good to have benchmarks from Intel users.
>

A kind request to anyone interested in helping: for the time being, please
give priority to Dmemset (as Piotrek did).
It is in a somewhat final and polished state,
so we can have a more fruitful discussion without it undergoing big changes.

Dmemcpy / Dmemmove will undergo some changes (not fundamental ones, I hope, but certainly
layout, naming etc.) before they can be PR'd to the D runtime too.
July 06, 2019
On Saturday, 6 July 2019 at 11:07:41 UTC, Stefanos Baziotis wrote:
>> As for the benchmarks.
>> I think you can post your results somewhere. Or you did. Unfortunately I cannot find them.
>
> You're right, my mistake, there are no recent benchmarks. I'll try to post
> today. They're similar to yours.


>> I tested Dmemset with dmd (ldc and gdc didn't compile) on i3-3220@3.30GHz (Ubuntu).
>
> That's weird. Could you give some more info on how you compiled?

I used the old repo for Dmemset. With Dmemutils it works now. I removed static foreach from benchmark.d in order to run gdc.
Text results:
https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output

>
>> The strange thing is I get different results when I change the following line in benchmarks.d:
>> So the D version becomes better. Maybe this is related to a different binary after compilation.
>
> That is indeed strange but not too unexpected. A compiler (more likely in
> the DMD back-end) might decide to do strange things for reasons I don't know.
> I'll try to reproduce similar behavior on my machine.

It seems it wasn't related to this change. Looks like heisen optimization.

>
> Just some more info for anyone interested:
> Regarding sizes 1-16: with GDC / LDC, in my benchmarks
> (and, by reading the ASM, I assume in all benchmarks), it reaches parity
> with libc (note that for sizes 1-16 the naive version is used,
> meaning a simple for loop). Now, for such small sizes, the standard way to go
> is a fall-through switch (I can give more info on that if someone is interested).
> The problem with that is that it's difficult for the compiler
> to optimize it along with the rest of the code. Or at least, I didn't find a way
> to do it. And so, I use the naive version, which is only slightly slower but
> doesn't affect bigger sizes.

Funnily enough, DMD (with Dmemset) holds the speed record, over 50 GB/s, copying some big block sizes.
However, aren't smaller sizes more important?


> My guess is that especially with GDC / LDC (and DMD, but I'm not yet sure
> for DMD across different hardware), Dmemset can actually replace libc memset().

One issue is that it should be tested on all variations of HW and OS.
At least it can be placed in an experimental module.


Cheers,
Piotrek

July 06, 2019
On Saturday, 6 July 2019 at 15:33:44 UTC, Piotrek wrote:
>
>
> I used the old repo for Dmemset. With Dmemutils it works now. I removed static foreach from benchmark.d in order to run gdc.
> Text results:
> https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output
>

Great, earlier today I realized that there were problems with static foreach,
so the main repo now only uses a mixin.

Basically, I should have been able to do:
version (GNU)
{
    // mixin
}
else
{
    static foreach
}

but that didn't work, meaning GDC tried to compile static foreach

Anyway, the benchmarks look good. With DMD, small sizes are not so good but the big
ones are better. But DMD is not the focus, since the focus has now changed to GDC and LDC.

If you're interested, there are a lot of things to say regarding optimization for DMD. Some have been said in this thread, as initially the project was focused on DMD. I'm actually thinking of writing an article so that maybe I can help the next person who tries to optimize for DMD. I don't think it's a good decision to care at all about optimization in DMD, but one might. And it's a hard road.
A tl;dr is that, for me at least, the only way to reach parity with libc is using (inline) ASM.

But the important benchmarks are the GDC and LDC ones, which agree with my benchmarks
on AMD: Dmemset reaches total parity with libc memset().
It's great to have that confirmed by an Intel user as well, thanks for your time!

>
> It seems it wasn't related to this change. Looks like heisen optimization.
>

Again, DMD. Quite an unexpected compiler.

>
> Funnily enough, DMD (with Dmemset) holds the speed record, over 50 GB/s, copying some big block sizes.
>

DMD might have been able to get these results
due to inlining that was unrelated to the actual function (i.e. the benchmark code got inlined).

>
> However, aren't smaller sizes more important?
>

Again, fortunately DMD is not the focus, but I guess one way to somewhat answer this question is to do a report of the sizes used in the D runtime, since this work is targeted at the D runtime.
Something like this: https://forum.dlang.org/post/jdfiqpronazgglrkmwfq@forum.dlang.org

But this is not enough. A big part of optimization is knowing the most
common cases (which could be the data format, size, hardware etc.) and optimizing
for those first. And such a report is not adequate to show us the most common cases.

- For one, eventually different sizes might be added or removed and so the
common cases might change.
- Someone might want to use this function outside of the D runtime.

So, Dmemset() should be on par with or better than libc, which is (currently) achieved.

Note something interesting: GDC gets these results with the naive version, which
is literally an 8-line for loop (roughly the shape sketched below).
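
A sketch of that shape (not the exact Dmemset code):

// The "naive" version: a plain byte loop that GDC / LDC auto-vectorise well.
void memsetNaive(void* dst, ubyte val, size_t n)
{
    ubyte* d = cast(ubyte*) dst;
    foreach (i; 0 .. n)
        d[i] = val;
}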

>
> One issue is it should be tested on all variation of HW and OS.
> At least it can be placed in experimental module.

Right, it's currently PR'd to the D runtime: https://github.com/dlang/druntime/pull/2662
Just like you said, in an experimental module. :P

Best regards,
Stefanos


July 20, 2019
On Friday, 5 July 2019 at 11:02:00 UTC, Stefanos Baziotis wrote:
>
> I'm now moving to weekly updates. Before going over what I did, let me update
> you on the state of the project.

Last 2 Weeks
============

I could not do weekly updates because, unfortunately, a lot of things in the project went off schedule.

So basically, over the last 2 weeks I improved memcpy() / memmove() so they can be PR'd to the druntime. This [1] was the first PR. It had to be split into separate
PRs for memcpy() and memmove(). Yesterday, an important question was answered, which let me open a new PR for memcpy() [2].

Along with that, I created a memcmp() replacement [3]. I'm relatively satisfied with how the code looks, but it can't be PR'd to the druntime yet due to performance
problems (more on that in the blockers).

Blockers
========

-- On memcmp:

That was my post on Slack:

There are 3 major problems:
1) The performance is really, really bad. Assuming that I have not done something stupid, it's just really bad. And actually, the asm generated (by `LDC`) is really weird too.
2) I made a version of Agner Fog's `memcmp` (which incidentally is similar to mine; it just goes in reverse and does some smart things with subtractions). The thing is:
   a) Mine and Agner's should perform about the same, but they don't (Agner's is way better).
   b) Agner's is still quite slow compared to `libc`.
3) The `LDC` build gives some very weird results for `libc memcmp` in the benchmarks. And actually, the -O3 ASM generated by LDC seems bad as well.
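
For reference, the function being replaced here is libc memcmp; its simplest byte-wise form looks like this (a textbook sketch, not the Dmemcmp code):

// Return <0, 0 or >0, like libc memcmp does.
int memcmpNaive(const(ubyte)* a, const(ubyte)* b, size_t n)
{
    foreach (i; 0 .. n)
    {
        if (a[i] != b[i])
            return cast(int) a[i] - cast(int) b[i];
    }
    return 0;
}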

-- The state of the project

Right now, there is no specific roadmap nor any specific goals moving forward.
The project was divided into 2 parts. One was memcpy() et al., which included
memcpy(), memmove() and memcmp(), and the second was the allocator.
The first part is mostly done. After discussions with Seb, we decided
that the second part is not really needed after Microsoft's mimalloc:
https://forum.dlang.org/thread/krsnngbaudausabfsqkn@forum.dlang.org

So, currently I don't know how to move forward. I asked on the druntime
whether I could help with anything, and zombinedev and Nicholas Wilson proposed
refactorings of core.thread. Nicholas helped me get started with that, so this
is going to be the next thing I do. But this is supposed to be quick.

If anyone has any proposal on what to do next, I'm glad to discuss.


[1] https://github.com/dlang/druntime/pull/2671
[2] https://github.com/dlang/druntime/pull/2687
[3] https://github.com/baziotis/Dmemutils/tree/master/Dmemcmp
July 24, 2019
On 06.07.19 18:10, Stefanos Baziotis wrote:
> 
> Basically, I should have been able to do:
> version (GNU)
> {
>      // mixin
> }
> else
> {
>      static foreach
> }
> 
> but that didn't work, meaning GDC tried to compile static foreach

It won't compile it, but it will attempt to parse it.

You should be able to do:

version(GNU){ /+mixin+/ }
else mixin(q{ /+static foreach+/ });
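
A concrete (hypothetical) instance of that pattern, with a made-up runBenchmark entry point: the static foreach only ever appears inside a q{} token string, which is merely lexed, so a front-end that cannot parse static foreach never chokes on it, and the mixin is only expanded in the non-GNU branch.

void runAll()
{
    version (GNU)
    {
        // GDC path: plain code (or a string-mixin-generated unrolling) goes here.
    }
    else
    {
        mixin(q{
            static foreach (i; 1 .. 257)
                runBenchmark!i();   // hypothetical benchmark entry point
        });
    }
}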