April 27, 2020
On 4/27/20 1:04 AM, Mathias LANG wrote:
> On Sunday, 26 April 2020 at 16:20:19 UTC, Steven Schveighoffer wrote:
>>
>> In terms of performance, depending on the task at hand, D1 appending is slower than D2 appending, because D2 has a thread-local cache for appending while D1 only has a global one-array cache for the same purpose. However, I'm assuming that since you were focused on D1, your usage was naturally written to take advantage of what D1 has to offer.
>>
>> The assumeSafeAppend call also uses this cache, and so it should be quite fast. But setting length to 0 is a ton faster, because you aren't calling an opaque function.
>>
>> So depending on the usage pattern, D2 with assumeSafeAppend can be faster, or it could be slower.
> 
> Well, Sociomantic didn't use any kind of multi-threading in "user code".
> We had single-threaded fibers for concurrency, and process-level scaling for parallelism.
> Some corner cases were using threads, but it was for low level things (e.g. low latency file IO on Linux), which were highly scrutinized and stayed clear of the GC AFAIK.
> 
> Note that accessing TLS *does* have a cost which is higher than accessing a global. 

That is a minor cost compared to the actual appending.

> By this reasoning, I would assume that D2 appending would definitely be slower, although I never profiled it.

I tested the performance when I added the feature. D2 was significantly and measurably faster (at least when appending to 2 or more arrays). I searched through my old email: for appending 5M bytes to 2 arrays, the original code took 13.99 seconds (on whatever system I was using in 2009), versus 1.53 seconds with the cache.

According to that email, I had similar results even with a 1-element cache, so somehow my code was faster, though I didn't know why. Quite possibly it's because D1's cache for looking up block info is behind the GC lock.

Literally the only thing that is more expensive in D2 vs. D1 is the truncation of arrays. In D1 this means setting the length to 0; in D2, you need to call assumeSafeAppend. This is why I suggested a flag that allows you to enable the original behavior.
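
To make that concrete, here is a minimal sketch of the reuse pattern in question (names are illustrative):

void reuseBuffer(ref char[] buf)
{
    buf.length = 0;         // in D1, this alone re-enabled in-place appending
    buf.assumeSafeAppend(); // in D2, you must also reassert ownership
    buf ~= "next record";   // appends into the existing allocation
}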

> What I did profile tho, is `assumeSafeAppend`. The fact that it looks up GC metadata (taking the GC lock in the process) made it quite expensive given how often it was called (in D1 it was simply a no-op, and called defensively).

The cache I referred to is for looking up the GC metadata. In essence, when you append, you will look it up anyway. Either assumeSafeAppend or append will pull the GC metadata into the cache; after that, it's a straight cache lookup that doesn't take a lock or do any expensive searches. The cache is designed to favor the most recently used arrays.

This is an 8-element cache, so there are still cases where you will have issues (like if you round-robin append to 9 arrays; see the sketch below). I believe 8 elements was a sweet spot that allowed reasonably fast appending with a reasonable number of concurrent arrays.

Where D1 falls down is when you are switching between more than one array, because D1's cache has only one element. And even with just one array, that cache belongs not to the array runtime but to the GC, and it is keyed on the pointer queried, not the block data. A GC collection, for instance, will invalidate the cache.
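
As a hypothetical illustration of both limits:

void main()
{
    // Nine arrays, one more than D2's 8-element cache holds: each
    // append evicts the cache entry the next append needs, so every
    // append falls back to a full GC metadata lookup. D1's one-element
    // cache hits the same wall with just two arrays.
    int[][9] arrs;
    foreach (i; 0 .. 1_000_000)
        arrs[i % 9] ~= i;
}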

> 
>>> IIRC Mathias has suggested that it should be possible to tag arrays as intended for this kind of re-use, so that stomping prevention will never trigger, and you don't have to `assumeSafeAppend` each time you reduce the length.
>>
>> I spoke for a while with Dicebot at DConf 2016 or '17 about this issue. IIRC, I suggested either using a custom type or a custom runtime. He was not interested in either of these ideas, and it makes sense (large existing code base, didn't want to stray from mainline D).
>>
>> By far, the best mechanism to use is a custom type. Not only will that fix this problem, since you can implement whatever behavior you want, but you also do not need to call opaque functions for appending. It should outperform anything you could do in a generic runtime.
> 
> Well... Here's something I never really quite understood, actually: Mihails *did* introduce a buffer type. See https://github.com/sociomantic-tsunami/ocean/blob/36c9fda09544ee5a0695a74186b06b32feda82d4/src/ocean/core/Buffer.d#L116-L130
> 
> And we also had a (very old) similar utility here: https://github.com/sociomantic-tsunami/ocean/blob/36c9fda09544ee5a0695a74186b06b32feda82d4/src/ocean/util/container/ConcatBuffer.d 
> 
> I always wanted to unify this, but never got to it. But if you look at the first link, it calls `assumeSafeAppend` twice, before and after setting the length. In practice it is only necessary *after* reducing the length, but as I mentioned, this is defensive programming.

Yeah, that is unnecessary. It is not going to be that expensive, especially if you were just appending to that array, but again, it is more expensive than setting a word to 0.

> 
> For reference, most of our applications had a principled buffer use. The buffers would rarely be appended to from more than one, perhaps two places. However, slices to the buffer would be passed around quite liberally. So a buffer type from which one could borrow would indeed have been optimal.

This all actually works better with the new runtime. The old one would reallocate if you appended to a slice that didn't start at the block start. The new version can detect that the slice ends at the used end of the block and allow appending in place.
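
A sketch of the difference (assuming sufficient capacity in the block):

void main()
{
    int[] a = [1, 2, 3, 4];
    auto tail = a[2 .. $]; // slice does not start at the block start
    tail ~= 5; // D1: always reallocates here.
               // D2: sees that tail ends at the used end of the block
               // and can append in place, capacity permitting.
}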

> 
>> Note that this was before (I think) destructor calls were added. The destructor calls are something that assumeSafeAppend is going to do, and won't be done with just setting length to 0.
>>
>> However, there are other options. We could introduce a druntime configuration option so when this specific situation happens (slice points at start of block and has 0 length), assumeSafeAppend is called automatically on the first append. Jonathan is right that this is not @safe, but it could be an opt-in configuration option.
>>
>> I don't think configuring specific arrays makes a lot of sense, as this would require yet another optional bit that would have to be checked and allocated for all arrays.
>>
> 
> I don't even know if we had a single case where we had arrays of objects with destructors. The vast majority of our buffers were `char[]` and `ubyte[]`. We had some elaborate types, but I think destructors + buffers would have been frowned upon in code review.

Of course! D1 didn't have destructors for structs ;)

> 
> Also, the reason we didn't modify druntime to just have the D1 behavior (that would have been a trivial change) was how dependent on the new behavior druntime had become. It was also the motivation for the suggestion Joe mentioned. AFAIR I mentioned it in an internal issue and did a PoC implementation, but never got it to a state where it was mergeable.

Having a flag per array is going to be costly, but actually, there's already a lot of other junk in the block itself. Perhaps there's a spare bit somewhere that could serve as a flag for the append behavior.

> 
> Also, while a custom type might sound better, it doesn't really interact well with the rest of the runtime, and it's an extra word to pass around (if passed by value).

The "extra value" can be stored elsewhere -- just like the GC you could provide metadata for the capacity in a global AA or something.

In any case, there were options. The way druntime is written, it gives pretty good performance -- in most cases BETTER performance than D1 for idiomatic D code. In fact, the Tango folks asked me if I could add the feature to Tango's druntime, but I couldn't, because it depends on TLS.
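
Roughly why (a sketch; the real druntime code differs):

import core.memory : GC;

// In D2, module-level variables are thread-local by default, so the
// runtime's lookup cache needs no lock:
GC.BlkInfo[8] blkCache; // hypothetical name; one copy per thread

// In D1 (and thus Tango's runtime) all globals were process-global,
// so an equivalent cache would have needed locking.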

For code that was highly focused on optimizing around D1 and its idiosyncrasies, it probably has worse performance.

The frustration is understandable, but without the possibility of adaptation, there's not much one can do.

-Steve
April 27, 2020
On 27.04.20 07:40, Walter Bright wrote:
> On 4/26/2020 2:52 PM, Timon Gehr wrote:
>> I can't do that because you did not agree it was a bug. According to your DIP and past discussions, the following is *intended* behavior:
>>
>> int bar(ref int x,ref int y)@safe @live{
>>      x=0;
>>      y=1;
>>      return x;
>> }
>>
>> void main()@safe{
>>      int x;
>>      import std.stdio;
>>      writeln(bar(x,x)); // 1
>> }
>>
>> I have always criticized this design, but so far you have stuck to it. I have stated many times that the main reason it is bad is that you don't actually enforce any new invariant, so @live does not enable any new patterns, at least in @safe code.
>>
>> In particular, if you start optimizing based on non-enforced and undocumented @live assumptions, @safe @live code will not be memory safe.
>>
>> You can't optimize based on @live and preserve memory safety. This is because @live is tied to functions instead of types (given that you want to preserve interoperability). In its current form, @live is useless except perhaps as a linting tool.
> 
> @live's invariants rely on the arguments passed to it conforming to its requirements. It's analogous to @safe code relying on its arguments conforming.
> ...

No, it is not analogous, because only @system or @trusted code can get that wrong, not @safe code. @safe code itself is (supposed to be) verified, not trusted.

> To get the checking here, main would have to be declared @live, too.

I understand the design. It just does not make sense. All of the code is annotated @safe, but if you optimize based on unverified assumptions, it will not be memory safe. Is the goal of @live really to undermine @safe's guarantees?
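
To spell out the hazard with the example above (a sketch of the kind of miscompilation at stake; with pointers or array lengths in place of ints, the same reasoning breaks memory safety):

int bar(ref int x, ref int y) @safe @live {
    x = 0;
    y = 1;
    return x; // if the optimizer assumes x and y never alias, it can
              // fold this to `return 0;` -- wrong for the unverified
              // call bar(x, x), which actually returns 1
}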
April 27, 2020
On 4/27/2020 7:26 AM, Timon Gehr wrote:
> I understand the design. It just does not make sense. All of the code is annotated @safe, but if you optimize based on unverified assumptions, it will not be memory safe.

It is a good point. The design of @live up to this point did not change the way code was generated. I still want to see how much of a difference it makes, and will implement it but make it an option.

April 28, 2020
On Monday, 27 April 2020 at 14:26:50 UTC, Timon Gehr wrote:
>
> No, it is not analogous, because only @system or @trusted code can get that wrong, not @safe code. @safe code itself is (supposed to be) verified, not trusted.

The existence of any overly @trusting code renders @safe code liable to cause memory safety bugs. While the invalid accesses won't occur inside @safe code, they can definitely be caused by it, even without the buggy @safe code calling anything @trusted.

Some day I'll have time to write up all my (many, many pages of) notes on this stuff... It would have been for DConf; I guess now for DConf Online?
April 28, 2020
On Tuesday, 28 April 2020 at 13:44:05 UTC, John Colvin wrote:
> On Monday, 27 April 2020 at 14:26:50 UTC, Timon Gehr wrote:
>>
>> No, it is not analogous, because only @system or @trusted code can get that wrong, not @safe code. @safe code itself is (supposed to be) verified, not trusted.
>
> The existence of any overly @trusting code renders @safe code liable to cause memory safety bugs. While the invalid accesses won't occur inside @safe code, they can definitely be caused by it, even without the buggy @safe code calling anything @trusted.
>
> Some day I'll have time to write up all my (many, many pages of) notes on this stuff... It would have been for DConf; I guess now for DConf Online?

Would be eager to listen.
April 29, 2020
On 28.04.20 15:44, John Colvin wrote:
> The existence of any overly @trusting code renders @safe code liable to cause memory safety bugs. While the invalid accesses won't occur inside @safe code, they can definitely be caused by it, even without the buggy @safe code calling anything @trusted.

I don't see how you arrive at "buggy @safe code" here. You say it yourself: When there is "overly @trusted code", then that's where the bug is.
April 29, 2020
On 29.04.20 00:36, ag0aep6g wrote:
> On 28.04.20 15:44, John Colvin wrote:
>> The existence of any overly @trusting code renders @safe code liable to cause memory safety bugs. While the invalid accesses won't occur inside @safe code, they can definitely be caused by it, even without the buggy @safe code calling anything @trusted.
> 
> I don't see how you arrive at "buggy @safe code" here. You say it yourself: When there is "overly @trusted code", then that's where the bug is.

I guess he is talking about the case where @trusted code calls buggy @safe code and relies on its correctness to ensure memory safety. (However, this is still the fault of the @trusted code; @safe code cannot be blamed for violations of memory safety.)
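
A minimal sketch of that case (hypothetical code):

size_t badIndex() @safe {
    return 10; // buggy, but still perfectly memory-safe in isolation
}

int get() @trusted {
    int[10] a;
    // Relies on badIndex() returning less than 10 without checking it;
    // the resulting out-of-bounds read is this function's fault, not
    // badIndex's.
    return a.ptr[badIndex()];
}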
April 29, 2020
On Friday, 24 April 2020 at 19:27:40 UTC, Arine wrote:
> On Thursday, 23 April 2020 at 15:57:01 UTC, drug wrote:
>> And your statement that Rust assembly output is better is wrong.
>
> There most definitely is a difference, and the assembly generated with Rust is better. This is just a simple example to illustrate the difference. If you don't know why the difference is significant or why it is happening, there are a lot of great articles out there. Sadly, there are people such as yourself spreading misinformation who don't know what a borrow checker is, don't know Rust, or why it has gone as far as it has. This is why the borrow checker for D is going to fail: the person designing it, such as yourself, doesn't have any idea what they are redoing and has never even bothered to touch Rust or learn about it. Anyways, I'm not your babysitter; if you don't understand the above (as most people seem not to bother learning assembly anymore), you're on your own.

A competent C programmer could just write something like this, or use restrict...

int test(int* x, int* y) {
    int result = *x = 0;  /* keep the stored value so *x need not be
                             re-read after *y = 1 (x and y may alias) */
    *y = 1;
    return result;
}

Compiled with gcc -O, this produces:

test(int*, int*):
        mov     DWORD PTR [rdi], 0
        mov     DWORD PTR [rsi], 1
        mov     eax, 0
        ret

https://godbolt.org/z/rpM_eK

So the statement that Rust produces better assembly is wrong.
It's on my todo-list to learn Rust. What is really off-putting are those random fanatical Rust fanboys.
In your language:
"If you don't know why the difference is significant or why it is happening",
you should probably learn C before you start insulting people in a programming forum ;)

April 29, 2020
On Wednesday, 29 April 2020 at 10:32:33 UTC, random wrote:
>
> A competent C programmer could just write something like this, or use restrict...
>
> int test(int* x, int* y) {
>     int result = *x = 0;
>     *y = 1;
>     return result;
> }
>

I'm incompetent, so I would just write:

int test(int* x, int* y) {
     *x = 0;
     *y = 1;
     return 0;
}
April 29, 2020
On Wednesday, 29 April 2020 at 10:32:33 UTC, random wrote:

I forgot to add this...
Compiled with gcc -O3, it produces:

test(int*, int*):
        mov     DWORD PTR [rdi], 0
        xor     eax, eax
        mov     DWORD PTR [rsi], 1
        ret

https://godbolt.org/z/xW6w6W