December 24, 2009
Steven Schveighoffer wrote:
> grauzone Wrote:
> 
>> Steven Schveighoffer wrote:
>>> The problem is that adding a capacity field only works if the object is a reference type.  A value type will either make copies of the capacity field, resulting in the same stomping problem (and maybe worse), or zero out one copy's capacity field, resulting in an artificial limitation for appending.  The only true way to make this work right is to have a capacity field shared among all the aliases to the same data, and the most logical place for that is in the allocated block.
>> I see... Go makes slices value types (which reference the array data), but I'm not sure how or if they handle this situation.
>>
>> Then, why use a cache to find the pointer to the capacity field inside the array memory block? You could just store a pointer to the capacity field in the slice.
>>
>> Something like this, I guess:
>>
>> struct slice {
>>      void* ptr;
>>      size_t length;
>>      array_info* array;
>> }
>>
>> [snip]
>>
>> Would this work?
> 
> Yes, but the problem I see with it is you are passing around that array_info pointer with the slice for *all* slices, not just ones you intend to use appending on.  There is plenty of code that uses arrays and slices without ever doing appending, so that extra pointer is wasted space, and also increases the cost of passing slices by value.

Suddenly having an additional field in the slice is the problem? Wasn't semantics the problem (see your first reply to me above in this thread).

Is having an additional field in the slice really a problem?

You could just provide a "light" slice type. This wouldn't have the additional helper field, and would consist only of a pointer and length field. You could make it implicitly convertible to normal slices.

These light slices could be used by anyone who badly needs the additional performance. In fact, they could simply be implemented as a library type.

I see no problem with adding an additional field to usual slices. There's also some benefit: clearer semantics (if you make guaranteed performance part of the semantics) and a much simpler runtime.

> Your idea is essentially the idea behind my patch, except you can get the array info without having an additional pointer.  The cost of lookup for the block info is mitigated by the cache, making lookups cost very little.

But the cache is volatile, and a cache entry could be easily flushed by doing many array append operations. A single string processing function could replace the entire cache with entries for string temporaries, and cause nasty performance regressions in unrelated parts of the program (e.g. in places where you really don't want each append operation to reallocate the full array). Worst of all, this behavior will be relatively nondeterministic. It's also hard to explain to "normal" programmers.

Also, isn't it completely useless? If the programmer wants to append to an array, he'll use an append struct just to be sure, and if he doesn't want to append, inefficient array slice appending wouldn't be a problem.

You could avoid these issues just by adding a new field to the slice struct.

Needless to say, T[new] would be the correct solution here. It also has cleaner semantics. Yes, I heard it sucked and it was considered a failure by Walter/Andrei, but we still don't know why.

> The patch is a compromise between performance for appending and performance for everything else.  I think there will still be a need in specialized code for a fatter array type that allows appending without complicated GC lookups, but the patch provides safe appending for a reasonable performance hit without affecting other aspects of array usage.  As with all compromises, there are always different ways of doing things, so time will tell if this patch is the best method.
> 
> -Steve
December 24, 2009
Steven Schveighoffer wrote:
> Leandro Lucarella Wrote:
> 
>> This is making the half-value, half-reference semantics of arrays
>> even worse, since now the capacity is "passed by reference" (at the
>> end of the array data) but the length and ptr as values.
> 
> This is absolutely false.  It does not make anything worse.  It may
> not make it as good as you want, but it definitely is better than the
> current situation.  The current situation is that the capacity is
> "invented" by the runtime to be exactly what you need if the slice is
> at the beginning of a block, or 0 if it's not.  Such a scheme is
> totally unsafe.  The proposed patch leaves all aspects of arrays and
> slices alone *except* for this one thing.
> 
>> I don't comment much about the patch (and the design) because I
>> already did so in the several threads about it. I think is a very
>> bad idea, I think dynamic arrays should be a proper reference type
>> (with ptr, length and capacity) and slices should be a value type
>> without appending (only with ptr and length).
> 
> I don't agree.  I like that slices and arrays are the same type, it
> allows for much cleaner code, and much less function duplication.  I
> see the "append extends into a block" as an optimization, one that
> currently is invalid, and with the patch is valid.  Disallowing
> appending to slices means you must generate arrays whenever you want
> to do appending (which copies the entire slice), incurring a higher
> cost for little gain.  I'd actually say no gain.  Do you have an
> example where such a scheme is better than arrays with my patch?
> 
> Note that nobody is stopping you from making such array types, you
> are free to add them.

At the risk of restarting the debate around T[new], let me note that Steve's intuition matches our experience when we tried to define T[new].

Andrei
December 24, 2009
grauzone Wrote:

> Steven Schveighoffer wrote:
> > Yes, but the problem I see with it is you are passing around that array_info pointer with the slice for *all* slices, not just ones you intend to use appending on.  There is plenty of code that uses arrays and slices without ever doing appending, so that extra pointer is wasted space, and also increases the cost of passing slices by value.
> 
> Suddenly having an additional field in the slice is the problem? Wasn't semantics the problem (see your first reply to me above in this thread).

In my original reply, I focused on the fact that your original solution *doesn't work*.  Yes, it also suffered from the extra baggage, but that was a lesser problem.

> 
> Is having an additional field in the slice really a problem?
> 
> You could just provide a "light" slice type. This wouldn't have the additional helper field, and would consist only of a pointer and length field. You could make it implicitly convertible to normal slices.

Yeah, all this is possible.  Whether it's better is subject to opinion or testing.

> I see no problem with adding an additional field to usual slices. There's also some benefit: clearer semantics (if you make guaranteed performance part of the semantics) and a much simpler runtime.

A simpler runtime is not any benefit IMO.  Nobody cares how array appending works; they just know the semantics.  The semantics wouldn't be any different with your type, so I don't really get that point.

> > Your idea is essentially the idea behind my patch, except you can get the array info without having an additional pointer.  The cost of lookup for the block info is mitigated by the cache, making lookups cost very little.
> 
> But the cache is volatile, and a cache entry could be easily flushed by doing many array append operations.  A single string processing function could replace the entire cache with entries for string temporaries, and cause nasty performance regressions in unrelated parts of the program (e.g. in places where you really don't want each append operation to reallocate the full array).

I'm not sure you understand the semantics.  When an array's block info is in the cache, no lookup of the block info from the GC is required, so the lock isn't taken, and the lookup is as simple as a linear search through 8 elements.  But when it is not in the cache, it is looked up via the GC, requiring a lock and a look through all the memory pools (I'm unsure of the performance, but it clearly is slower).  However, a cache miss does not mean an automatic reallocation, it just means the lookup of the block info is slower.  It can *still* be extended into the block.  Basically, the algorithm performance of appending with a cache miss is the same as a cache hit, just with a larger constant.

> Worst of all, this behavior will be relatively nondeterministic. It's also hard to explain to "normal" programmers.
> 
> Also, isn't it completely useless? If the programmer wants to append to an array, he'll use an append struct just to be sure, and if he doesn't want to append, inefficient array slice appending wouldn't be a problem.

Again, I think these points may be fostered by a misunderstanding of the patch.  The performance is much better than current performance, and people have had no problem dealing with the current runtime except in special circumstances.

-Steve
December 24, 2009
Leandro Lucarella Wrote:

> Steven Schveighoffer, el 24 de diciembre a las 10:15 me escribiste:
> > This is absolutely false.  It does not make anything worse.  It may not make it as good as you want, but it definitely is better than the current situation.
> 
> I meant from the language semantics POV (you have more rules to remember about arrays), I agree it makes things better from a practical POV.

In fact, semantics are simpler.  You can completely eliminate the explanation about stomping in the spec.

> > > I don't comment much about the patch (and the design) because I already
> > > did so in the several threads about it. I think is a very bad idea,
> > > I think dynamic arrays should be a proper reference type (with ptr, length
> > > and capacity) and slices should be a value type without appending (only
> > > with ptr and length).
> > 
> > I don't agree.  I like that slices and arrays are the same type, it
> > allows for much cleaner code, and much less function duplication.  I see
> > the "append extends into a block" as an optimization, one that currently
> > is invalid, and with the patch is valid.  Disallowing appending to
> > slices means you must generate arrays whenever you want to do appending
> > (which copies the entire slice), incurring a higher cost for little
> > gain.  I'd actually say no gain.  Do you have an example where such
> > a scheme is better than arrays with my patch?

No example?

> I said this several times but here it comes again: We agree that the current scheme is better in terms of performance than having a proper reference type dynamic array and a separated slice value type.
> 
> The thing is, I think optimization is nice as long as it doesn't leak the abstractions to the user. The current scheme is clearly this scenario: to make an optimization (one that probably a lot of people wouldn't need), you are making things harder for the user and creating a monster, half array, half slice, that is harder to understand than 2 smaller and simpler types.

Really?  You think splitting the features into 2 different types with several overlapping features is easier to explain?  I think that is why Andrei nixed the T[new] idea, because when he tried to explain it in TDPL, it was very unnatural.

> You think this performance gain is more important than the added complexity; I think the other way. You think D should be *by default* as fast as possible, even when making the programmer's life a little harder, and let the user create or use library types for easy use, and I think it should be the exact opposite: make *default* things easier and not as fast, and let the user create/use library types if they need top performance/optimizations.

I think the whole basis of your argument is invalid -- the current array/slice combination type is easier to use and explain *and* performs better.

-Steve

December 24, 2009
Steven Schveighoffer wrote:
>> I see no problem with adding an additional field to usual slices. There's also some benefit: clearer semantics (if you make guaranteed performance part of the semantics) and a much simpler runtime.
> 
> A simpler runtime is not any benefit IMO.  Nobody cares how array appending works; they just know the semantics.  The semantics wouldn't be any different with your type, so I don't really get that point.

Maybe...

>>> Your idea is essentially the idea behind my patch, except you can get the array info without having an additional pointer.  The cost of lookup for the block info is mitigated by the cache, making lookups cost very little.
>> But the cache is volatile, and a cache entry could be easily flushed by doing many array append operations.  A single string processing function could replace the entire cache with entries for string temporaries, and cause nasty performance regressions in unrelated parts of the program (e.g. in places where you really don't want each append operation to reallocate the full array).
> 
> I'm not sure you understand the semantics.  When an array's block info is in the cache, no lookup of the block info from the GC is required, so the lock isn't taken, and the lookup is as simple as a linear search through 8 elements.  But when it is not in the cache, it is looked up via the GC, requiring a lock and a look through all the memory pools (I'm unsure of the performance, but it clearly is slower).  However, a cache miss does not mean an automatic reallocation, it just means the lookup of the block info is slower.  It can *still* be extended into the block.  Basically, the algorithm performance of appending with a cache miss is the same as a cache hit, just with a larger constant.

Sorry about that, I didn't have a close look at the patch. I guess I was more thinking about Andrei's original proposal (and how I thought it may be implemented).

It seems you store the length field inside the array's memory block (instead of the cache, which just speeds up gc_query). That's awesome, but I'm going to complain again: now you have to keep a length field for *all* memory allocations, not only arrays! For most object allocations, this means 4 bytes of additional overhead. Also, if you use GC.malloc directly and the user tries to append to slices of it, your code may break. GC.malloc doesn't seem to pad the memory block with a length field.

I must say that I find your argumentation strange: didn't you say adding an additional field to the slice struct is too much overhead?

Also, a solution which keeps the pointer to the array length in the slice struct would still be faster. The cache lookup cost is not zero.

Anyway, now that semantics and performance are somewhat sane, none of these remaining issues are too important. Thanks Steve and Andrei!

>> Worst of all, this behavior will be relatively nondeterministic. It's also hard to explain to "normal" programmers.
>>
>> Also, isn't it completely useless? If the programmer wants to append to an array, he'll use an append struct just to be sure, and if he doesn't want to append, inefficient array slice appending wouldn't be a problem.
> 
> Again, I think these points may be fostered by a misunderstanding of the patch.  The performance is much better than current performance, and people have had no problem dealing with the current runtime except in special circumstances.
> 
> -Steve
December 24, 2009
On Thu, 24 Dec 2009 13:39:21 -0500, grauzone <none@example.net> wrote:


> Sorry about that, I didn't have a close look at the patch. I guess I was more thinking about Andrei's original proposal (and how I thought it may be implemented).

No problem.

>
> It seems you store the length field inside the array's memory block (instead of the cache, which just speeds up gc_query). That's awesome, but I'm going to complain again: now you have to keep a length field for *all* memory allocations, not only arrays! For most object allocations, this means 4 bytes of additional overhead.

Interestingly enough, the storage overhead is zero except for memory blocks > 256 bytes.  I'll explain:

A probably not well-known piece of trivia is that D allocates 1 extra byte for arrays.  Why would it do this, you ask?  Because of the GC.

Imagine that it does not do this; what happens when you do something like this:

ubyte[] array = new ubyte[16]; // allocates a 16-byte block for the array
array = array[$..$];           // zero-length slice at the very end of the block

If you look at array's ptr member, it now no longer points to the allocated block, but the next block.  Although it isn't important for the GC to keep around the allocated block any more, it now will keep the next block from being collected.  In addition, if you tried appending to array, it might start using unallocated memory!

So the byte at the end was already in use.  I sort of commandeered it for length use.  For blocks up to and including 256 bytes, I use the last byte of the block as the length storage.  For blocks of 512 bytes up to a half page (2048 bytes), I use the last 2 bytes, so there is one extra overhead byte compared to the current implementation.

Blocks larger than that follow different rules: they are not stored in bins, but allocated a whole page at a time.  With those blocks, it is possible to extend the block by adding more pages if they are free, so it's not OK to store the length at the end of the block, since the end of the block may change.  So I store it at the beginning.  I use up a full 8 bytes, and the reason is alignment: I don't know what you're putting in the array, so I must keep the data 8-byte aligned.

For classes, I allocate the required extra data as if the class were an array of the class data size, and then set the "ghost" length at the end of the block to 0.  If the class data exceeds a half page, the ghost length is at the beginning, where the vtable ptr is, so it's extremely unlikely to accidentally match that length.  Note that it doesn't make a huge difference in most cases, because the block used for the class is a power of 2 anyway, so in most cases there's plenty of wasted space at the end.

I found out during testing that allocating a new struct is equivalent to allocating a new array of that struct of size 1 and returning its pointer, so that aspect is already covered.

> Also, if you use GC.malloc directly and the user tries to append to slices of it, your code may break. GC.malloc doesn't seem to pad the memory block with a length field.

Yes, this is a possible problem.  However, using GC.malloc directly and then treating the result as a normal array is probably extremely rare.  At least it is not valid for safe D.  It probably should be explained as a danger in GC.malloc, but I don't think this will adversely affect most code.  There will be functions that should obviate the need for calling GC.malloc to allocate an array (specifically, allocating uninitialized data).

> I must say that I find your argumentation strange: didn't you say adding an additional field to the slice struct is too much overhead?

Overhead when passing around, not overhead for storage.  For example, pushing 12 bytes onto the stack instead of 8 when calling a function with a slice.  If you want safe appends, you need to store the allocation length somewhere, there's no way around that.

> Also, a solution which keeps the pointer to the array length in the slice struct would still be faster. The cache lookup cost is not zero.

This is the trade-off I chose between performance when passing around slices and using them and performance for appending.  If you focus all your performance concerns on appending, then other areas suffer.  I don't know if what I chose will be the best solution, but I can't think of any way to have *both* speedy slice usage and near-zero append overhead.  If someone can think of a better solution, I'll be happy to incorporate it, but after using and understanding slices while developing stuff for Tango (Tango uses slices to get every ounce of performance!), I'm convinced that as-fast-as-possible slice semantics for passing around data is essential.

This is where a custom type that focuses performance on appending at the expense of other functions is good to have.  I think such applications where you need the absolute best append performance without any other functionality are pretty rare.

> Anyway, now that semantics and performance are somewhat sane, none of these remaining issues are too important. Thanks Steve and Andrei!

Cool!

-Steve
December 24, 2009
On Wed, 23 Dec 2009 09:54:20 -0500, grauzone <none@example.net> wrote:

> Andrei Alexandrescu wrote:
>> grauzone wrote:
>>> Steven Schveighoffer wrote:
>>>> All,
>>>>
>>>> I created a new patched druntime that prevents array stomping and at the same time increases append performance for thread-local array appending.
>>>>
>>>> The patch is attached to bugzilla entry 3637. ( http://d.puremagic.com/issues/show_bug.cgi?id=3637 )
>>>
>>> Nobody cares about it? (Except someone commenting in bugzilla, but where's Andrei and Walter?)
>>  The entire Phobos team discussed this change very thoroughly. We are all very excited about it.
>
> Did anyone try it and make initial tests?

I did (obviously).  I tried both a simple "append to N arrays in a loop" test and also the affinity test posted by dsimcha in an earlier post.  The new code beats the old runtime in both tests *and* performs better on the affinity test with multiple cores than it does with a single core.  Tests were done on linux.  I realize testing my own code doesn't mean much, but I think the patch is close to being assimilated, and everyone can try it out at that point.  I'd be very interested to see if others can find holes in the design.

I don't have the numbers in front of me right now, but next week, I'll try to post some.

AFAIK, only Sean tried to get it working, but had trouble.  I was mostly concerned with whether the patch *worked* on Windows and OSX, since I didn't have a Mac or a Windows box to try it out.  Since then, I've successfully built and tested on both OSX and Windows.

-Steve
December 24, 2009
Steven Schveighoffer, el 24 de diciembre a las 12:59 me escribiste:
> Leandro Lucarella Wrote:
> 
> > Steven Schveighoffer, el 24 de diciembre a las 10:15 me escribiste:
> > > This is absolutely false.  It does not make anything worse.  It may not make it as good as you want, but it definitely is better than the current situation.
> > 
> > I meant from the language semantics POV (you have more rules to remember about arrays), I agree it makes things better from a practical POV.
> 
> In fact, semantics are simpler.  You can completely eliminate the explanation about stomping in the spec.

What about the cache? You have to explain to users that there is
a possibility that their appends work blazing fast, except... well,
sometimes, if you fill the cache, they will suck. As grauzone said,
finding these kinds of weird behaviours will be very hard.

> > > > I don't comment much about the patch (and the design) because I already
> > > > did so in the several threads about it. I think is a very bad idea,
> > > > I think dynamic arrays should be a proper reference type (with ptr, length
> > > > and capacity) and slices should be a value type without appending (only
> > > > with ptr and length).
> > > 
> > > I don't agree.  I like that slices and arrays are the same type, it
> > > allows for much cleaner code, and much less function duplication.  I see
> > > the "append extends into a block" as an optimization, one that currently
> > > is invalid, and with the patch is valid.  Disallowing appending to
> > > slices means you must generate arrays whenever you want to do appending
> > > (which copies the entire slice), incurring a higher cost for little
> > > gain.  I'd actually say no gain.  Do you have an example where such
> > > a scheme is better than arrays with my patch?
> 
> No example?

No, I said that I think arrays/slices as they are have better performance (when you know exactly what you're doing, pre-allocate, and do those kinds of tricks). Unless you are asking for some other kind of example?

> > I said this several times but here it comes again: We agree that the current scheme is better in terms of performance than having a proper reference type dynamic array and a separated slice value type.
> > 
> > The thing is, I think optimization is nice as long as it doesn't leak the abstractions to the user. The current scheme is clearly this scenario, to make an optimization (one that problably a lot of people wouldn't need), you are making things harder for the user and creating a monster half array, half slice that is harder to understand than 2 smaller and simpler types.
> 
> Really?

Yes.

> You think splitting the features into 2 different types with several overlapping features is easier to explain?

Yes. And I think they have overlapping features just like any other
2 containers: hashes and arrays have overlapping features, but they are
completely different beasts. Same goes for arrays and slices.

PHP has arrays and hashes combined in one type. Do you think that's a good
idea because arrays and hashes have overlapping features? I really don't,
and I do think arrays and slices are 2 completely different beasts that
need to be separated in the language. I think that's the natural, good
thing to do, not an optimization. I think D is broken by combining these
2 types into one. You don't think that (and it looks like Andrei and
Walter don't think that either), so there's no way we would agree on this
topic.

> I think that is why Andrei nixed the T[new] idea, because when he tried to explain it in TDPL, it was very unnatural.

Well, I don't know what Andrei tried to do, but I think Python does pretty well with it, for example.

> > You think this performance gain is more important than the added complexity; I think the other way. You think D should be *by default* as fast as possible, even when making the programmer's life a little harder, and let the user create or use library types for easy use, and I think it should be the exact opposite: make *default* things easier and not as fast, and let the user create/use library types if they need top performance/optimizations.
> 
> I think the whole basis of your argument is invalid -- the current array/slice combination type is easier to use and explain *and* performs better.

I can say exactly the same about you. That's why I said I don't argue about this anymore (grrr! I don't know why I ended up doing it in this thread). This is a matter of taste: you think array/slice combined are easier to use and I don't. About performance, the only spot where splitting arrays and slices would perform worse than the current slice/arrays combined is that you add an extra indirection for arrays because of the reference semantics (but you need to pass around one word less than with slices, so I don't know what the real impact of that would be). Besides that, you can make both schemes as fast, I think. It all depends on how you use those types.

-- 
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
Y2K
<Aztech_> hmm, nothing major has happend, what an anticlimax
<CaPS> yeah
<CaPS> really sucks
<CaPS> I expected for Australia to sink into the sea or something
<CaPS> but nnoooooooo
December 24, 2009
Steven Schveighoffer:

>The new code beats the old runtime in both tests<

Are you testing Phobos or Tango? Because I think Tango may have a performance bug there that's missing in Phobos (another, even worse, performance bug that I think is present in Tango is in the isIn_r of the built-in associative arrays).

Bye,
bearophile
December 24, 2009
Leandro Lucarella wrote:
> Steven Schveighoffer, el 24 de diciembre a las 12:59 me escribiste:
>> Leandro Lucarella Wrote:
>>
>>> Steven Schveighoffer, el 24 de diciembre a las 10:15 me escribiste:
>>>> This is absolutely false.  It does not make anything worse.  It may not
>>>> make it as good as you want, but it definitely is better than the
>>>> current situation.
>>> I meant from the language semantics POV (you have more rules to remember
>>> about arrays), I agree it makes things better from a practical POV.
>> In fact, semantics are simpler.  You can completely eliminate the
>> explanation about stomping in the spec.
> 
> What about the cache? You have to explain to users that there is
> a possibility that their appends work blazing fast, except... well,
> sometimes, if you fill the cache, they will suck. As grauzone said,
> finding these kinds of weird behaviours will be very hard.

Fluctuations in the speed of an operation are commonplace and don't require explanation for the putative user. Caches exist in many places in today's computers, and I don't see how all of a sudden this one cache sticks out like a sore thumb.

>> You think splitting the features into 2 different types with
>> several overlapping features is easier to explain?
> 
> Yes. And I think they have overlapping features just like any other
> 2 containers, hashes and arrays have overlapping features, but they are
> completely different beasts. Same goes for arrays and slices.

The problem with T[new] vs. T[] was that it was very difficult to explain succinctly when and how each should be used. I had two tables in TDPL and they were for the most part identical - I had to highlight the subtle differences. It's a far cry from arrays vs. hashes.

> PHP has arrays and hashes combined in one type. Do you think that's a good
> idea because arrays and hashes have overlapping features? I really don't,
> and I do think arrays and slices are 2 completely different beasts that
> need to be separated in the language. I think that's the natural, good
> thing to do, not an optimization. I think D is broken by combining these
> 2 types into one. You don't think that (and it looks like Andrei and
> Walter don't think that either), so there's no way we would agree on this
> topic.

I also think PHP's was not a brilliant idea, but I also think there's no comparison. There could be a way of agreeing if you came up with a solid argument. For example, in the thread about dropping binding "this" to an rvalue, I had to acquiesce in the wake of good counterexamples. Also, in the thread on "typedef" there was a good level of consensus that the feature doesn't pull its weight.


Andrei