September 07, 2006
Sean Kelly wrote:
> xs0 wrote:
>> Sean Kelly wrote:
>>> xs0 wrote:
>>> Ah good point.  So any ideas?  Personally, I'd be happy if the default AA used a linked-list for chaining and opEquals for comparison so this wasn't an issue, but that seems a foregone conclusion.  I'm mostly concerned about providing default behavior that makes sense for the common case.  The problem I've run into is that I have objects which need to be used as keys in an AA but are not inherently sortable.
>>
>> Well, one possibly good solution would be to templatize AAs and remove opCmp from Object. That way, you'd get an error when using comparison on objects without opCmp (most other operators work that way) and the template could test for its presence and use different implementations for each.
> 
> Some fancier AA support would be nice, but I'm not sure how to add templates to the built-in syntax.  

I think xs0 was thinking of making AAs templates, and then the compiler would add syntactic sugar to that template to achieve current syntax.
So something like
char[float[]] a; would become something like AA!(char, float[]) a;

> I just noticed something confusing in the spec however.  Why are classes intended to be used as keys in an AA required to implement both opCmp *and* opEquals?  Since the AA is backed by a binary tree I would expect it determine 'equality' as equivalence using opCmp.

I would too.

> It's quite common for me to define classes that are equivalent but not equal, and also classes that can be compared for equality but not sorted.  But it sounds like both would not be compatible with D's built-in AA implementation.  

It is a rather big limitation.

> I'm beginning to feel that any possible performance advantage that can be offered from binary tree chaining over linked list chaining is overshadowed by the limitations it imposes on the use of AAs in general.  Can someone suggest an alternate default implementation for opCmp that solves these problems?

Yes: here is a suggestion: remove opCmp from Object. I think the only reason it is there is that when AAs where first implemented templates weren't where they are now so there was no way to check if an object has opCmp. These days a template version of AAs would be much better, and it would (if I'm not mistaken) remove the need for opCmp to be in Object.
September 08, 2006
Sean Kelly wrote:
> Some fancier AA support would be nice, but I'm not sure how to add templates to the built-in syntax.  I just noticed something confusing in the spec however.  Why are classes intended to be used as keys in an AA required to implement both opCmp *and* opEquals?  Since the AA is backed by a binary tree I would expect it determine 'equality' as equivalence using opCmp.  It's quite common for me to define classes that are equivalent but not equal, and also classes that can be compared for equality but not sorted.  But it sounds like both would not be compatible with D's built-in AA implementation.  I'm beginning to feel that any possible performance advantage that can be offered from binary tree chaining over linked list chaining is overshadowed by the limitations it imposes on the use of AAs in general.  Can someone suggest an alternate default implementation for opCmp that solves these problems?

Although the current AA implementation only uses opCmp, having both available means that an implementation could use both or either. Sometimes opEquals can be computed a lot more efficiently than opCmp.
The current implementation uses hashing with binary trees, but an implementation isn't required to do it that way.

Hashing with binary trees has proven in practice to be very efficient. It beats even C++ templated hash tables. I'm sure with some careful coding one can create a faster one customized to a particular data type, but the current general purpose one is pretty darned fast and compact. It's the central algorithm in DMDScript, for example, and DMDScript still beats the other javascript implementations around the block.
September 08, 2006
Ivan Senji wrote:
> Yes: here is a suggestion: remove opCmp from Object. I think the only reason it is there is that when AAs where first implemented templates weren't where they are now so there was no way to check if an object has opCmp. These days a template version of AAs would be much better, and it would (if I'm not mistaken) remove the need for opCmp to be in Object.

While it'd be fun to offer a templated version of AAs, I feel the core capabilities should be available to the user without needing templates. This is because many programmers are not comfortable with them.

Can you give an example of a class that could not have a meaningful opCmp implementation that one would want to put into an AA?
September 08, 2006
On Thu, 07 Sep 2006 14:54:41 -0700, Sean Kelly <sean@f4.ca> wrote:

>Since the AA is backed by a binary tree I would expect it determine 'equality' as equivalence using opCmp.

There may be special case operations on the AA where opCmp is unnecessary and opEquals is a little faster.

But... I had got the impression that the built-in associative arrays were done with hashtables. I think the manual says something about foreach not enumerating the items in order.

OTOH, I didn't see any kind of 'opHash' or similar.

So - binary trees, then? Red-black, perhaps?

Or are you talking about hash collision handling?

Or have I completely got the wrong idea about something?

>Can someone suggest an alternate default implementation for opCmp that solves these problems?

Don't know if it helps, but sometimes a consistent-but-arbitrary ordering can give good keys. For instance, the ordering of hash values is meaningless, but you can still use it to divide-and-conquer your data.

-- 
Remove 'wants' and 'nospam' from e-mail.
September 08, 2006
Walter Bright wrote:
> 
> Although the current AA implementation only uses opCmp, having both available means that an implementation could use both or either. Sometimes opEquals can be computed a lot more efficiently than opCmp.
> The current implementation uses hashing with binary trees, but an implementation isn't required to do it that way.

This is good to know.  I'd considered playing with the current implementation, but was concerned about altering spec-defined behavior.

> Hashing with binary trees has proven in practice to be very efficient. It beats even C++ templated hash tables. I'm sure with some careful coding one can create a faster one customized to a particular data type, but the current general purpose one is pretty darned fast and compact. It's the central algorithm in DMDScript, for example, and DMDScript still beats the other javascript implementations around the block.

I agree that it's got fantastic performance--my concerns here are really more practical, as I have objects I want to use as keys that don't contain any data which could be used for sorting purposes.  But it sounds like there's enough wiggle room in the spec to find some reasonable middle ground.  Thanks!


Sean
September 08, 2006
Walter Bright wrote:
> Ivan Senji wrote:
>> Yes: here is a suggestion: remove opCmp from Object. I think the only reason it is there is that when AAs where first implemented templates weren't where they are now so there was no way to check if an object has opCmp. These days a template version of AAs would be much better, and it would (if I'm not mistaken) remove the need for opCmp to be in Object.
> 
> While it'd be fun to offer a templated version of AAs, I feel the core capabilities should be available to the user without needing templates. This is because many programmers are not comfortable with them.
> 
> Can you give an example of a class that could not have a meaningful opCmp implementation that one would want to put into an AA?

The one I've been wrestling with is a Thread class, as it has no data that is static and unique for the life of the object.  The thread id is only guaranteed to be unique while the thread is active, and the id isn't even assigned until Thread.start is called.  Threads simply have no features which inherently support ordering, but it makes sense to compare two Thread objects for equality simply by comparing object addresses.


Sean
September 08, 2006
Steve Horne wrote:
> On Thu, 07 Sep 2006 14:54:41 -0700, Sean Kelly <sean@f4.ca> wrote:
> 
>> Since the AA is backed by a binary tree I would expect it determine 'equality' as equivalence using opCmp.
> 
> There may be special case operations on the AA where opCmp is
> unnecessary and opEquals is a little faster.
> 
> But... I had got the impression that the built-in associative arrays
> were done with hashtables. I think the manual says something about
> foreach not enumerating the items in order.
> 
> OTOH, I didn't see any kind of 'opHash' or similar.
> 
> So - binary trees, then? Red-black, perhaps?
> 
> Or are you talking about hash collision handling?

Traditional binary trees as used for collision handling in the DMD AA implementation.

>> Can someone suggest an alternate default implementation for opCmp that solves these problems?
> 
> Don't know if it helps, but sometimes a consistent-but-arbitrary
> ordering can give good keys. For instance, the ordering of hash values
> is meaningless, but you can still use it to divide-and-conquer your
> data.

I suppose the problem is how to do this without exposing an opCmp function to be used for more general sorting purposes.  The implementation I proposed on d.D.bugs (return this !is obj) works okay for binary trees, but as Ivan pointed out, it would be terrible for a sort operation.


Sean
September 08, 2006
Steve Horne wrote:

> On Thu, 07 Sep 2006 14:54:41 -0700, Sean Kelly <sean@f4.ca> wrote:
> 
>>Since the AA is backed
>>by a binary tree I would expect it determine 'equality' as equivalence
>>using opCmp.
> 
> There may be special case operations on the AA where opCmp is unnecessary and opEquals is a little faster.
> 
> But... I had got the impression that the built-in associative arrays were done with hashtables. I think the manual says something about foreach not enumerating the items in order.
> 
> OTOH, I didn't see any kind of 'opHash' or similar.
> 
> So - binary trees, then? Red-black, perhaps?
> 
> Or are you talking about hash collision handling?

The binary tree is only used for hash collision handling. It is not balanced in any way.

The advantage of having binary trees at the nodes is that with a poor hash function, the AA degrades gracefully into O(log n). This is probably good for a general purpose AA. The fact that the trees aren't balanced would mean that in theory, you could get a O(n) worst case lookup.

The disadvantages of binary tree nodes are:
- Less than optimal performance with a good hash function
- higher memory usage (but phobos counters this somewhat by using fewer
buckets, with up to an average of 4 collisions per bucket)
- the necessity of supplying an opCmp

You can get significantly better performance using a custom implementation (example: http://www.digitalmars.com/d/archives/digitalmars/D/30773.html). Unfortunately, without a reference return type, there is no way to give a custom AA the same semantics as the built in one.

> 
>>Can someone
>>suggest an alternate default implementation for opCmp that solves these
>>problems?
> 
> Don't know if it helps, but sometimes a consistent-but-arbitrary ordering can give good keys. For instance, the ordering of hash values is meaningless, but you can still use it to divide-and-conquer your data.

You would still need to handle the case of identical object hashes.

/Oskar
September 08, 2006
Oskar Linde wrote:
> 
> You would still need to handle the case of identical object hashes.

I've been thinking about this and am wondering if it even makes sense to use objects such as threads in an AA.  Address can't even really be used for the hash value because of moving GCs, so I'm forced to either store the hash value in Thread the first time it's computed or to find some other hash to use.  It may simply make more sense to store them in an unsorted array.  Note that this is for a ThreadGroup so there will likely never be thousands of objects stored that need to be searched through.


Sean
September 08, 2006
Walter Bright wrote:
> Ivan Senji wrote:
>> Yes: here is a suggestion: remove opCmp from Object. I think the only reason it is there is that when AAs where first implemented templates weren't where they are now so there was no way to check if an object has opCmp. These days a template version of AAs would be much better, and it would (if I'm not mistaken) remove the need for opCmp to be in Object.
> 
> While it'd be fun to offer a templated version of AAs, I feel the core capabilities should be available to the user without needing templates. This is because many programmers are not comfortable with them.

I didn't mean change int[char[]] to AA!(int, char[]),
I meant it would be nice to change the implementation to a template, without changing the syntax.

I guess this is one thing in D I still can't get over the C++ way.
(a couple of years back I thought iostream and <<, >> where the best :)
But these AAs and their requirements are bugging me: Why do I have to (in my class ABC) declare opCmp to be int opCmp(Object o)?
IMO that sucks because when used in an AA it's not like there is going to be objects of different types as keys.

And that is a textbook example of when a template should be used.


> 
> Can you give an example of a class that could not have a meaningful opCmp implementation that one would want to put into an AA?

Sean Kelly gave a good example.

But I have nothing against AAs requiring opCmp, opEquals, toHash or what ever it needs, but would prefer for them not to be in Object.