Jump to page: 1 2
Thread overview
Robustness of the built in associative array
Mar 24, 2006
Oskar Linde
Mar 24, 2006
Nick
Mar 24, 2006
Sean Kelly
Mar 24, 2006
Oskar Linde
Mar 25, 2006
Nick
Mar 28, 2006
Wolfgang Draxinger
Mar 28, 2006
Wolfgang Draxinger
Mar 25, 2006
Nick
Mar 26, 2006
Dave
Mar 27, 2006
Nick
Mar 27, 2006
Dave
Mar 27, 2006
Bruno Medeiros
Mar 28, 2006
Hong
Mar 24, 2006
Oskar Linde
Mar 26, 2006
Bruno Medeiros
Mar 28, 2006
Walter Bright
March 24, 2006
Hello,

The latest "Implicit fn template instantiation" thread contains code that contains a hidden and possibly hard to find bug. Rewriting the code with a non-templated version for clarity:

void increment(int[char[]] map, char[] key) {
	if (auto t = (key in map))
		(*t)++;
	else
		map[key] = 1;
}

To understand the code, one first has to know if the associative array is a value type or a reference type.

It turns out it is neither. It is definitely not a value type. The above function will in most cases alter the original associative array and work as expected. But it is not a reference type either -- in the few cases where adding a new key to the AA invokes a rehash, all hell breaks loose. The original associative array is suddenly *corrupted*! (The way to make the function work as expected is to make the map an inout argument, but that is beside my point.)

Internally, an AA is implemented as an aaA*[] -- a dynamic array of pointers to aaA-nodes. The first node of the array contains metadata (the total number of keys), and the remaining [1..$] nodes contains the hash buckets. When passing an aa to a function, the aaA*[] gets passed. I assume the reason for this implementations is efficiency. All you need to do an efficient lookup is the array of buckets. But is this implementation comes at a cost of robustness. To make use of AA's, you have to follow a special rule:

- Never add an element to or rehash the AA unless you know you hold the only reference.

This means that even though the AA behaves as a reference type, the data can not be shared unless all references are readonly. I feel I must see this bastardization of being neither value or reference type a design mistake.

A simple fix is to change the AA implementation so that AAs become MyAA*  instead of aaA*[], with MyAA defined as:

struct MyAA {
	AANode[] buckets;
	uint numberOfNodes;
}

You get a very slight additional lookup cost (double dereferencing to get to the buckets), but make the AA a pure reference type without any caveats and problems.

My questions are:

1) Why is the D built in array type designed in such a fragile way? Should D not be robust language? Do users have to resort to writing library wrappers around D's types to get robustness? If so:

2) Why even keep the builtin AA? With the addition of an opIn() overload, you could get *exactly* the same functionality from a library implementation:

Map!(char[],int) map;

rather than:

int[char[]] map;

Cons:
 - This slight syntax difference.

Pros:
 - The library version would be transparently replaceable by other implementations when the data or other conditions have different algorithmic requirements.
 - The library implementation could easily be more efficient than the current built in version by taking advantage of templates.
 - The library implementation could without problems be made to work with types that lack opCompare.
 - The library version would not have any strange quirks and gotchas. (Dangerous usage cases like above)
 - Greater library/language decoupling.
 - The possibility of implementing an UnsafeMap type with high performance but less safety for the few cases where the performance really matters.

In my opinion, D's lack of protection (const arrays, etc) and robustness against programming mistakes is D's greatest weakness.

My contribution to the slogan thread:

"D - Programming without a lifeline" :)

/Oskar
March 24, 2006
In article <e00v0m$te2$3@digitaldaemon.com>, Oskar Linde says...
>
>Hello,
>
>The latest "Implicit fn template instantiation" thread contains code
>that contains a hidden and possibly hard to find bug.
>[...]

Good catch!

>2) Why even keep the builtin AA? With the addition of an opIn() overload, you could get *exactly* the same functionality from a library implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some syntax that is currently _impossible_ to duplicate in custom types. Eg. the following:

#  int[int] ii;
#  ii[10]++;

I find it rather confusing and frustrating. In my current project I ended up making my own template AA to get around some of the deficiencies of the builtin type. Mostly I wanted to avoid double lookups, and easily allow custom hashers without encapsulating the key in a struct/class (think case-insensitive char[] keys.) The result was also faster and more memory efficient than the builtin AA.

Nick


March 24, 2006
Nick wrote:
> In article <e00v0m$te2$3@digitaldaemon.com>, Oskar Linde says...
>> Hello,
>>
>> The latest "Implicit fn template instantiation" thread contains code that contains a hidden and possibly hard to find bug.
>> [...]
> 
> Good catch!

Indeed.  It seems roughly similar to how arrays are handled: in place modifications persist, but anything causing a realloc does not.  But the original should never be rendered unusable.

>> 2) Why even keep the builtin AA? With the addition of an opIn() overload, you could get *exactly* the same functionality from a library implementation
> 
> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some syntax
> that is currently _impossible_ to duplicate in custom types. Eg. the following:
> 
> #  int[int] ii;
> #  ii[10]++;

Yup, not having a bona fide reference type in D is problematic.  Though you could always use pointers :-p

> I find it rather confusing and frustrating. In my current project I ended up
> making my own template AA to get around some of the deficiencies of the builtin
> type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
> without encapsulating the key in a struct/class (think case-insensitive char[]
> keys.) The result was also faster and more memory efficient than the builtin AA.

This would also *almost* allow us to get rid of opCmp(Object) and opEquals(Object) in the Object base class, which would be very nice.  I definately do like having built in AAs, but I do find some of the consequences irritating.


Sean
March 24, 2006
Nick wrote:
> In article <e00v0m$te2$3@digitaldaemon.com>, Oskar Linde says...
>> Hello,
>>
>> The latest "Implicit fn template instantiation" thread contains code that contains a hidden and possibly hard to find bug.
>> [...]
> 
> Good catch!
> 
>> 2) Why even keep the builtin AA? With the addition of an opIn() overload, you could get *exactly* the same functionality from a library implementation
> 
> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some syntax
> that is currently _impossible_ to duplicate in custom types. Eg. the following:
> 
> #  int[int] ii;
> #  ii[10]++;

You correct. I don't know how I could have forgotten that. D's lack of a reference return type has implications on properties too.

> I find it rather confusing and frustrating. In my current project I ended up
> making my own template AA to get around some of the deficiencies of the builtin
> type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
> without encapsulating the key in a struct/class (think case-insensitive char[]
> keys.) The result was also faster and more memory efficient than the builtin AA.

I have similar experiences implementing a custom templated hash-table.

/Oskar
March 24, 2006
Sean Kelly wrote:
> Nick wrote:
>> In article <e00v0m$te2$3@digitaldaemon.com>, Oskar Linde says...

>>> 2) Why even keep the builtin AA? With the addition of an opIn() overload, you could get *exactly* the same functionality from a library implementation
>>
>> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some syntax
>> that is currently _impossible_ to duplicate in custom types. Eg. the following:
>>
>> #  int[int] ii;
>> #  ii[10]++;
> 
> Yup, not having a bona fide reference type in D is problematic.  Though you could always use pointers :-p

Consider the consequences of using a pointer in the above example. ;)

>> I find it rather confusing and frustrating. In my current project I ended up
>> making my own template AA to get around some of the deficiencies of the builtin
>> type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
>> without encapsulating the key in a struct/class (think case-insensitive char[]
>> keys.) The result was also faster and more memory efficient than the builtin AA.
> 
> This would also *almost* allow us to get rid of opCmp(Object) and opEquals(Object) in the Object base class, which would be very nice.  I definately do like having built in AAs, but I do find some of the consequences irritating.

Is there any way to retain the syntax and ease of use of the built in AA with a library implementation allowing the flexibility of choosing:
- implementation
- hash function
- comparison function
?

/Oskar
March 25, 2006
In article <e01a21$te2$5@digitaldaemon.com>, Oskar Linde says...
>
>Is there any way to retain the syntax and ease of use of the built in AA
>with a library implementation allowing the flexibility of choosing:
>- implementation
>- hash function
>- comparison function
>?

I think the STL containers do a very good job at being both flexible and usable, covering most normal cases of use. The thing missing from D to make a STL-like library would be the reference return type.

Some of the syntax would still be impossible to duplicate, eg.

int[int] ii;
ii[1]++; // This inserts the key '1'
int i = ii[2]; // This does NOT insert '2'

but I always thought this was the wrong way to do it anyway. (Eg. what should happen when the value is a struct and you use ii[3].foo?)

Nick


March 25, 2006
In article <e0169e$25hq$1@digitaldaemon.com>, Sean Kelly says...
>
>Indeed.  It seems roughly similar to how arrays are handled: in place modifications persist, but anything causing a realloc does not.  But the original should never be rendered unusable.

True. Except that adding an element to an array (when no realloc occurs) does not change the original either.

>This would also *almost* allow us to get rid of opCmp(Object) and opEquals(Object) in the Object base class, which would be very nice.  I definately do like having built in AAs, but I do find some of the consequences irritating.

Sort of OT, but I've studied the builtin AA code and have come to the conclusion that opCmp() isn't needed NOW either. In the binary tree implementation that Walter used, he first compares the hash values, and only if those matches he calls opCmp(). Since actual hash collisions should be pretty uncommon(*), opCmp is almost never called(**). In the rare case of a hash collision, one could make a special case and eg. always place it in the left node.

(*) It is uncommon for decent hash functions. It seems the builtin AA is optimized to also work with very bad hashers (in which case it approaches a pure binary tree.) I guess this isn't a bad feature for a generic AA, but it is also some of the reason why my custom hash map outperformed it (I didn't use btrees at all.)

(**) Just did a test with an english dictionary of 173528 words and the standard
D string hasher. opCmp() was called 1840 times, slightly over 1%.

Nick


March 26, 2006
In article <e0360o$28ep$1@digitaldaemon.com>, Nick says...
>
>(*) It is uncommon for decent hash functions. It seems the builtin AA is optimized to also work with very bad hashers (in which case it approaches a pure binary tree.) I guess this isn't a bad feature for a generic AA, but it is also some of the reason why my custom hash map outperformed it (I didn't use btrees at all.)
>

Just curious - what portion of the perf. diff. was that? What data types? And in your estimation what accounts for the rest of the diff. (e.g. are you pre-allocating a bunch of space for the data (as opposed to the key) or not using the GC)?

Thanks,

- Dave


March 26, 2006
Oskar Linde wrote:
> Hello,
> 
> The latest "Implicit fn template instantiation" thread contains code that contains a hidden and possibly hard to find bug. Rewriting the code with a non-templated version for clarity:
> 
> void increment(int[char[]] map, char[] key) {
>     if (auto t = (key in map))
>         (*t)++;
>     else
>         map[key] = 1;
> }
> 
> To understand the code, one first has to know if the associative array is a value type or a reference type.
> 
> It turns out it is neither. It is definitely not a value type. The above function will in most cases alter the original associative array and work as expected. But it is not a reference type either -- in the few cases where adding a new key to the AA invokes a rehash, all hell breaks loose. The original associative array is suddenly *corrupted*! (The way to make the function work as expected is to make the map an inout argument, but that is beside my point.)
> 
> Internally, an AA is implemented as an aaA*[] -- a dynamic array of pointers to aaA-nodes. The first node of the array contains metadata (the total number of keys), and the remaining [1..$] nodes contains the hash buckets. When passing an aa to a function, the aaA*[] gets passed. I assume the reason for this implementations is efficiency. All you need to do an efficient lookup is the array of buckets. But is this implementation comes at a cost of robustness. To make use of AA's, you have to follow a special rule:
> 
> - Never add an element to or rehash the AA unless you know you hold the only reference.
> 
> This means that even though the AA behaves as a reference type, the data can not be shared unless all references are readonly. I feel I must see this bastardization of being neither value or reference type a design mistake.
> 

Oskar, thanks for pointing that out. That was one issue that I hadn't noticed yet, and I agree wholeheartedly that is a quite a design problem.

As a side note: I would prefer saying that these are more like broken reference types, rather than saying they are not reference types, since that may not be the best way to describe it (not being reference or value types is more the nature of the problem of static arrays).
Anyway, in any case you description of the problem is quite clear.

Also, this illustrates another issue, which is the tricky reference semantics of dynamic arrays. Dynamic arrays are reference types, but they have a "catch", in that unlike all other reference types, there is an operation other than assignment that changes the identity of the array. It is the length change operation. But what is *really bad*, is that it is not "deterministic"[1] whether the resulting array from a length change will a brand new one or not. The identity (or perhaps better said, the contents) can be all new, or partially the same.

[1] "deterministic" is not the 100% correct word (as it is more the opposite of random), but I'm missing a better term.

> A simple fix is to change the AA implementation so that AAs become MyAA*  instead of aaA*[], with MyAA defined as:
> 
> struct MyAA {
>     AANode[] buckets;
>     uint numberOfNodes;
> }
> 
> You get a very slight additional lookup cost (double dereferencing to get to the buckets), but make the AA a pure reference type without any caveats and problems.
> 
> My questions are:
> 
> 1) Why is the D built in array type designed in such a fragile way? Should D not be robust language? Do users have to resort to writing library wrappers around D's types to get robustness? If so:
> 

I'd like to know that too. One more problem to the list, and yet people are already thinking about 1.0 and D slogans and whatnot. :/


-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
March 27, 2006
Sean Kelly wrote:
>>> 2) Why even keep the builtin AA? With the addition of an opIn() overload, you could get *exactly* the same functionality from a library implementation
>>
>> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some syntax
>> that is currently _impossible_ to duplicate in custom types. Eg. the following:
>>
>> #  int[int] ii;
>> #  ii[10]++;
> 
> Yup, not having a bona fide reference type in D is problematic.  Though you could always use pointers :-p
> 
WHOOOA there. "Reference type" is a pretty overloaded term that means two distinct things, which I hope you already know:
* Reference types (as in vs. value types)
* Reference types or references (as in C++ references)
I don't know any unambiguous terminology for these two terms, there may not even be one (correct me if wrong). However, nowadays, most of the time when one says "reference types" it means the first thing (me inclusive). So please, when you speak of the C++ references please disambiguate the term. (usually called just "references", although this is still very close to the other term)

As for you point. Yes, this demonstrates a good usage scenario for a "C++ reference". Should there be such a feature in D? Or can this problem be solved in an alternate way? Is the added complexity of this feature worth it? I don't know I cannot come to a conclusion. :/


-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
« First   ‹ Prev
1 2