November 12, 2012
On Mon, Nov 12, 2012 at 08:27:51PM +0400, Dmitry Olshansky wrote:
> 11/12/2012 6:43 PM, monarch_dodra wrote:
> >D has a built in hashed associative container, which is great.
> >
> 
> 
> No. The literals are nice though.
> 
> >I've noticed though it has no built-in "hashed container" for when you just need to store unique elements, but without any specific data associated (lists of names, for example).
> 
> There is no need for built-in containers. If anything, there is a plan to move the hash out of the built-ins. The fact that it has utterly failed to happen so far doesn't prove anything, as it still has to become a library type.
[...]

I'm still hoping to get back to moving AA into the library one day. There are a lot of open bugs with the current AA implementation that stem from the fact that it's in two different places, and the core AA code has no access to value types.


T

-- 
Klein bottle for rent ... inquire within. -- Stephen Mulraney
November 12, 2012
On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
>> The current "workaround" is to just dummy it.
>> alias bool Dummy;
>> Dummy[string] names;
>> names["Paul"] = true; //Ew.
>
> Make a wrapper?

The problem with wrapped versions of associative arrays is that they just don't scale with the number of keys.  If I want to store (say) a set of 3 million size_t's, then it costs a lot more than 3_000_000 * size_t.sizeof() to do so this way.

Do dedicated set implementations do better than this?

I should say that my own interests in an implementation of set are twofold -- first, efficient storage; and second, being able to do an efficient foreach across the stored values, without any concern for order.

> Or rather call them sets and have them in the library.

The whole builtin-vs.-library thing seems a big distraction -- if something needs to be implemented, the library seems the natural place unless there's a very good reason otherwise.
November 12, 2012
11/12/2012 8:52 PM, Joseph Rushton Wakeling wrote:
> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
>>> The current "workaround" is to just dummy it.
>>> alias bool Dummy;
>>> Dummy[string] names;
>>> names["Paul"] = true; //Ew.
>>
>> Make a wrapper?
>
> The problem with wrapped versions of associative arrays is that they
> just don't scale with number of keys.  If I want to store (say) a set of
> 3 million size_t's, then it costs a lot more than 3_000_000 *
> size_t.sizeof() to do so this way.
>
> Do dedicated set implementations do better than this?

Yes, sure they can, starting with e.g. a bitset.
If the distribution is sparse, then inversion lists can do wonders. They work especially well if your distribution has contiguous intervals, though searching is O(log N).

I use one for unicode character sets and it saves a LOT of space.
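To sketch the idea (illustrative code only, not the actual Unicode implementation; the type and member names are made up): an inversion list stores the sorted boundaries of half-open intervals, and membership is a binary search plus a parity check on the resulting index:

```d
import std.range : assumeSorted;

// Boundaries of half-open intervals: [b[0] .. b[1]), [b[2] .. b[3]), ...
struct InversionList
{
    uint[] bounds; // strictly increasing

    bool opBinaryRight(string op : "in")(uint x) const
    {
        // Count boundaries <= x; an odd count means x is inside an interval.
        immutable idx = assumeSorted(bounds).lowerBound(x + 1).length;
        return (idx & 1) == 1;
    }
}

void main()
{
    auto set = InversionList([10, 20, 40, 45]); // {10..19} and {40..44}
    assert(15 in set && 40 in set);
    assert(!(25 in set) && !(45 in set));
}
```

Two words per stored interval, regardless of how many elements the interval covers.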

Otherwise, if it's sparse but doesn't have contiguous intervals, it seems entirely doable to combine the two in a universal data structure for integral sets: say, a sorted array of 32/64-element bitsets (depending on architecture):

// denotes a small bit set covering the interval [start .. start + 32/64)
struct Node
{
	size_t start;
	size_t bits;
}

Should be quite compact & fast.
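To illustrate (a rough sketch with made-up names, not a real library API): lookup in such a structure is a binary search for the covering chunk, followed by a bit test:

```d
import std.range : assumeSorted;

struct Node
{
    size_t start; // covers keys in [start .. start + width)
    size_t bits;
}

enum width = size_t.sizeof * 8; // 32 or 64 depending on architecture

bool contains(const(Node)[] nodes, size_t key)
{
    immutable base = key - (key % width);
    // nodes must be kept sorted by start; find the chunk covering key
    auto hit = assumeSorted!((a, b) => a.start < b.start)(nodes)
               .equalRange(Node(base, 0));
    return !hit.empty && (hit.front.bits >> (key % width)) & 1;
}

void main()
{
    auto set = [Node(0, 0b1010), Node(width, 1)]; // {1, 3, width}
    assert(contains(set, 1) && contains(set, 3));
    assert(!contains(set, 2) && !contains(set, 5));
    assert(contains(set, width) && !contains(set, 2 * width));
}
```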

>
> I should say that my own interests in an implementation of set are
> twofold -- first, efficient storage; and second, being able to do an
> efficient foreach across the stored values, without any concern for order.
>
>> Or rather call them sets and have them in the library.
>
> The whole builtin-vs.-library thing seems a big distraction -- if
> something needs to be implemented, the library seems the natural place
> unless there's a very good reason otherwise.


-- 
Dmitry Olshansky
November 12, 2012
On Mon, Nov 12, 2012 at 05:52:32PM +0100, Joseph Rushton Wakeling wrote:
> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
> >>The current "workaround" is to just dummy it.
> >>alias bool Dummy;
> >>Dummy[string] names;
> >>names["Paul"] = true; //Ew.
> >
> >Make a wrapper?
> 
> The problem with wrapped versions of associative arrays is that they just don't scale with number of keys.  If I want to store (say) a set of 3 million size_t's, then it costs a lot more than 3_000_000 * size_t.sizeof() to do so this way.
>
> Do dedicated set implementations do better than this?

Probably.

For storing integer types, you could optimize for space by dividing your key space into blocks of, say, 32 entries each. Then your hash function will hash only the upper (n-5) bits of the key, and each hash slot will be a bit array of 32 bits, which are indexed by the lower 5 bits of the key. Assuming your keys are relatively densely populated, this will save quite a significant amount of space. In the sparse case, you probably don't waste that much space either, since each slot is only one 32-bit int wide.
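A minimal sketch of that scheme (illustrative names; a built-in AA stands in for the custom hash table, so it dodges the collision handling a real implementation would need):

```d
// The upper (n-5) bits of the key select a slot; the low 5 bits select
// a bit inside that slot's 32-bit word.
struct BitSlotSet
{
    uint[size_t] slots; // toy stand-in for the hash table of slots

    void insert(size_t key)
    {
        immutable slot = key >> 5;
        if (auto p = slot in slots)
            *p |= 1u << (key & 31);
        else
            slots[slot] = 1u << (key & 31);
    }

    bool opBinaryRight(string op : "in")(size_t key) const
    {
        auto p = (key >> 5) in slots;
        return p !is null && (*p >> (key & 31)) & 1;
    }
}

void main()
{
    BitSlotSet s;
    s.insert(3);
    s.insert(4);
    s.insert(64);
    assert(3 in s && 4 in s && 64 in s);
    assert(!(5 in s) && !(65 in s));
    // 3 and 4 share one 32-bit slot, so dense keys cost about a bit each.
}
```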

You can probably also increase the size of the slots to, say, 64 bits or more, if you want to maximize the ratio of payload to GC allocation-block overhead. Yes, each slot is only 1 int wide, but the GC does have to maintain bookkeeping information that's probably a lot more than that. Allocating a large number of very small blocks is generally worse than a smaller number of medium-sized blocks.

If your keys are *very* dense, then storing ranges instead of individual elements is the way to go (though the implementation will be significantly more complex).

For storing sets of strings, though, you'd want something like a trie instead.

It all depends on your specific application, so a per-application implementation might not be out of place.


> I should say that my own interests in an implementation of set are twofold -- first, efficient storage; and second, being able to do an efficient foreach across the stored values, without any concern for order.

With the hash + bit-array approach above, foreach would be very simple (just walk the hash table slots and enumerate the bits in each slot).
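Something like this sketch (again with a plain AA standing in for the hash table, and the same slot layout assumed as above):

```d
import core.bitop : bsf; // bit scan forward: index of the lowest set bit

// Walk each slot, then each set bit within the slot's 32-bit word.
void eachElement(uint[size_t] slots, scope void delegate(size_t) dg)
{
    foreach (hi, word; slots)
        for (uint w = word; w != 0; w &= w - 1) // clear lowest set bit
            dg((hi << 5) | bsf(w));
}

void main()
{
    import std.algorithm : sort;

    uint[size_t] slots;
    slots[0] = 0b1010; // elements 1 and 3
    slots[2] = 1;      // element 64

    size_t[] seen;
    eachElement(slots, (size_t e) { seen ~= e; });
    sort(seen); // AA iteration order is unspecified
    assert(seen == [1, 3, 64]);
}
```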


> >Or rather call them sets and have them in the library.
> 
> The whole builtin-vs.-library thing seems a big distraction -- if something needs to be implemented, the library seems the natural place unless there's a very good reason otherwise.

Yeah, there's definitely an advantage to minimizing the built-in stuff and keeping it in the library. It keeps the language relatively clean and simple while still providing convenient library types for common tasks.


T

-- 
Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.
November 12, 2012
11/12/2012 9:44 PM, H. S. Teoh wrote:
> On Mon, Nov 12, 2012 at 05:52:32PM +0100, Joseph Rushton Wakeling wrote:
>> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
>>>> The current "workaround" is to just dummy it.
>>>> alias bool Dummy;
>>>> Dummy[string] names;
>>>> names["Paul"] = true; //Ew.
>>>
>>> Make a wrapper?
>>
>> The problem with wrapped versions of associative arrays is that they
>> just don't scale with number of keys.  If I want to store (say) a
>> set of 3 million size_t's, then it costs a lot more than 3_000_000 *
>> size_t.sizeof() to do so this way.
>>
>> Do dedicated set implementations do better than this?
>
> Probably.
>
> For storing integer types, you could optimize for space by dividing your
> key space into blocks of, say, 32 entries each. Then your hash function
> will hash only the upper (n-5) bits of the key, and each hash slot will
> be a bit array of 32 bits, which are indexed by the lower 5 bits of the
> key. Assuming your keys are relatively densely populated, this will save
> quite a significant amount of space. In the sparse case, you probably
> don't waste that much space either, since each slot is only one 32-bit
> int wide.

The problem is that keys will collide, so each slot will have to hold the upper bits of the key *and* the small bit set. The bucket itself still sits inside an intrusive linked list.

Otherwise the idea is great. About 3 words of overhead per word-sized set.

>
> You can probably also increase the size of the slots to, say, 64 bits or
> more, if you want to maximize the usage to GC allocation block overhead
> ratio. Yes, each slot is only 1 int wide, but the GC does have to
> maintain bookkeeping information that's probably a lot more than that.

16 bytes. Though you'd have to know what a bucket entry looks like, which is sadly impossible with the built-in hash map.

> Allocating a large number of very small blocks is generally worse than a
> smaller number of medium-sized blocks.
>
[snip other good points]


-- 
Dmitry Olshansky
November 12, 2012
On Mon, Nov 12, 2012 at 09:58:14PM +0400, Dmitry Olshansky wrote:
> 11/12/2012 9:44 PM, H. S. Teoh wrote:
[...]
> >For storing integer types, you could optimize for space by dividing your key space into blocks of, say, 32 entries each. Then your hash function will hash only the upper (n-5) bits of the key, and each hash slot will be a bit array of 32 bits, which are indexed by the lower 5 bits of the key. Assuming your keys are relatively densely populated, this will save quite a significant amount of space. In the sparse case, you probably don't waste that much space either, since each slot is only one 32-bit int wide.
> 
> Problem is that keys will collide so each slot will have to have upper bits of key *and* the small set. The bucket itself is still inside intrusive linked list.

True.


> Otherwise the idea is great. About 3 words of overhead per word-sized set.
> 
> >
> >You can probably also increase the size of the slots to, say, 64 bits or more, if you want to maximize the usage to GC allocation block overhead ratio. Yes, each slot is only 1 int wide, but the GC does have to maintain bookkeeping information that's probably a lot more than that.
> 
> 16 bytes. Though you'd have to know how bucket entry looks like which is impossible with built-in hash map sadly.
[...]

Actually, the built-in AA's bucket struct is duplicated in object_.d; it's what the range-based API uses to walk the AA. (Yeah it's extremely ugly. It must be cleaned up.)


T

-- 
There is no gravity. The earth sucks.
November 12, 2012
On Monday, November 12, 2012 15:43:57 monarch_dodra wrote:
> Thoughts?

For now, if you need a set, then use RedBlackTree. Long term, a hash set should be added to std.container. As cool as it is to have AAs built into the language, they have been a disaster on a number of levels, and they buy you very little over having a library type. So, I see no reason to complicate the language further by trying to add a hash set to it (especially considering all of the issues with AAs).
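For reference, using RedBlackTree as a set looks like this (the names inserted are just placeholders):

```d
import std.container; // RedBlackTree and the redBlackTree helper

void main()
{
    auto names = redBlackTree!string("Paul", "Walter");
    names.insert("Andrei");
    assert("Paul" in names);
    assert("Jonathan" !in names);

    names.removeKey("Paul");
    assert("Paul" !in names);
    assert(names.length == 2);

    // foreach visits elements in sorted order; no dummy values needed
    foreach (name; names) { /* ... */ }
}
```

Operations are O(log n) rather than the O(1) expected of a hash set, but it works today and needs no dummy value type.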

- Jonathan M Davis
November 12, 2012
On 2012-11-12 16:10, Andrej Mitrovic wrote:
> On 11/12/12, monarch_dodra <monarchdodra@gmail.com> wrote:
>> But of course, a library solution would be good too.
>
> dcollections has sets http://www.dsource.org/projects/dcollections
> (not sure if it still compiles though)
>

And Tango:

https://github.com/SiegeLord/Tango-D2
http://dsource.org/projects/tango/docs/current

-- 
/Jacob Carlborg
November 12, 2012
On Mon, Nov 12, 2012 at 08:07:08PM +0100, Jonathan M Davis wrote:
> On Monday, November 12, 2012 15:43:57 monarch_dodra wrote:
> > Thoughts?
> 
> For now, if you need a set, then use RedBlackTree. Long term, a hash set should be added to std.container. As cool as it is to have AAs built into the language, they have been a disaster on a number of levels, and they buy you very little over having a library type. So, I see no reason to complicate the language further by trying to add a hash set to it (especially considering all of the issues with AAs).
[...]

I contend that the problem with built-in AA's is their implementation, not the fact that they're built-in.


T

-- 
Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.
November 12, 2012
On Monday, November 12, 2012 11:36:38 H. S. Teoh wrote:
> I contend that the problem with built-in AA's is their implementation, not the fact that they're built-in.

Oh, I agree, but we, as a group, have obviously failed to handle the implementation of the built-in AAs properly, and I think the indications are definitely that it's harder to get the built-in AAs right than it would be to create a library solution.

At this point, AAs are part of the language and need to be fixed, but I see no reason to add anything else like them to the language. It's just not worth the trouble.

- Jonathan M Davis