View mode: basic / threaded / horizontal-split · Log in · Help
November 12, 2012
Re: hashed array?
On Mon, Nov 12, 2012 at 08:27:51PM +0400, Dmitry Olshansky wrote:
> 11/12/2012 6:43 PM, monarch_dodra пишет:
> >D has a built in hashed associative container, which is great.
> >
> 
> 
> No. The literals are nice though.
> 
> >I've noticed though it has no built-in "hashed container" for when
> >you just need to store unique elements, but without any specific data
> >associated (lists of names, for example).
> 
> There is no need for built-in containers. If anything there is plan
> to move hash out of built-ins. The fact that it utterly failed to
> happen doesn't prove anything as it still has to be the library
> type.
[...]

I'm still hoping to get back to moving AA into the library one day.
There are a lot of open bugs with the current AA implementation that
stem from the fact that it's in two different places, and the core AA
code has no access to value types.


T

-- 
Klein bottle for rent ... inquire within. -- Stephen Mulraney
November 12, 2012
Re: hashed array?
On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
>> The current "workaround" is to just dummy it.
>> alias bool Dummy;
>> Dummy[string] names;
>> names["Paul"] = true; //Ew.
>
> Make a wrapper?

The problem with wrapped versions of associative arrays is that they just don't 
scale with number of keys.  If I want to store (say) a set of 3 million 
size_t's, then it costs a lot more than 3_000_000 * size_t.sizeof() to do so 
this way.

Do dedicated set implementations do better than this?

I should say that my own interests in an implementation of set are twofold -- 
first, efficient storage; and second, being able to do an efficient foreach 
across the stored values, without any concern for order.

> Or rather call them sets and have them in the library.

The whole builtin-vs.-library thing seems a big distraction -- if something 
needs to be implemented, the library seems the natural place unless there's a 
very good reason otherwise.
November 12, 2012
Re: hashed array?
11/12/2012 8:52 PM, Joseph Rushton Wakeling пишет:
> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
>>> The current "workaround" is to just dummy it.
>>> alias bool Dummy;
>>> Dummy[string] names;
>>> names["Paul"] = true; //Ew.
>>
>> Make a wrapper?
>
> The problem with wrapped versions of associative arrays is that they
> just don't scale with number of keys.  If I want to store (say) a set of
> 3 million size_t's, then it costs a lot more than 3_000_000 *
> size_t.sizeof() to do so this way.
>
> Do dedicated set implementations do better than this?

Yes, sure they can. Starting with e.g. a bitset.
If the distribution is sparce then inversion lists can do wonders. They 
work especially well if your distribution has contiguous intervals, 
searching would be logN though.

I use one for unicode character sets and it saves a LOT of space.

Otherwise if it's sparce but doesn't have contiguous interval it seems 
entirely doable to combine the two in a universal datastructure for 
integral sets. Say a sorted array of 32/64-element bitsets (depending on 
architecture):

//denotes a small bit set covering interval of [start..start+32/64)
struct node{
	size_t start;
	size_t bits;
};

Should be quite compact & fast.

>
> I should say that my own interests in an implementation of set are
> twofold -- first, efficient storage; and second, being able to do an
> efficient foreach across the stored values, without any concern for order.
>
>> Or rather call them sets and have them in the library.
>
> The whole builtin-vs.-library thing seems a big distraction -- if
> something needs to be implemented, the library seems the natural place
> unless there's a very good reason otherwise.


-- 
Dmitry Olshansky
November 12, 2012
Re: hashed array?
On Mon, Nov 12, 2012 at 05:52:32PM +0100, Joseph Rushton Wakeling wrote:
> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
> >>The current "workaround" is to just dummy it.
> >>alias bool Dummy;
> >>Dummy[string] names;
> >>names["Paul"] = true; //Ew.
> >
> >Make a wrapper?
> 
> The problem with wrapped versions of associative arrays is that they
> just don't scale with number of keys.  If I want to store (say) a
> set of 3 million size_t's, then it costs a lot more than 3_000_000 *
> size_t.sizeof() to do so this way.
>
> Do dedicated set implementations do better than this?

Probably.

For storing integer types, you could optimize for space by dividing your
key space into blocks of, say, 32 entries each. Then your hash function
will hash only the upper (n-5) bits of the key, and each hash slot will
be a bit array of 32 bits, which are indexed by the lower 5 bits of the
key. Assuming your keys are relatively densely populated, this will save
quite a significant amount of space. In the sparse case, you probably
don't waste that much space either, since each slot is only one 32-bit
int wide.

You can probably also increase the size of the slots to, say, 64 bits or
more, if you want to maximize the usage to GC allocation block overhead
ratio. Yes, each slot is only 1 int wide, but the GC does have to
maintain bookkeeping information that's probably a lot more than that.
Allocating a large number of very small blocks is generally worse than a
smaller number of medium-sized blocks.

If your keys are *very* dense, then storing ranges instead of individual
elements is the way to go (though the implementation will be
significantly more complex).

For storing sets of strings, though, you'd want something like a trie
instead.

It all depends on your specific application. So a per application
implementation might not be out of place.


> I should say that my own interests in an implementation of set are
> twofold -- first, efficient storage; and second, being able to do an
> efficient foreach across the stored values, without any concern for
> order.

With the hash + bit-array approach above, foreach would be very simple
(just walk the hash table slots and enumerate the bits in each slot).


> >Or rather call them sets and have them in the library.
> 
> The whole builtin-vs.-library thing seems a big distraction -- if
> something needs to be implemented, the library seems the natural
> place unless there's a very good reason otherwise.

Yeah, there's definitely an advantage to minimizing the built-in stuff
and keeping them in the library. Keeps the language relatively clean and
simple, but still provide convenient library types for common tasks.


T

-- 
Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.
November 12, 2012
Re: hashed array?
11/12/2012 9:44 PM, H. S. Teoh пишет:
> On Mon, Nov 12, 2012 at 05:52:32PM +0100, Joseph Rushton Wakeling wrote:
>> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
>>>> The current "workaround" is to just dummy it.
>>>> alias bool Dummy;
>>>> Dummy[string] names;
>>>> names["Paul"] = true; //Ew.
>>>
>>> Make a wrapper?
>>
>> The problem with wrapped versions of associative arrays is that they
>> just don't scale with number of keys.  If I want to store (say) a
>> set of 3 million size_t's, then it costs a lot more than 3_000_000 *
>> size_t.sizeof() to do so this way.
>>
>> Do dedicated set implementations do better than this?
>
> Probably.
>
> For storing integer types, you could optimize for space by dividing your
> key space into blocks of, say, 32 entries each. Then your hash function
> will hash only the upper (n-5) bits of the key, and each hash slot will
> be a bit array of 32 bits, which are indexed by the lower 5 bits of the
> key. Assuming your keys are relatively densely populated, this will save
> quite a significant amount of space. In the sparse case, you probably
> don't waste that much space either, since each slot is only one 32-bit
> int wide.

Problem is that keys will collide so each slot will have to have upper 
bits of key *and* the small set. The bucket itself is still inside 
intrusive linked list.

Otherwise the idea is great. About 3 words of overhead per word-sized set.

>
> You can probably also increase the size of the slots to, say, 64 bits or
> more, if you want to maximize the usage to GC allocation block overhead
> ratio. Yes, each slot is only 1 int wide, but the GC does have to
> maintain bookkeeping information that's probably a lot more than that.

16 bytes. Though you'd have to know how bucket entry looks like which is 
impossible with built-in hash map sadly.

> Allocating a large number of very small blocks is generally worse than a
> smaller number of medium-sized blocks.
>
[snip other good points]


-- 
Dmitry Olshansky
November 12, 2012
Re: hashed array?
On Mon, Nov 12, 2012 at 09:58:14PM +0400, Dmitry Olshansky wrote:
> 11/12/2012 9:44 PM, H. S. Teoh пишет:
[...]
> >For storing integer types, you could optimize for space by dividing
> >your key space into blocks of, say, 32 entries each. Then your hash
> >function will hash only the upper (n-5) bits of the key, and each
> >hash slot will be a bit array of 32 bits, which are indexed by the
> >lower 5 bits of the key. Assuming your keys are relatively densely
> >populated, this will save quite a significant amount of space. In the
> >sparse case, you probably don't waste that much space either, since
> >each slot is only one 32-bit int wide.
> 
> Problem is that keys will collide so each slot will have to have upper
> bits of key *and* the small set. The bucket itself is still inside
> intrusive linked list.

True.


> Otherwise the idea is great. About 3 words of overhead per word-sized
> set.
> 
> >
> >You can probably also increase the size of the slots to, say, 64 bits
> >or more, if you want to maximize the usage to GC allocation block
> >overhead ratio. Yes, each slot is only 1 int wide, but the GC does
> >have to maintain bookkeeping information that's probably a lot more
> >than that.
> 
> 16 bytes. Though you'd have to know how bucket entry looks like
> which is impossible with built-in hash map sadly.
[...]

Actually, the built-in AA's bucket struct is duplicated in object_.d;
it's what the range-based API uses to walk the AA. (Yeah it's extremely
ugly. It must be cleaned up.)


T

-- 
There is no gravity. The earth sucks.
November 12, 2012
Re: hashed array?
On Monday, November 12, 2012 15:43:57 monarch_dodra wrote:
> Thoughts?

For now, if you need a set, then use RedBlackTree. Long term, a hash set 
should be added to std.container. As cool as it is to have AAs built into the 
language, they have been a disaster on a number of levels, and they buy you 
very little over having a library type. So, I see no reason to complicate the 
language further buy trying to add a hash set to it (especially considering 
all of the issues with AAs).

- Jonathan M Davis
November 12, 2012
Re: hashed array?
On 2012-11-12 16:10, Andrej Mitrovic wrote:
> On 11/12/12, monarch_dodra <monarchdodra@gmail.com> wrote:
>> But of course, a library solution would be good too.
>
> dcollections has sets http://www.dsource.org/projects/dcollections
> (not sure if it still compiles though)
>

And Tango:

https://github.com/SiegeLord/Tango-D2
http://dsource.org/projects/tango/docs/current

-- 
/Jacob Carlborg
November 12, 2012
Re: hashed array?
On Mon, Nov 12, 2012 at 08:07:08PM +0100, Jonathan M Davis wrote:
> On Monday, November 12, 2012 15:43:57 monarch_dodra wrote:
> > Thoughts?
> 
> For now, if you need a set, then use RedBlackTree. Long term, a hash
> set should be added to std.container. As cool as it is to have AAs
> built into the language, they have been a disaster on a number of
> levels, and they buy you very little over having a library type. So, I
> see no reason to complicate the language further buy trying to add a
> hash set to it (especially considering all of the issues with AAs).
[...]

I contend that the problem with built-in AA's is their implementation,
not the fact that they're built-in.


T

-- 
Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.
November 12, 2012
Re: hashed array?
On Monday, November 12, 2012 11:36:38 H. S. Teoh wrote:
> I contend that the problem with built-in AA's is their implementation,
> not the fact that they're built-in.

Oh, I agree, but we, as a group, have obviously failed to handle the 
implementation of the built-in AAs properly, and I think that the indications 
are definitely that it's harder to get the built-in AAs right that it would be 
to create a library solution.

At this point, AAs are part of the language and need to be fixed, but I see no 
reason to add anything else like them to the language. It's just not worth the 
trouble.

- Jonathan M Davis
1 2 3 4
Top | Discussion index | About this forum | D home