November 02, 2013
I'm contemplating an associative array that will eventually grow to be an estimated 64KB in size, assuming it's about half full.  It would then be holding around 90,000 entries.  Is this reasonable, or should I go with a sorted array, and do binary searches?  I estimate that binary searches would average around 15 accesses/hit, but that's still a lot faster than disk access.  The array would be nearly full, so there would be less wasted space, but, of course, insertions would become quite expensive.  (Deletions aren't expected.)  In either case the keys would be fixed length character arrays, and the values would also be simple structs without reference types (so the garbage collector could pretty much ignore things).

FWIW I'm expecting this array to be around the entire time the program is running, but it would probably be rolled out most of the time, as frequently accessed items would be pulled into a much smaller structure, and if I go with the array rather than the hash table I may only do updates as a batch.  (Since it's linear with the size of the array, but essentially ignores the number of items being added.)

My temptation is to go with the hash table...but I'm not sure of the underlying mechanics.
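
For concreteness, the AA version I'm picturing would look roughly like this (the 8-character key length and the value fields are placeholders for illustration, not the real layout):

struct IdInfo
{
    uint   id;     // internal id#
    ushort flags;  // plain data only, nothing for the GC to scan
}

void main()
{
    IdInfo[char[8]] table;   // fixed-length character-array keys

    char[8] code = "EXTCODE1";
    table[code] = IdInfo(12_345, 0);

    if (auto p = code in table)   // lookup returns a pointer, or null
        assert(p.id == 12_345);
}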

-- 
Charles Hixson

November 02, 2013
What are you going to be using the collection for?

November 02, 2013
On 11/02/2013 04:49 PM, TheFlyingFiddle wrote:
> What are you going to be using the collection for?
>
>
It's basically a lookup table used for translating external codes (essentially arbitrary) into id#s used internally.  There are LOTS of id#s that don't correspond to ANY external code, and nearly all the processing is done only using those id#s.  The codes are used only during i/o, as they are more intelligible to people.

-- 
Charles Hixson

November 03, 2013
On Saturday, 2 November 2013 at 23:58:07 UTC, Charles Hixson wrote:
> On 11/02/2013 04:49 PM, TheFlyingFiddle wrote:
>> What are you going to be using the collection for?
>>
>>
> It's basically a lookup table used for translating external codes (essentially arbitrary) into id#s used internally.  There are LOTS of id#s that don't correspond to ANY external code, and nearly all the processing is done only using those id#s.  The codes are used only during i/o, as they are more intelligible to people.

Well, if the codes are only used during i/o, the performance of the collection is not really that important, since the bottleneck is the i/o anyway.

That said, my recommendation is to go with the AA, since it's really simple to work with. It also has the benefit that, with a good hashing function, lookup time will generally be O(1). It also eliminates the need to build a custom data structure.

It is my understanding that there are no hard limits on the number of elements in the built-in AA. (I can't say for sure, but it's very simple to test.)

Since the collection is never going to be deleted: if the external codes are known at compile time, I would recommend generating the collection at compile time, thus avoiding the runtime cost of building the table. (This works for either the AA or the sorted array.) It also eliminates the need to care about insertion times (unless you get unreasonable build times, of course ^^).
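
If the codes did turn out to be known ahead of time, a minimal sketch of that pattern could be the following (the codes and ids are placeholders; the built-in AA can't really be baked in as a compile-time constant, so the usual workaround is to fill the table once in a module constructor):

// Built once at program start-up; effectively read-only afterwards.
uint[string] codeToId;

shared static this()
{
    codeToId = [
        "EXT0001": 1u,
        "EXT0002": 2u,
        "EXT0003": 3u,
    ];
}

void main()
{
    assert(codeToId["EXT0002"] == 2);
}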

Hope this helped a little. It's hard to say whether the AA or the sorted array is the way to go without profiling the complete application.

November 03, 2013
On 11/02/2013 05:16 PM, TheFlyingFiddle wrote:
> On Saturday, 2 November 2013 at 23:58:07 UTC, Charles Hixson wrote:
>> On 11/02/2013 04:49 PM, TheFlyingFiddle wrote:
>>> What are you going to be using the collection for?
>>>
>>>
>> It's basically a lookup table used for translating external codes (essentially arbitrary) into id#s used internally.  There are LOTS of id#s that don't correspond to ANY external code, and nearly all the processing is done only using those id#s.  The codes are used only during i/o, as they are more intelligible to people.
>
> Well, if the codes are only used during i/o, the performance of the collection is not really that important, since the bottleneck is the i/o anyway.
>
> That said, my recommendation is to go with the AA, since it's really simple to work with. It also has the benefit that, with a good hashing function, lookup time will generally be O(1). It also eliminates the need to build a custom data structure.
>
> It is my understanding that there are no hard limits on the number of elements in the built-in AA. (I can't say for sure, but it's very simple to test.)
>
> Since the collection is never going to be deleted: if the external codes are known at compile time, I would recommend generating the collection at compile time, thus avoiding the runtime cost of building the table. (This works for either the AA or the sorted array.) It also eliminates the need to care about insertion times (unless you get unreasonable build times, of course ^^).
>
> Hope this helped a little. It's hard to say whether the AA or the sorted array is the way to go without profiling the complete application.
>
>
Well, I didn't expect any hard limits, but it isn't easy to test this in context.  I'm pretty sure an array implementation could be run without impacting the rest of the program (which hasn't yet been written).  It's slower, but it's less resource intensive, and it's even smaller.  If AAs don't have any hidden penalty, like causing the garbage collector to thrash, then that's the way to go (and then, yes, I'll need to test it, but I'd really be surprised if THAT step caused a problem).  IIRC I've used larger AAs in simple programs without problems, but this one will probably run continually for months, and the AA is only one piece, so I'm a bit concerned about things that I can't see any easy way to test.  So I thought I'd ask.
P.S.:  The external codes are NOT known at compile time.  Not even at the start of run time.  They are added incrementally, and there is no actual rule about what they will be.  (This is a part of the reason for all the "about" and "estimated" in my original statement (unless I edited them out).)
And there will actually be two of the structures, of slightly differing composition and size, but they're essentially the same in use.  There will also be a couple of others that are rather different in this particular section of the program.  Access to all of these will be via a handler class, so I may make all the mentioned tables private and handle them with internal classes.  But this is probably more about penguins than you really want to know.

So it sounds like I should plan on using AAs, and just keep the array option around as a fallback position.

-- 
Charles Hixson

November 03, 2013
> If AAs don't have any hidden penalty, like causing the garbage collector to thrash, then that's the way to go

Any time you insert an item into an AA it might allocate, so it might cause a garbage collection cycle. I'm unsure what you mean by "causing the garbage collector to thrash" -- memory leaks? If it's not acceptable for the GC to collect, then either build a custom hash map class with deterministic memory management or implement your sorted array version.

Note: lookups into AAs do not allocate.
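
A minimal sketch of the difference (the code string and the id value are made up):

import std.stdio : writeln;

void main()
{
    uint[string] table;

    // Insertion: indexing a missing key allocates a new slot, which is
    // what can trigger a collection.
    table["EXT0001"] = 42;

    // Lookup: the `in` operator just returns a pointer to the value
    // (or null if the key is absent); no allocation happens here.
    if (auto p = "EXT0001" in table)
        writeln("id = ", *p);
}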

> I've used larger AAs in simple programs without problems, but this one will probably run continually for months.

I have never written anything that is supposed to run for months at a time, so take anything I say with a grain of salt.

> is only one piece, so I'm a bit concerned about things that I can't see any easy way to test.  So I thought I'd ask.
> P.S.:  The external codes are NOT known at compile time.  Not even at the start of run time. They are added incrementally, and there is no actual rule about what they will be.  (This is a part of the reason for all the "about" and "estimated" in my original statement (unless I edited them out).)

Since the codes follow no rule, the best you can do for a hash function is to use a general-purpose one, which might not be good enough for your purposes. You could test it with the built-in string hash function on random strings of a specific length, but that's no guarantee it will work well for your specific data. However, it's highly likely that it will still be faster than the roughly 15 probes per lookup you estimated for the binary search.
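
A rough way to test it could be something like this (the 8-character codes, the 90,000 count, and random uppercase letters are all assumptions, and duplicates among the random codes are simply ignored):

import core.time : MonoTime;
import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum codeLen = 8;        // assumed code length
    enum count   = 90_000;   // roughly the size discussed here

    uint[string] table;
    string[] codes;
    codes.reserve(count);

    foreach (i; 0 .. count)
    {
        // Generate a random fixed-length code of uppercase letters.
        char[codeLen] buf;
        foreach (ref c; buf)
            c = cast(char)('A' + uniform(0, 26));
        auto code = buf[].idup;
        table[code] = cast(uint) i;
        codes ~= code;
    }

    // Time a full pass of lookups with the `in` operator.
    auto start = MonoTime.currTime;
    size_t hits;
    foreach (code; codes)
        if (code in table)
            ++hits;
    auto elapsed = MonoTime.currTime - start;

    writefln("%s lookups, %s hits, %s", codes.length, hits, elapsed);
}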

> So it sounds like I should plan on using AAs, and just keep the array option around as a fallback position.

I would agree. If the extra memory and non-determinism are not acceptable, go with the array.

November 03, 2013
On 03-Nov-2013 02:37, Charles Hixson wrote:
> I'm contemplating an associative array that will eventually grow to be
> an estimated 64KB in size, assuming it's about half full.  It would then
> be holding around 90,000 entries.  Is this reasonable, or should I go
> with a sorted array, and do binary searches?  I estimate that binary
> searches would average around 15 accesses/hit, but that's still a lot
> faster than disk access.  The array would be nearly full, so there would
> be less wasted space, but, of course, insertions would become quite
> expensive.  (Deletions aren't expected.)  In either case the keys would
> be fixed length character arrays, and the values would also be simple
> structs without reference types (so the garbage collector could pretty
> much ignore things).

90k elements is a small AA. It would work just fine with any sane hash function (e.g. the built-in one).

At around 10M+ elements it may be worth considering space-saving techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd advise against it, but only because of the GC.

> FWIW I'm expecting this array to be around the entire time the program
> is running, but it would probably be rolled out most of the time, as
> frequently accessed items would be pulled into a much smaller structure,
> and if I go with the array rather than the hash table I may only do
> updates as a batch.  (Since it's linear with the size of the array, but
> essentially ignores the number of items being added.)
>
> My temptation is to go with the hash table...but I'm not sure of the
> underlying mechanics.

Think of it as a sparse array that has "holes" and takes around 2-4 times the size of a normal array. That's not only because of the holes but because each slot is a bit bigger.
In return, lookups/inserts/deletions are cheap: O(1) with high probability.

-- 
Dmitry Olshansky

November 04, 2013
On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
> On 03-Nov-2013 02:37, Charles Hixson wrote:
>> I'm contemplating an associative array that will eventually grow to be
>> an estimated 64KB in size, assuming it's about half full.  It would then
>> be holding around 90,000 entries.  Is this reasonable, or should I go
>> with a sorted array, and do binary searches?  I estimate that binary
>> searches would average around 15 accesses/hit, but that's still a lot
>> faster than disk access.  The array would be nearly full, so there would
>> be less wasted space, but, of course, insertions would become quite
>> expensive.  (Deletions aren't expected.)  In either case the keys would
>> be fixed length character arrays, and the values would also be simple
>> structs without reference types (so the garbage collector could pretty
>> much ignore things).
>
> 90k elements is a small AA. It would work just fine with any sane hash function (e.g. the built-in one).
>
> At around 10M+ elements it may be worth considering space-saving techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd advise against it, but only because of the GC.
>
>> FWIW I'm expecting this array to be around the entire time the program
>> is running, but it would probably be rolled out most of the time, as
>> frequently accessed items would be pulled into a much smaller structure,
>> and if I go with the array rather than the hash table I may only do
>> updates as a batch.  (Since it's linear with the size of the array, but
>> essentially ignores the number of items being added.)
>>
>> My temptation is to go with the hash table...but I'm not sure of the
>> underlying mechanics.
>
> Think of it as a sparse array that has "holes" and takes around 2-4 times the size of a normal array. That's not only because of the holes but because each slot is a bit bigger.
> In return, lookups/inserts/deletions are cheap: O(1) with high probability.
>
Thanks, I'd been assuming that it would just be around 128 bytes larger (i.e., two pointers) and have around a 50% fill rate.  That sounds a bit worse than I was estimating... not hugely worse, though.  OTOH, that's an argument in favor of using a small cache that is a hash table, with a large array as a backing store where infrequently accessed items could be kept.  If the cache could hold around 2,000 entries, it would probably have over a 90% hit rate (though that would need to be tested).  The array would also provide a sorted access method for traversing it.  I'm not sure that's important, but it could be.  It could also be saved to an externally editable file (i.e., one editable with a standard text editor), which also might be important.
This will, naturally, slow down insertions. (I don't expect any deletions, as they would render historical data unintelligible.)

Perhaps I'll batch the insertions, as the insertion is linear WRT the size of the array, but is essentially independent of the number of insertions.  (I.e., no matter how many insertions, it's a one pass job.)
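
As a rough sketch of what I'm picturing (the Entry layout, the names, and the string keys are placeholders for illustration, not the real types):

import std.algorithm.sorting : sort;
import std.range : assumeSorted;

struct Entry { string code; uint id; }   // placeholder shapes

struct CodeTable
{
    uint[string] cache;   // small AA in front, for frequently used codes
    Entry[] store;        // large sorted backing array
    Entry[] pending;      // insertions batched until the next flush

    // Check the cache first, then binary-search the backing array.
    uint* lookup(string code)
    {
        if (auto p = code in cache)
            return p;
        auto sorted = store.assumeSorted!((a, b) => a.code < b.code);
        auto hit = sorted.equalRange(Entry(code, 0));
        if (!hit.empty)
        {
            cache[code] = hit.front.id;   // promote into the cache
            return code in cache;
        }
        return null;
    }

    void insert(string code, uint id) { pending ~= Entry(code, id); }

    // One linear merge pass absorbs any number of pending insertions.
    void flush()
    {
        if (pending.length == 0)
            return;
        pending.sort!((a, b) => a.code < b.code);

        auto merged = new Entry[](store.length + pending.length);
        size_t i, j, k;
        while (i < store.length && j < pending.length)
            merged[k++] = store[i].code <= pending[j].code
                        ? store[i++] : pending[j++];
        while (i < store.length)   merged[k++] = store[i++];
        while (j < pending.length) merged[k++] = pending[j++];

        store = merged;
        pending = null;
    }
}

Lookups are then either a cache hit or one binary search, and a flush is essentially a single linear pass over the backing array (plus sorting the much smaller batch), no matter how many insertions were queued.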

-- 
Charles Hixson

November 05, 2013
Why are you obsessing over a 64k hash table... that is tiny.

November 05, 2013
On 11/04/2013 04:22 PM, Froglegs wrote:
> why are you obsessing over a 64k hash table... that is tiny..
>
Probably because I'm an old fogey, and to me 64KB sounds like all the RAM a computer is likely to have.

-- 
Charles Hixson
