[OT] Efficient file structure for very large lookup tables? (page 2)

December 18, 2013

Re: [OT] Efficient file structure for very large lookup tables?

Posted by H. S. Teoh
in reply to Craig Dillabaugh

Permalink

H. S. Teoh

Posted in reply to Craig Dillabaugh

Permalink

On Wed, Dec 18, 2013 at 03:03:20PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
> >On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
> 
> >>Another question is can your queries be batched?  If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree" which might work well.
> >
> >Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly.  But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?
> >
> I should clarify a bit what I meant by batched.  If you have a sequence of queries q_1, q_2, .... q_n (where I guess in your case a query can be/is an insertion), then by 'batch' proccess what I mean is you do not need to get the answers on the fly. It is sufficient to have all queries answered when the program terminates?

Hmm. This is an interesting thought. In a sense, queries don't have to be answered immediately; the program can generate a series of queries and then continue processing the next item, the answer can be asynchronous. When the answer becomes available, then further processing will happen (add new items to be processed, update existing items, etc.). The only requirement is that queries will always be answered in the order they were made (the correctness of the algorithm depends on this).

In a sense, what I need is really just a single operation "add-or-update", which can be processed in bulk. Either an item is already in the table, in which case it gets updated in a certain way, or it isn't, in which case it is added. In both cases, some additional processing takes place after the table is updated, which may create more items to be processed.

> Note that these techniques still account for the 'order' of the queries, so the answer to query q_i will be reported correctly (ie. same as if the queries were performed sequentially on a standard data structure).

Then this is OK.

> Also, if you are still considering using hashing the following paper may be of interest to you (I haven't read it in detail, but it doesn't look too crazy)
> 
> http://arxiv.org/abs/1104.2799

Thanks for the references! I'll look into this, it seems quite relevant.

T

-- 
You have to expect the unexpected. -- RL

Forums