December 17, 2013 [OT] Efficient file structure for very large lookup tables?

Another OT thread to pick your brains. :)
What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
T
--
Computers are like a jungle: they have monitor lizards, rams, mice, c-moss, binary trees... and bugs.

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space). Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
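
For reference, the in-memory core of such a tree is small. A sketch only: KeyDim, the 100-byte payload, and the cycling split dimension are assumptions, and a disk-resident version would pack many nodes per page instead of allocating them individually:

```d
// A minimal in-memory k-d tree insert, assuming fixed-length numeric keys.
// KeyDim and the 100-byte payload are made-up numbers for illustration.
enum KeyDim = 4;

struct Node
{
    double[KeyDim] key;   // the point itself
    ubyte[100] value;     // payload (~100 bytes in this thread's case)
    Node* left, right;
}

// Insert by cycling through the dimensions: depth % KeyDim picks which
// coordinate to compare at each level of the tree.
Node* insert(Node* root, double[KeyDim] key, ubyte[100] value, size_t depth = 0)
{
    if (root is null)
    {
        auto n = new Node;
        n.key = key;
        n.value = value;
        return n;
    }
    immutable d = depth % KeyDim;
    if (key[d] < root.key[d])
        root.left = insert(root.left, key, value, depth + 1);
    else
        root.right = insert(root.right, key, value, depth + 1);
    return root;
}
```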

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
>
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which are likely to be accessed with higher temporal
> locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and also only exhibit good locality if table entries are
> accessed sequentially; the problem is I'm working with high-dimensional
> data and the order of accesses is unlikely to be sequential. However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space). Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
>
> T

A burst trie might work, although it was originally designed for text. I haven't had a chance to try one out in code yet, but the paper by Heinz, Zobel, and Williams is interesting:

Burst Tries: A Fast, Efficient Data Structure for String Keys (2002)
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499

Joseph
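
Roughly, a burst trie keeps small unsorted containers of key suffixes hanging off trie nodes and "bursts" a container into a new node once it passes a size threshold. A minimal sketch, not the paper's exact algorithm: the threshold is arbitrary, the container heuristics are skipped, and keys ending exactly at an internal node are ignored:

```d
// Burst-trie sketch for string keys.
enum burstThreshold = 32;   // arbitrary: burst a container once it holds 32 suffixes

final class BurstNode
{
    BurstNode[256] children;   // trie children, indexed by the next key byte
    string[][256] buckets;     // unsorted containers of key suffixes
}

void insert(BurstNode root, string key)
{
    auto cur = root;
    size_t depth = 0;
    while (depth < key.length)
    {
        immutable c = cast(ubyte) key[depth];
        if (cur.children[c] !is null)   // this byte was already burst: descend
        {
            cur = cur.children[c];
            ++depth;
            continue;
        }
        cur.buckets[c] ~= key[depth + 1 .. $];   // stash the remaining suffix
        if (cur.buckets[c].length > burstThreshold)
            burst(cur, c);
        return;
    }
}

// "Bursting": replace an overfull container with a new trie node one level
// deeper and redistribute its suffixes.
void burst(BurstNode parent, ubyte c)
{
    auto node = new BurstNode;
    foreach (suffix; parent.buckets[c])
        if (suffix.length > 0)   // simplification: empty suffixes are dropped
            node.buckets[cast(ubyte) suffix[0]] ~= suffix[1 .. $];
    parent.buckets[c] = null;
    parent.children[c] = node;
}
```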

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
>
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which are likely to be accessed with higher temporal
> locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and also only exhibit good locality if table entries are
> accessed sequentially; the problem is I'm working with high-dimensional
> data and the order of accesses is unlikely to be sequential. However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space). Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
>
> T

As a first attempt, could you use a key-value database (like Redis, if you have enough memory to fit everything in)? Or is that out of the question?

Another question: can your queries be batched? If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree", which might work well. However, if you thought implementing a B-tree was going to be painful, that might not appeal to you. If you don't want to implement that yourself you could look at TPIE, although it is in C++:

http://www.madalgo.au.dk/tpie/

If I had to design something quick on the spot, my first guess would be to use a grid on the first two dimensions, bin the 'points' or keys within each grid square, and build a simpler structure on those. This won't work so well, though, for really high-dimensional data or if the 'points' are randomly distributed.

Also, what exactly do you mean by "in that space" when you say: "if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space"? Do you mean that the value of Y in the next dimension is numerically close (or expected to be) to X?

Cheers,
Craig
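
For the grid-on-the-first-two-dimensions idea above, a sketch might look like this; the 256x256 resolution and the assumption that coordinates are normalized to [0, 1) are made up for illustration:

```d
// Grid binning on the first two dimensions: keys whose first two coordinates
// fall in the same cell go into the same bucket, e.g. one file (or one region
// of a big file) per cell, each holding a simpler structure such as a sorted
// array or a small B-tree.
enum gridSize = 256;   // hypothetical resolution: 256 x 256 cells

size_t cellOf(const(double)[] key)
{
    assert(key.length >= 2);
    auto gx = cast(size_t)(key[0] * gridSize);
    auto gy = cast(size_t)(key[1] * gridSize);
    if (gx >= gridSize) gx = gridSize - 1;   // clamp values exactly at 1.0
    if (gy >= gridSize) gy = gridSize - 1;
    return gy * gridSize + gx;               // linear bucket index
}
```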

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to tn

On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
>> However,
>> they do exhibit good spatial locality in higher-dimensional space (i.e.,
>> if entry X is accessed first, then the next entry Y is quite likely to
>> be close to X in that space). Does anybody know of a good data
>> structure that can take advantage of this fact to minimize disk
>> accesses?
>
> Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?

It's not cache-oblivious, but he could try the KDB tree:

http://dl.acm.org/citation.cfm?id=582321

or, again for batched queries, there is the Bkd-tree (search on Bkd-tree and Lars Arge) - this is a type of buffer tree, so the operations need to be able to be run as a batch.

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
But 200 bytes * 10 million entries = 2 GB. Can't you use an existing KD-tree, tweak it to fit memory pages, and rely on paging for a start?
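
For instance, something along these lines. A sketch only: std.mmfile is in Phobos, but the fixed 200-byte record layout and the file name are assumptions made for illustration:

```d
// Memory-map the table file and let the OS do the paging on demand.
import std.mmfile;

enum recordSize = 200;   // ~100-byte key + ~100-byte value

void main()
{
    // Open (or create) a read/write mapping big enough for 10 million records.
    auto mmf = new MmFile("table.dat", MmFile.Mode.readWrite,
                          10_000_000L * recordSize, null);

    // The whole file appears as one byte slice; the kernel faults in only
    // the pages that are actually touched.
    auto data = cast(ubyte[]) mmf[];

    // With fixed-size records, record i lives at a fixed offset, so a lookup
    // is plain slicing (an index structure would map keys to record numbers).
    size_t i = 12_345;
    ubyte[] record = data[i * recordSize .. (i + 1) * recordSize];
}
```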

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

It's not a data structure per se, but Google's LevelDB is a very good, quite fast arbitrary key-value storage system (I use it in my project relating a 1-word key to a 200-word value), and it supports batch transactional operations. Think of it as a key-value storage version of SQLite.

https://code.google.com/p/leveldb/

And then the D bindings:

https://github.com/bheads/d-leveldb

And a D wrapper providing a nicer interface to LevelDB:

https://github.com/bheads/leveldb
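
A rough usage sketch against LevelDB's C API (leveldb/c.h), with error handling omitted and the database path made up; the d-leveldb bindings linked above provide complete declarations, so this is only to show the shape of put/get:

```d
// Minimal hand-rolled extern(C) declarations for a subset of leveldb/c.h.
// Build with the LevelDB library installed, e.g. link with -lleveldb.
import core.stdc.stdlib : free;

extern (C)
{
    struct leveldb_t;
    struct leveldb_options_t;
    struct leveldb_readoptions_t;
    struct leveldb_writeoptions_t;

    leveldb_options_t*      leveldb_options_create();
    void                    leveldb_options_set_create_if_missing(leveldb_options_t*, ubyte);
    leveldb_readoptions_t*  leveldb_readoptions_create();
    leveldb_writeoptions_t* leveldb_writeoptions_create();

    leveldb_t* leveldb_open(const(leveldb_options_t)* options, const(char)* name, char** errptr);
    void       leveldb_close(leveldb_t* db);
    void       leveldb_put(leveldb_t* db, const(leveldb_writeoptions_t)* options,
                           const(char)* key, size_t keylen,
                           const(char)* val, size_t vallen, char** errptr);
    char*      leveldb_get(leveldb_t* db, const(leveldb_readoptions_t)* options,
                           const(char)* key, size_t keylen,
                           size_t* vallen, char** errptr);
}

void main()
{
    char* err = null;                       // error handling omitted for brevity
    auto opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    auto db = leveldb_open(opts, "/tmp/lookup.ldb", &err);

    // Store one ~100-byte key / ~100-byte value pair.
    string key = "example-key", val = "example-value";
    auto wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, key.ptr, key.length, val.ptr, val.length, &err);

    // Look it up again. LevelDB keeps keys sorted on disk, so related keys
    // tend to share blocks. The result buffer is malloc'd and must be freed.
    size_t vlen;
    auto ropts = leveldb_readoptions_create();
    char* got = leveldb_get(db, ropts, key.ptr, key.length, &vlen, &err);
    if (got !is null) free(got);
    leveldb_close(db);
}
```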

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On 17.12.2013. 20:08, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
>
> I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
>
> T

The sqlite file format seems to be fairly well documented:

http://www.sqlite.org/fileformat.html

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to Craig Dillabaugh

On Tue, Dec 17, 2013 at 08:54:17PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
> >On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >>However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
> >
> >Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
>
> It's not cache-oblivious, but he could try the KDB tree:
>
> http://dl.acm.org/citation.cfm?id=582321
>
> or, again for batched queries, there is the Bkd-tree (search on Bkd-tree and Lars Arge) - this is a type of buffer tree, so the operations need to be able to be run as a batch.

I skimmed over the KDB tree paper, and am reading the Bkd-tree paper now; it appears that the latter is an improvement over the former, and has better update performance characteristics. However, it also appears much more complicated to implement.

Another point I forgot to mention is that currently I only ever perform insertions and queries, and the two can actually be combined (basically the table is kept for uniqueness testing). I don't need to delete entries from the table. Eventually I might look into that to reduce disk space requirements, but as far as the actual algorithm is concerned, it's not necessary. So there's potentially a good amount of room here for optimization by foregoing the ability to delete entries.

I'm currently considering using simpler, "dumber" file structures for storing nodes, with an in-memory Bloom filter to eliminate a (hopefully) large number of I/O roundtrips, since I'm expecting that a good fraction of queries will return 'not found'.

T

--
"A one-question geek test. If you get the joke, you're a geek: Seen on a California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs - Natural Intelligence, Inc.
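
For the in-memory Bloom filter idea above, a minimal sketch could look like this; the double-hashing scheme and all sizes are illustrative choices, and a real filter would be sized from the expected entry count and the target false-positive rate:

```d
// Bloom filter for the "is this key definitely absent?" pre-check, so that
// only keys which *may* be present pay for a disk lookup.
struct BloomFilter
{
    ulong[] bits;        // bit array, 64 bits per word
    size_t nbits;
    size_t nhashes;

    this(size_t nbits, size_t nhashes)
    {
        this.nbits = nbits;
        this.nhashes = nhashes;
        bits = new ulong[(nbits + 63) / 64];
    }

    // Double hashing: derive the i-th probe position from two base hashes.
    private size_t probe(const(ubyte)[] key, size_t i) const
    {
        ulong h1 = 14695981039346656037UL;            // FNV-1a over the key bytes
        foreach (b; key) { h1 ^= b; h1 *= 1099511628211UL; }
        ulong h2 = (h1 ^ (h1 >> 33)) * 0xff51afd7ed558ccdUL;   // extra mixing step
        return cast(size_t)((h1 + i * h2) % nbits);
    }

    void add(const(ubyte)[] key)
    {
        foreach (i; 0 .. nhashes)
        {
            immutable p = probe(key, i);
            bits[p / 64] |= 1UL << (p % 64);
        }
    }

    // false => the key was never added; true => it may be present.
    bool mayContain(const(ubyte)[] key) const
    {
        foreach (i; 0 .. nhashes)
        {
            immutable p = probe(key, i);
            if (!(bits[p / 64] & (1UL << (p % 64))))
                return false;
        }
        return true;
    }
}
```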

December 18, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
What kind of key do you have? A vector of floats? Integers?

How about some uberfast compression algorithm (e.g. LZ4) to compress pages when you have to offload data to the disk?