December 17, 2013 [OT] Efficient file structure for very large lookup tables?

Another OT thread to pick your brains. :)
What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
T
--
Computers are like a jungle: they have monitor lizards, rams, mice, c-moss, binary trees... and bugs.

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space). Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
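
For reference, the in-memory core of such a tree is small. A sketch only: KeyDim, the 100-byte payload, and the cycling split dimension are assumptions, and a disk-resident version would pack many nodes per page instead of allocating them individually:

```d
// A minimal in-memory k-d tree insert, assuming fixed-length numeric keys.
// KeyDim and the 100-byte payload are made-up numbers for illustration.
enum KeyDim = 4;

struct Node
{
    double[KeyDim] key;   // the point itself
    ubyte[100] value;     // payload (~100 bytes in this thread's case)
    Node* left, right;
}

// Insert by cycling through the dimensions: depth % KeyDim picks which
// coordinate to compare at each level of the tree.
Node* insert(Node* root, double[KeyDim] key, ubyte[100] value, size_t depth = 0)
{
    if (root is null)
    {
        auto n = new Node;
        n.key = key;
        n.value = value;
        return n;
    }
    immutable d = depth % KeyDim;
    if (key[d] < root.key[d])
        root.left = insert(root.left, key, value, depth + 1);
    else
        root.right = insert(root.right, key, value, depth + 1);
    return root;
}
```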

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
>
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which are likely to be accessed with higher temporal
> locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and also only exhibit good locality if table entries are
> accessed sequentially; the problem is I'm working with high-dimensional
> data and the order of accesses is unlikely to be sequential. However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space). Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
>
> T

A burst trie might work, although it was originally designed for text. I haven't had a chance to try one out in code yet, but the paper by Heinz, Zobel, and Williams is interesting:

Burst Tries: A Fast, Efficient Data Structure for String Keys (2002)
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499

Joseph
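
Roughly, a burst trie keeps small unsorted containers of key suffixes hanging off trie nodes and "bursts" a container into a new node once it passes a size threshold. A minimal sketch, not the paper's exact algorithm: the threshold is arbitrary, the container heuristics are skipped, and keys ending exactly at an internal node are ignored:

```d
// Burst-trie sketch for string keys.
enum burstThreshold = 32;   // arbitrary: burst a container once it holds 32 suffixes

final class BurstNode
{
    BurstNode[256] children;   // trie children, indexed by the next key byte
    string[][256] buckets;     // unsorted containers of key suffixes
}

void insert(BurstNode root, string key)
{
    auto cur = root;
    size_t depth = 0;
    while (depth < key.length)
    {
        immutable c = cast(ubyte) key[depth];
        if (cur.children[c] !is null)   // this byte was already burst: descend
        {
            cur = cur.children[c];
            ++depth;
            continue;
        }
        cur.buckets[c] ~= key[depth + 1 .. $];   // stash the remaining suffix
        if (cur.buckets[c].length > burstThreshold)
            burst(cur, c);
        return;
    }
}

// "Bursting": replace an overfull container with a new trie node one level
// deeper and redistribute its suffixes.
void burst(BurstNode parent, ubyte c)
{
    auto node = new BurstNode;
    foreach (suffix; parent.buckets[c])
        if (suffix.length > 0)   // simplification: empty suffixes are dropped
            node.buckets[cast(ubyte) suffix[0]] ~= suffix[1 .. $];
    parent.buckets[c] = null;
    parent.children[c] = node;
}
```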

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
>
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which are likely to be accessed with higher temporal
> locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and also only exhibit good locality if table entries are
> accessed sequentially; the problem is I'm working with high-dimensional
> data and the order of accesses is unlikely to be sequential. However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space). Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
>
> T

As a first attempt, could you use a key-value database (like Redis, if you have enough memory to fit everything in)? Or is that out of the question?

Another question: can your queries be batched? If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree", which might work well. However, if you thought implementing a B-tree was going to be painful, that might not appeal to you. If you don't want to implement that yourself you could look at TPIE, although it is in C++:

http://www.madalgo.au.dk/tpie/

If I had to design something quick on the spot, my first guess would be to use a grid on the first two dimensions, bin the 'points' or keys within each grid square, and build a simpler structure on those. This won't work so well, though, for really high-dimensional data or if the 'points' are randomly distributed.

Also, what exactly do you mean by "in that space" when you say: "if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space"? Do you mean that the value of Y in the next dimension is numerically close (or expected to be) to X?

Cheers,
Craig
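
For the grid-on-the-first-two-dimensions idea above, a sketch might look like this; the 256x256 resolution and the assumption that coordinates are normalized to [0, 1) are made up for illustration:

```d
// Grid binning on the first two dimensions: keys whose first two coordinates
// fall in the same cell go into the same bucket, e.g. one file (or one region
// of a big file) per cell, each holding a simpler structure such as a sorted
// array or a small B-tree.
enum gridSize = 256;   // hypothetical resolution: 256 x 256 cells

size_t cellOf(const(double)[] key)
{
    assert(key.length >= 2);
    auto gx = cast(size_t)(key[0] * gridSize);
    auto gy = cast(size_t)(key[1] * gridSize);
    if (gx >= gridSize) gx = gridSize - 1;   // clamp values exactly at 1.0
    if (gy >= gridSize) gy = gridSize - 1;
    return gy * gridSize + gx;               // linear bucket index
}
```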

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to tn

On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
>> However,
>> they do exhibit good spatial locality in higher-dimensional space (i.e.,
>> if entry X is accessed first, then the next entry Y is quite likely to
>> be close to X in that space). Does anybody know of a good data
>> structure that can take advantage of this fact to minimize disk
>> accesses?
>
> Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?

It's not cache-oblivious, but he could try the KDB tree:

http://dl.acm.org/citation.cfm?id=582321

or, again for batched queries, there is the Bkd-tree (search on Bkd-tree and Lars Arge) - this is a type of buffer tree, so the operations need to be able to be run as a batch.

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
But 200 bytes * 10 million entries = 2 GB. Can't you use an existing KD-tree, tweak it to fit memory pages, and rely on paging for a start?
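
For instance, something along these lines. A sketch only: std.mmfile is in Phobos, but the fixed 200-byte record layout and the file name are assumptions made for illustration:

```d
// Memory-map the table file and let the OS do the paging on demand.
import std.mmfile;

enum recordSize = 200;   // ~100-byte key + ~100-byte value

void main()
{
    // Open (or create) a read/write mapping big enough for 10 million records.
    auto mmf = new MmFile("table.dat", MmFile.Mode.readWrite,
                          10_000_000L * recordSize, null);

    // The whole file appears as one byte slice; the kernel faults in only
    // the pages that are actually touched.
    auto data = cast(ubyte[]) mmf[];

    // With fixed-size records, record i lives at a fixed offset, so a lookup
    // is plain slicing (an index structure would map keys to record numbers).
    size_t i = 12_345;
    ubyte[] record = data[i * recordSize .. (i + 1) * recordSize];
}
```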

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

It's not a data structure per se, but Google's LevelDB is a very good, quite fast arbitrary key-value storage system (I use it in my project relating a 1-word key to a 200-word value), and it supports batch transactional operations. Think of it as a key-value storage version of SQLite.

https://code.google.com/p/leveldb/

And then the D bindings:

https://github.com/bheads/d-leveldb

And a D wrapper providing a nicer interface to LevelDB:

https://github.com/bheads/leveldb
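
A rough usage sketch against LevelDB's C API (leveldb/c.h), with error handling omitted and the database path made up; the d-leveldb bindings linked above provide complete declarations, so this is only to show the shape of put/get:

```d
// Minimal hand-rolled extern(C) declarations for a subset of leveldb/c.h.
// Build with the LevelDB library installed, e.g. link with -lleveldb.
import core.stdc.stdlib : free;

extern (C)
{
    struct leveldb_t;
    struct leveldb_options_t;
    struct leveldb_readoptions_t;
    struct leveldb_writeoptions_t;

    leveldb_options_t*      leveldb_options_create();
    void                    leveldb_options_set_create_if_missing(leveldb_options_t*, ubyte);
    leveldb_readoptions_t*  leveldb_readoptions_create();
    leveldb_writeoptions_t* leveldb_writeoptions_create();

    leveldb_t* leveldb_open(const(leveldb_options_t)* options, const(char)* name, char** errptr);
    void       leveldb_close(leveldb_t* db);
    void       leveldb_put(leveldb_t* db, const(leveldb_writeoptions_t)* options,
                           const(char)* key, size_t keylen,
                           const(char)* val, size_t vallen, char** errptr);
    char*      leveldb_get(leveldb_t* db, const(leveldb_readoptions_t)* options,
                           const(char)* key, size_t keylen,
                           size_t* vallen, char** errptr);
}

void main()
{
    char* err = null;                       // error handling omitted for brevity
    auto opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    auto db = leveldb_open(opts, "/tmp/lookup.ldb", &err);

    // Store one ~100-byte key / ~100-byte value pair.
    string key = "example-key", val = "example-value";
    auto wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, key.ptr, key.length, val.ptr, val.length, &err);

    // Look it up again. LevelDB keeps keys sorted on disk, so related keys
    // tend to share blocks. The result buffer is malloc'd and must be freed.
    size_t vlen;
    auto ropts = leveldb_readoptions_create();
    char* got = leveldb_get(db, ropts, key.ptr, key.length, &vlen, &err);
    if (got !is null) free(got);
    leveldb_close(db);
}
```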

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On 17.12.2013. 20:08, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
>
> I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
>
> T

The sqlite file format seems to be fairly well documented:

http://www.sqlite.org/fileformat.html

December 17, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to Craig Dillabaugh

On Tue, Dec 17, 2013 at 08:54:17PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
> >On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >>However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
> >
> >Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
>
> It's not cache-oblivious, but he could try the KDB tree:
>
> http://dl.acm.org/citation.cfm?id=582321
>
> or, again for batched queries, there is the Bkd-tree (search on Bkd-tree and Lars Arge) - this is a type of buffer tree, so the operations need to be able to be run as a batch.

I skimmed over the KDB tree paper, and am reading the Bkd-tree paper now; it appears that the latter is an improvement over the former, and has better update performance characteristics. However, it also appears much more complicated to implement.

Another point I forgot to mention is that currently I only ever perform insertions and queries, and the two can actually be combined (basically the table is kept for uniqueness testing). I don't need to delete entries from the table. Eventually I might look into that to reduce disk space requirements, but as far as the actual algorithm is concerned, it's not necessary. So there's potentially a good amount of room here for optimization by foregoing the ability to delete entries.

I'm currently considering using simpler, "dumber" file structures for storing nodes, with an in-memory Bloom filter to eliminate a (hopefully) large number of I/O roundtrips, since I'm expecting that a good fraction of queries will return 'not found'.

T

--
"A one-question geek test. If you get the joke, you're a geek: Seen on a California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs - Natural Intelligence, Inc.
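
For the in-memory Bloom filter idea above, a minimal sketch could look like this; the double-hashing scheme and all sizes are illustrative choices, and a real filter would be sized from the expected entry count and the target false-positive rate:

```d
// Bloom filter for the "is this key definitely absent?" pre-check, so that
// only keys which *may* be present pay for a disk lookup.
struct BloomFilter
{
    ulong[] bits;        // bit array, 64 bits per word
    size_t nbits;
    size_t nhashes;

    this(size_t nbits, size_t nhashes)
    {
        this.nbits = nbits;
        this.nhashes = nhashes;
        bits = new ulong[(nbits + 63) / 64];
    }

    // Double hashing: derive the i-th probe position from two base hashes.
    private size_t probe(const(ubyte)[] key, size_t i) const
    {
        ulong h1 = 14695981039346656037UL;            // FNV-1a over the key bytes
        foreach (b; key) { h1 ^= b; h1 *= 1099511628211UL; }
        ulong h2 = (h1 ^ (h1 >> 33)) * 0xff51afd7ed558ccdUL;   // extra mixing step
        return cast(size_t)((h1 + i * h2) % nbits);
    }

    void add(const(ubyte)[] key)
    {
        foreach (i; 0 .. nhashes)
        {
            immutable p = probe(key, i);
            bits[p / 64] |= 1UL << (p % 64);
        }
    }

    // false => the key was never added; true => it may be present.
    bool mayContain(const(ubyte)[] key) const
    {
        foreach (i; 0 .. nhashes)
        {
            immutable p = probe(key, i);
            if (!(bits[p / 64] & (1UL << (p % 64))))
                return false;
        }
        return true;
    }
}
```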

December 18, 2013 Re: [OT] Efficient file structure for very large lookup tables?

Posted in reply to H. S. Teoh

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
What kind of key do you have? A vector of floats? Integers?

How about some uberfast compression algorithm (e.g. LZ4) to compress pages when you have to offload data to the disk?