Re: [OT] Efficient file structure for very large lookup tables?
December 17, 2013
On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >Another OT thread to pick your brains. :)
> >
> >What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
> >
> >I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
> >
> >I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential.  However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
> >
> >
> >T
> 
> As a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)?  Or is that out of the question.

Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)


> Another question is can your queries be batched?  If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree" which might work well.

Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?


[...]
> If I had to design something quick on the spot, my first guess would be to use a grid on the first two dimensions and then bin the 'points' or keys within each grid square and build a simpler structure on those.  This won't work so well though for really high-dimensional data or if the 'points' are randomly distributed.
>
> Also, what exactly do you mean by "in that space" when you say:
> 
> "if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space".
> 
> Do you mean that the value of Y in the next dimension is numerically
> close (or expected to be) to X?
[...]

How many dimensions is considered "really high"?  The number of dimensions can be up to about 50. Is that high? Also, the points are not randomly distributed; they are highly connected (i.e., they tend to be adjacent to each other).

Basically, one possible representation of the keys is as an n-dimensional array of integers. (They are not natively in that form, but can be compressed into that form easily.) And the way the algorithm works is that if the key (p1,p2,p3,...) is accessed, then the next key (q1,q2,q3,...) accessed is likely to differ from (p1,p2,p3,...) only in a small number of coordinates, and the difference in those coordinates is likely to be very small (usually in the +1/-1 range).
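To make "close" concrete, here is roughly the shape of the keys and of the access pattern (a throwaway sketch; the alias and the short element type are just placeholders, not the actual types in my program):

    import std.math : abs;

    // Placeholder for the real key type: an n-dimensional vector of
    // (relatively small) integers.
    alias Key = short[];

    // Number of coordinates in which two keys differ, and the largest
    // per-coordinate step. For consecutive accesses, changed is expected
    // to be small and maxStep is usually 1.
    size_t coordsChanged(const Key a, const Key b, out int maxStep)
    {
        assert(a.length == b.length);
        size_t changed = 0;
        maxStep = 0;
        foreach (i; 0 .. a.length)
        {
            const d = abs(a[i] - b[i]);
            if (d != 0)
            {
                ++changed;
                if (d > maxStep) maxStep = d;
            }
        }
        return changed;
    }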

Actually, now that I think of it... the problem may be far more tractable than I realized. I was previously thinking of some kind of n-dimensional space-partitioning storage scheme, where the state space is partitioned into n-dimensional cube buckets, and points are placed in their corresponding buckets. However, this had the problem that the n-cube has 2^n vertices, so a bucket of side k would contain k^n points, and since I can't control n, it makes it difficult to map bucket sizes to page sizes for optimal I/O performance. For large n, the bucket size would be unmanageably big (e.g., in the 50-dimensional case).

But it just dawned on me that partitioning into n-cubes is unnecessary, because most of the time I only care about points that differ in a small number of coordinates from the current point. So a more efficient bucket shape would be something closer to an n-cross (the higher-dimensional equivalent of the octahedron), which has only 2*n vertices, or one of its derivatives. This makes bucket sizes far easier to control, and far more efficient to store. The only tricky part is how to partition n-space into n-cross-shaped pieces. It would have to be some kind of tiling (otherwise there'd be ambiguity over which bucket a given point falls into). So now I have to do some research into n-dimensional tilings... :P
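One way to put numbers on that (again just a throwaway sketch, counting candidate neighbouring lattice points rather than continuous volume): if every coordinate may change by +1/-1, a point has 3^n - 1 neighbours, but if exactly one coordinate changes by +1/-1, it has only 2*n:

    import std.math : pow;
    import std.stdio : writefln;

    // Neighbours of a lattice point if every coordinate may change by +1/-1
    // (the n-cube view): 3^n - 1. Computed in floating point because it
    // overflows 64-bit integers long before n = 50.
    double cubeNeighbours(uint n) { return pow(3.0, n) - 1; }

    // Neighbours if exactly one coordinate changes by +1/-1 (the n-cross
    // view): just 2n.
    ulong crossNeighbours(uint n) { return 2UL * n; }

    void main()
    {
        // For n = 50: roughly 7.2e23 versus 100.
        writefln("%s vs %s", cubeNeighbours(50), crossNeighbours(50));
    }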


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."
December 17, 2013
On Tue, Dec 17, 2013 at 09:02:41PM +0100, digitalmars-d-bounces@puremagic.com wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
> 
> But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?

Hmm. You're right, I think I underestimated my data size. :P  The 10 million figure is based on problems that my current program successfully solved (with an in-memory hashtable). I guess that larger problems must far exceed that number (I ran out of memory so I couldn't check just how big it got before I killed the process). Suffice it to say that this is a combinatorial problem, so the number of entries grows exponentially; anything that can help reduce the storage requirements / I/O latency would be a big help.


T

-- 
VI = Visual Irritation
December 17, 2013
On Tuesday, 17 December 2013 at 20:54:56 UTC, H. S. Teoh wrote:
> big it got before I killed the process). Suffice it to say that this is
> a combinatorial problem, so the number of entries grows exponentially;
> anything that can help reduce the storage requirements / I/O latency
> would be a big help.

If you can buffer the queries in a queue, then you can issue a prefetch request to the OS to bring the memory page in from disk at the time you put the query into the queue, to prevent the process from being put to sleep. The length of the queue has to be tailored to how fast the OS can load the pages.
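Something along these lines, maybe (untested sketch; posix_fadvise is the usual POSIX call for this kind of hint, the advice constant below is the Linux value, and the hand-declared prototype assumes a 64-bit off_t):

    // Ask the kernel to start reading a byte range of the file in the
    // background, so that by the time the query at the head of the queue
    // is processed, the page is (hopefully) already in the page cache.
    extern (C) int posix_fadvise(int fd, long offset, long len, int advice);

    enum POSIX_FADV_WILLNEED = 3; // Linux value; check your platform headers

    void hintPrefetch(int fd, long pageOffset, long pageLength)
    {
        // Purely advisory and non-blocking; errors can be ignored here.
        posix_fadvise(fd, pageOffset, pageLength, POSIX_FADV_WILLNEED);
    }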

If the data is uniformly distributed, then perhaps you could partition the disk space with an n-dimensional grid, and then have a key-value store that you page into memory?

If you can do some queries out of order, then you probably should set up some "buckets"/"bins" in the queue and prefetch the page that is referenced by the fullest/oldest bucket, just to avoid pages being pushed in and out of memory all the time.

Otherwise a kd-tree with a scaled grid per leaf that stays within a memory page could probably work, but that sounds like a lot of work. Implementing n-dimensional data structures is conceptually easy, but tedious in reality (can you trust it to be bug-free? It is hard to visualize in text).

If you have to build the spatial data structure yourself, then perhaps an octree, or rather its n-dimensional generalization, would be helpful. You could make it shallow and keep it in memory, and use it both to buffer queries and to store indexes into the on-disk data. Each node would have 2^n children (4 for 2D data, 8 for 3D).

I once read a paper on mapping GIS data to a regular database (mapping 2D to 1D) using Hilbert curves. Not sure if that is of any help.

Hm.
December 17, 2013
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
> On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
>> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
>> >Another OT thread to pick your brains. :)
>> >
>> >What's a good, efficient file structure for storing extremely large
>> >lookup tables? (Extremely large as in > 10 million entries, with keys
>> >and values roughly about 100 bytes each.) The structure must support
>> >efficient adding and lookup of entries, as these two operations will
>> >be very frequent.
>> >
>> >I did some online research, and it seems that hashtables perform
>> >poorly on disk, because the usual hash functions cause random
>> >scattering of related data (which are likely to be accessed with higher
>> >temporal locality), which incurs lots of disk seeks.
>> >
>> >I thought about B-trees, but they have high overhead (and are a pain
>> >to implement), and also only exhibit good locality if table entries
>> >are accessed sequentially; the problem is I'm working with
>> >high-dimensional data and the order of accesses is unlikely to be
>> >sequential.  However, they do exhibit good spatial locality in
>> >higher-dimensional space (i.e., if entry X is accessed first, then
>> >the next entry Y is quite likely to be close to X in that space).
>> >Does anybody know of a good data structure that can take advantage of
>> >this fact to minimize disk accesses?
>> >
>> >
>> >T
>> 
>> As a first attempt could you use a key-value database (like REDIS if
>> you have enough memory to fit everything in)?  Or is that out of the
>> question.
>
> Well, currently I'm just using plain ole in-memory hash tables to store
> the data. It works quite well for small problems, but I'm finding that
> memory runs out quickly with problems of moderate size, so I need to
> move the storage to disk. I'm just looking to minimize I/O latency,
> which hash tables are notorious for. (A previous incarnation of the
> program uses Berkeley DB hash tables, but it becomes I/O bound and slow
> when the problem size is large.)
>

I can't really add anything on the fancy data structures/algorithms; it's all a bit out of my league.

That said, have you considered asynchronously maintaining the locality in memory, i.e. prefetching? If you have predictable access and good locality it should perform well. In particular, you can build a prefetcher that is tuned to the particular access pattern.
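Something shaped like this, perhaps (completely untested, names made up): a worker thread that receives the bucket ids you predict you'll need and touches their pages so they're warm by the time the main loop actually asks for them.

    import std.concurrency : receive, send, spawn, Tid;

    // Worker: receives bucket ids predicted by the main loop and touches
    // the corresponding pages so the real lookup later doesn't block.
    void prefetchWorker()
    {
        bool running = true;
        while (running)
        {
            receive(
                (ulong bucketId) { /* read that page from disk and discard it */ },
                (bool stop) { running = false; }
            );
        }
    }

    // In the main loop, right after predicting the next few keys:
    //     auto prefetcher = spawn(&prefetchWorker);
    //     prefetcher.send(predictedBucketId);   // fire and forget
    //     ...
    //     prefetcher.send(true);                // shut the worker down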
December 17, 2013
On Tue, Dec 17, 2013 at 09:43:37PM +0100, Dylan Knutson wrote:
> It's not a data structure per se, but Google's LevelDB is a very good, quite fast arbitrary key-value storage system (I use it in my project, relating a 1-word key to a 200-word value) supporting batch transactional operations.
[...]

I'm not looking for a transactional DB, though; I'm looking for a fast key-value storage system with very high insert/query throughput, and hopefully modest space requirements.  Deletion is not needed, though for very large problems it might be useful for reducing disk space usage. It must be able to handle huge numbers of insertions (say 10 million entries for small/moderate problem instances, and perhaps billions or trillions for a large problem instance), and must be able to do so very fast -- the faster the better.

Preferably, it would take advantage of the n-dimensional locality that I described in my other post. Basically, the idea is to reduce I/O roundtrips by caching disk pages in memory, because there are far more entries than will fit in memory all at once. But since the number of entries is very large, to prevent thrashing I have to maximize the number of cache hits. It does no good if I have to load 100 disk pages to answer 100 queries; if 100 pages fit in RAM, I'd like most of the entries in those 100 pages to be used in answering queries before they get swapped out for new pages to be loaded in. In the ideal case (probably not attainable), those 100 pages will contain exactly the next, say, million entries I'm about to query, so that once I'm done with them, I can swap those pages out and never have to load them back in again, making room for the next 100 pages to be loaded and queried. So the layout of the disk pages should correspond as closely as possible to the order in which the algorithm will look up entries. (Unfortunately it's not 100% predictable, as it depends on the problem being solved, but at least there are general trends we can take advantage of.) A generic storage scheme like SQLite will probably not perform as fast as I'd like.
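To illustrate the kind of caching layer I have in mind (just a bare-bones sketch, not code from the actual program; eviction is a linear scan to keep it short):

    // bucketId -> page contents, with least-recently-used eviction.
    struct PageCache
    {
        enum capacity = 100;          // "100 pages fit in RAM"

        ubyte[][ulong] pages;         // cached pages by bucket id
        ulong[ulong]   lastUse;       // bucket id -> logical timestamp
        ulong clock;

        ubyte[] get(ulong bucketId, ubyte[] delegate(ulong) loadFromDisk)
        {
            if (auto p = bucketId in pages)
            {
                lastUse[bucketId] = ++clock;
                return *p;
            }
            if (pages.length >= capacity)
            {
                // Evict the least recently used page; a linear scan is
                // fine for a sketch with ~100 entries.
                ulong victim;
                ulong oldest = ulong.max;
                foreach (id, stamp; lastUse)
                    if (stamp < oldest) { oldest = stamp; victim = id; }
                pages.remove(victim);
                lastUse.remove(victim);
            }
            auto page = loadFromDisk(bucketId);
            pages[bucketId] = page;
            lastUse[bucketId] = ++clock;
            return page;
        }
    }

The whole game is then choosing the bucketing so that consecutive queries keep hitting the same few cached pages.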


T

-- 
It is impossible to make anything foolproof because fools are so ingenious. -- Sammy
December 18, 2013
On Tuesday, 17 December 2013 at 23:35:40 UTC, H. S. Teoh wrote:
> to answer 100 queries; if 100 pages fit in RAM, I'd like most

On OS X a page is 4K, so there are only about 20 entries per page. If you can get hold of an SSD (to avoid rotational latency) and use a buffer/queue while waiting, then you should be OK?
December 18, 2013
On Wed, Dec 18, 2013 at 01:14:58AM +0100, Francesco Cattoglio wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >Another OT thread to pick your brains. :)
> >
> >What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
> 
> What kind of key do you have? vector of floats? integers?

Vector of (relatively small) integers.


> How about some uberfast compression algorithm (eg: LZ4) to compress pages when you have to offload data to the disk?

That's a good idea, actually: since the keys are very similar to each other, they should compress very well when many are placed together in a single page.
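Something like this would be easy to try (a quick sketch using std.zlib as a stand-in, since LZ4 isn't in Phobos; the point is just that similar keys packed into the same page give the compressor long, nearly-repeating runs to work with):

    import std.zlib : compress;

    // Pack a page's worth of keys into one buffer and compress it before
    // writing to disk. Keys that differ in only a few coordinates produce
    // highly redundant input, so this should shrink well.
    auto packPage(const short[][] keysInPage)
    {
        ubyte[] raw;
        foreach (key; keysInPage)
            raw ~= cast(const(ubyte)[]) key;
        return compress(raw);
    }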


T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
December 18, 2013
If you can afford to do a one-time sort of both sets of data, then you can do a very simple single-pass intersect through both sets.  Chances are that won't work.  So, as a potential alternative, use that approach combined with a smaller delta since the last sort, hopefully small enough to fit in memory in a nice, convenient-to-probe structure.  Periodically, as the delta grows too big, merge it in with the primary data file.
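Roughly this shape, perhaps (untested sketch, all names invented; the sorted array here stands in for the sorted primary data file on disk):

    import std.algorithm.sorting : sort;
    import std.range : assumeSorted;
    import std.typecons : Tuple;

    alias Entry = Tuple!(string, "key", string, "value");

    struct DeltaTable
    {
        enum maxDelta = 1_000_000;

        string[string] delta;   // recent inserts, small enough to stay in memory
        Entry[] primary;        // sorted run, standing in for the primary data file

        void insert(string key, string value)
        {
            delta[key] = value;
            if (delta.length >= maxDelta)
                mergeDelta();
        }

        bool lookup(string key, out string value)
        {
            // Check the in-memory delta first, then binary-search the sorted run.
            if (auto p = key in delta)
            {
                value = *p;
                return true;
            }
            auto hits = primary.assumeSorted!((a, b) => a.key < b.key)
                               .equalRange(Entry(key, null));
            if (!hits.empty)
            {
                value = hits.front.value;
                return true;
            }
            return false;
        }

        void mergeDelta()
        {
            // Sketch only: fold the delta in and re-sort. The real version
            // would stream a single sequential merge pass through the file.
            foreach (k, v; delta)
                primary ~= Entry(k, v);
            primary.sort!((a, b) => a.key < b.key);
            delta = null;
        }
    }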

Try looking at the access patterns to see if there's something you can do to adjust them to make the storage problem simpler.

On 12/17/13 11:08 AM, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly about 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
>
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which are likely to be accessed with higher temporal
> locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and also only exhibit good locality if table entries are
> accessed sequentially; the problem is I'm working with high-dimensional
> data and the order of accesses is unlikely to be sequential. However,
> they do exhibit good spatial locality in higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space).  Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
>
>
> T
>

December 18, 2013
On Tue, Dec 17, 2013 at 05:09:43PM -0800, Brad Roberts wrote:
> If you can afford to do a one-time sort of both sets of data, then you can do a very simple single-pass intersect through both sets.  Chances are that won't work.  So, as a potential alternative, use that approach combined with a smaller delta since the last sort, hopefully small enough to fit in memory in a nice, convenient-to-probe structure.  Periodically, as the delta grows too big, merge it in with the primary data file.

The problem is that the data is generated by the algorithm as it runs. If it were known beforehand, the problem would be much simpler. :) (Well, sort of... the reason it's generated on-the-fly is that the full data set is far too big to store anywhere -- the algorithm only generates the subset that is needed for it to run, and that's the subset that needs to be stored in the table.)


> Try looking at the access patterns to see if there's something you can do to adjust them to make the storage problem simpler.
[...]

I'm currently researching alternative ways of partitioning n-space that may allow high-hit-rate caching of disk pages. When n is large, partitioning into n-cubes has very poor cache hit rates, due to the (counterintuitive) fact that in high dimensions most of the n-cube's volume lies near its vertices, yet the part most likely to be looked up by the algorithm next lies near the coordinate axes. Since the n-cube's volume grows exponentially with n, and most of that volume lies away from the coordinate axes, almost all of the cached page will never be used again, which makes for very poor cache performance. (Not to mention that buckets in the shape of n-cubes have volume that is O(2^n), which means they will probably span many disk pages and the program will have to scan through them all to find the entry it wants.)

If there were a way to partition n-space such that each cached page corresponded to the shape of an n-cross (an n-dimensional octahedron), whose size grows only linearly with n, or some other such shape, this would allow the bucket size to be only O(n) rather than O(2^n), and would furthermore exhibit better cache performance, since the entries in the bucket would lie in the region most likely to be looked up by the algorithm next.


T

-- 
It always amuses me that Windows has a Safe Mode during bootup. Does that mean that Windows is normally unsafe?
December 18, 2013
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
> On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:

>> Another question is can your queries be batched?  If that is the
>> case and your data is bigger than your available memory, then try
>> Googling "Lars Arge Buffer Tree" which might work well.
>
> Hmm. I'm not sure if it's possible to batch the queries unless I
> multithread the algorithm; the queries are generated on-the-fly. But
> they *are* somewhat predictable (spatial locality: see below), so maybe
> that might help?
>
I should clarify a bit what I meant by batched.  If you have a sequence of queries q_1, q_2, ..., q_n (where I guess in your case a query can be/is an insertion), then by 'batch' process what I mean is that you do not need to get the answers on the fly. Is it sufficient to have all queries answered when the program terminates?

Note that these techniques still account for the 'order' of the queries, so the answer to query q_i will be reported correctly (i.e., the same as if the queries were performed sequentially on a standard data structure).


Also, if you are still considering using hashing, the following paper may be of interest to you (I haven't read it in detail, but it doesn't look too crazy):

http://arxiv.org/abs/1104.2799
