November 15, 2013 [OT] Best algorithm for extremely large hashtable?
This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask.
I'm trying to implement an algorithm that traverses a very large graph,
and I need some kind of data structure to keep track of which nodes have
been visited, that (1) allows reasonably fast lookups (preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of compression
would be nice).
The graph nodes can be represented in various ways, but possibly the most convenient representation is as n-dimensional vectors of (relatively small) integers. Furthermore, graph edges are always between vectors that differ only by a single coordinate; so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid. The hashtable, therefore, needs to represent some connected subset of this grid in a space-efficient manner, that still allows fast lookup times.
The naïve approach of using an n-dimensional bit array is not feasible because n can be quite large (up to 100 or so), and the size of the grid itself can get up to about 10 in each direction, so we're looking at a potential maximum size of 10^100, clearly impractical to store explicitly.
Are there any known good algorithms for tackling this problem?
Thanks!
T
--
Creativity is not an excuse for sloppiness.
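
For concreteness, the visited-set interface being asked about might be sketched as follows (Python used purely for illustration; the names `mark`, `seen`, and `neighbors`, and the 0..9 coordinate range, are assumptions, not from the post). A plain hash set gives the O(1) lookups but none of the compression, which is exactly the gap the question is about:

```python
# Visited-set sketch: n-dimensional integer vectors stored as tuples,
# which Python hashes natively. Edges connect nodes that differ in
# exactly one coordinate, matching the grid structure described above.

def neighbors(node, lo=0, hi=9):
    """Yield every grid node differing from `node` in exactly one coordinate."""
    for i, c in enumerate(node):
        for v in range(lo, hi + 1):
            if v != c:
                yield node[:i] + (v,) + node[i + 1:]

visited = set()   # O(1) expected lookup, but one entry per visited node:
                  # no compression, which is the problem for huge graphs

def mark(node):
    visited.add(node)

def seen(node):
    return node in visited
```

The set only stores nodes actually reached, so memory scales with the visited portion of the graph rather than the 10^100 potential states; whether that is small enough depends entirely on how far the traversal spreads.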
November 15, 2013 Re: [OT] Best algorithm for extremely large hashtable?
Posted in reply to H. S. Teoh

15-Nov-2013 22:41, H. S. Teoh writes:
> This isn't directly related to D (though the code will be in D), and I
> thought this would be a good place to ask.
>
> I'm trying to implement an algorithm that traverses a very large graph,
> and I need some kind of data structure to keep track of which nodes have
> been visited, that (1) allows reasonably fast lookups (preferably O(1)),

Store it as a bit-flag in whatever structure serves as a node. It's going to be the fastest option anyway, and you indeed get to use only 1 bit per node (and given padding and whatnot, you may already have a couple of spare bytes per node).

--
Dmitry Olshansky
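
The bit-flag idea might look like this (a hypothetical sketch; `Node`, `flags`, and `VISITED` are illustrative names, shown in Python rather than D):

```python
# Visited flag stored inside the node itself, in a spare bit of an
# existing flags field, rather than in a separate lookup structure.

VISITED = 1 << 0   # one bit borrowed from the node's flags field

class Node:
    __slots__ = ("coords", "flags")
    def __init__(self, coords):
        self.coords = coords
        self.flags = 0
```

This assumes the nodes are materialized in memory; for an implicit graph with 10^100 potential states the nodes are generated on the fly, which is presumably why the original question asks for a separate visited structure.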
November 15, 2013 Re: [OT] Best algorithm for extremely large hashtable?
Posted in reply to H. S. Teoh

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
> This isn't directly related to D (though the code will be in D), and I
> thought this would be a good place to ask.
>
> I'm trying to implement an algorithm that traverses a very large graph,
> and I need some kind of data structure to keep track of which nodes have
> been visited, that (1) allows reasonably fast lookups (preferably O(1)),
> and (2) doesn't require GB's of storage (i.e., some kind of compression
> would be nice).

A while ago I set out to write a solver for a group of problems which can be described as breadth-first traversal of extremely large implicit graphs. Some code here (C++): https://github.com/CyberShadow/DDD. The project uses delayed duplicate detection, to allow the number of nodes to exceed available RAM. What we found was that in certain cases, delayed duplicate detection beat hash tables even while filtering duplicates in memory, I think mainly because DDD is more parallelizable than hash tables. You mentioned compression - perhaps a Bloom filter will fit your needs, as an optimization?
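
A Bloom filter trades a small false-positive rate for dramatic space savings, which fits the "compression" wish: a few bits per node instead of the full coordinate vector. A minimal sketch (Python for illustration; the class name, sizes, and double-hashing-via-SHA-256 scheme are assumptions, not from the DDD project):

```python
import hashlib

class BloomFilter:
    """Approximate membership: no false negatives, rare false positives."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _indexes(self, item):
        # Derive k independent bit positions by salting a SHA-256 hash.
        data = repr(item).encode()
        for i in range(self.k):
            h = hashlib.sha256(data + i.to_bytes(2, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))
```

The caveat for a visited set: a false positive means a node is wrongly treated as already visited and gets skipped, so for an exact traversal the filter can only serve as a fast pre-check in front of an exact structure, as the "optimization" hedge above suggests.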
November 15, 2013 Re: [OT] Best algorithm for extremely large hashtable?
Posted in reply to H. S. Teoh

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
> so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid.
Could you perhaps elaborate on this point?
As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives.
How is this the same as the edges of an n-dimensional grid?
November 15, 2013 Re: [OT] Best algorithm for extremely large hashtable?
Posted in reply to John Colvin

On Friday, 15 November 2013 at 19:48:21 UTC, John Colvin wrote:
> i.e. the edges are the euclidean basis vectors and their negatives.
sorry, I meant *integer multiples* of the euclidean basis vectors.
November 15, 2013 Re: [OT] Best algorithm for extremely large hashtable?
Posted in reply to H. S. Teoh

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
> This isn't directly related to D (though the code will be in D), and I
> thought this would be a good place to ask.
>
> I'm trying to implement an algorithm that traverses a very large graph,
> and I need some kind of data structure to keep track of which nodes have
> been visited, that (1) allows reasonably fast lookups (preferably O(1)),
> and (2) doesn't require GB's of storage (i.e., some kind of compression
> would be nice).
>
> The graph nodes can be represented in various ways, but possibly the
> most convenient representation is as n-dimensional vectors of
> (relatively small) integers. Furthermore, graph edges are always between
> vectors that differ only by a single coordinate; so the edges of the
> graph may be thought of as a subset of the edges of an n-dimensional
> grid. The hashtable, therefore, needs to represent some connected subset
> of this grid in a space-efficient manner, that still allows fast lookup
> times.
>
> The naïve approach of using an n-dimensional bit array is not feasible
> because n can be quite large (up to 100 or so), and the size of the grid
> itself can get up to about 10 in each direction, so we're looking at a
> potential maximum size of 10^100, clearly impractical to store
> explicitly.
So, -10 to 10 in discrete steps. This means 5 bits per dimension and 500 bits for a single coordinate. Is the graph distributed over a compute cluster, or does it fit into a single computer? With a few GB of RAM, this means your graph is quite sparse, yet nodes are connected ("differ only by a single coordinate")?
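
The 5-bits-per-dimension packing estimated above could be sketched like this (Python for illustration; the -10..10 range is this post's reading of the original question, and the function names are made up). Packing each vector into one big integer gives a compact, hashable key:

```python
# Pack an n-dimensional coordinate vector into a single integer key,
# 5 bits per dimension: values -10..10 shift to 0..20, which fits in
# 5 bits. With n = 100 this yields a 500-bit key, as estimated above.

BITS = 5
OFFSET = 10

def pack(coords):
    key = 0
    for c in coords:
        assert -OFFSET <= c <= OFFSET
        key = (key << BITS) | (c + OFFSET)
    return key

def unpack(key, n):
    coords = []
    for _ in range(n):
        coords.append((key & ((1 << BITS) - 1)) - OFFSET)
        key >>= BITS
    return list(reversed(coords))
```

Such keys can go straight into a hash set, and since neighbors differ in one coordinate, a neighbor's key differs from the node's by a single 5-bit field, which makes incremental key updates cheap.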
Can you preprocess? I mean, walk all the nodes O(n) to compute a good (perfect?) hash function.
In general, I think you should either store the flag right in the graph node or mirror the graph structure.
I do not know any concrete algorithms.
Copyright © 1999-2021 by the D Language Foundation