Thread overview
Re: [OT] Best algorithm for extremely large hashtable?
Nov 15, 2013
H. S. Teoh
Nov 15, 2013
bearophile
Nov 15, 2013
qznc
Nov 15, 2013
H. S. Teoh
Nov 15, 2013
Ivan Kazmenko
Nov 15, 2013
John Colvin
Nov 15, 2013
H. S. Teoh
Nov 15, 2013
Peter Alexander
Nov 15, 2013
H. S. Teoh
November 15, 2013
On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:
> 15-Nov-2013 22:41, H. S. Teoh пишет:
> >This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask.
> >
> >I'm trying to implement an algorithm that traverses a very large
> >graph, and I need some kind of data structure to keep track of which
> >nodes have been visited, that (1) allows reasonably fast lookups
> >(preferably O(1)),
> 
> Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node).
[...]

There are too many nodes to keep in memory, actually. The algorithm also doesn't need to store the nodes; it can easily generate them on-the-fly when it needs them.  The challenge is when it needs to know whether a particular generated node has been visited before. I'd like to be able to have a fast lookup (hopefully O(1)) that tells me whether node X has been seen before. The problem is, there may have been a very large number of previously-seen nodes, so I need some way of storing this information in a space-efficient way.

I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes).


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
November 15, 2013
H. S. Teoh:

> I'm hoping for a kind of compressed representation where multiple
> adjacent nodes can be represented in a very compact way, by somehow
> taking advantage of the fact that they lie on an n-dimensional grid and
> the fact that graph edges are always a subset of grid edges (as opposed
> to edges between two arbitrary nodes).

Look for Succinct Data-Structures, Graphs. You can impose further memory savings in your problem.

Bye,
bearophile
November 15, 2013
On 11/15/13 11:20 AM, H. S. Teoh wrote:
> I'm hoping for a kind of compressed representation where multiple
> adjacent nodes can be represented in a very compact way, by somehow
> taking advantage of the fact that they lie on an n-dimensional grid and
> the fact that graph edges are always a subset of grid edges (as opposed
> to edges between two arbitrary nodes).

The classic trick here is parallel arrays - you store a bit array in parallel with an array of nodes.

Andrei


November 15, 2013
On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
> On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
> >so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid.
> 
> Could perhaps elaborate on this point?
> 
> As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives.

Actually, i can only be 1 or -1.


> How is this the same as the edges of an n-dimensional grid?

Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1. So for the 2D case, if you graph the nodes on the plane, the edges would be horizontal/vertical line segments of unit length. If you consider the 2D grid's edges to be *all* possible such line segments, then in that sense you could think of the graph's edges as being a subset of the 2D grid's.

I was hoping that this fact can be taken advantage of, to make a compact representation of visited nodes.


T

-- 
By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality. -- D. Knuth
November 15, 2013
On Friday, 15 November 2013 at 19:21:35 UTC, H. S. Teoh wrote:
> On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:
>> 15-Nov-2013 22:41, H. S. Teoh пишет:
>> >This isn't directly related to D (though the code will be in D), and
>> >I thought this would be a good place to ask.
>> >
>> >I'm trying to implement an algorithm that traverses a very large
>> >graph, and I need some kind of data structure to keep track of which
>> >nodes have been visited, that (1) allows reasonably fast lookups
>> >(preferably O(1)),
>> 
>> Store as a bit-flag in whatever structure that serves as node. It's
>> going to be the fastest anyway and you indeed get to use only 1 bit
>> per node (and given padding and whatnot you may already have a couple
>> of bytes to spare per node).
> [...]
>
> There are too many nodes to keep in memory, actually. The algorithm also
> doesn't need to store the nodes; it can easily generate them on-the-fly
> when it needs them.  The challenge is when it needs to know whether a
> particular generated node has been visited before. I'd like to be able
> to have a fast lookup (hopefully O(1)) that tells me whether node X has
> been seen before. The problem is, there may have been a very large
> number of previously-seen nodes, so I need some way of storing this
> information in a space-efficient way.
>
> I'm hoping for a kind of compressed representation where multiple
> adjacent nodes can be represented in a very compact way, by somehow
> taking advantage of the fact that they lie on an n-dimensional grid and
> the fact that graph edges are always a subset of grid edges (as opposed
> to edges between two arbitrary nodes).

A Trie might be an idea, especially the bitwise variant.

http://en.wikipedia.org/wiki/Trie
November 15, 2013
On Fri, Nov 15, 2013 at 09:03:17PM +0100, qznc wrote:
> On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
[...]
> >I'm trying to implement an algorithm that traverses a very large
> >graph, and I need some kind of data structure to keep track of which
> >nodes have been visited, that (1) allows reasonably fast lookups
> >(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
> >some kind of compression would be nice).
> >
> >The graph nodes can be represented in various ways, but possibly the most convenient representation is as n-dimensional vectors of (relatively small) integers. Furthermore, graph edges are always between vectors that differ only by a single coordinate; so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid. The hashtable, therefore, needs to represent some connected subset of this grid in a space-efficient manner, that still allows fast lookup times.
> >
> >The naïve approach of using an n-dimensional bit array is not feasible because n can be quite large (up to 100 or so), and the size of the grid itself can get up to about 10 in each direction, so we're looking at a potential maximum size of 10^100, clearly impractical to store explicitly.
> 
> So, -10 to 10 in discrete steps. This means 5 bits per dimension and 500 bits for a single coordinate. Is the graph distributed of a compute cluster or does it fit into single computer? With a few GB of RAM, this means your graph is quite sparse, yet nodes are connected ("differ only by a single coordinate")?

The graph is not explicitly stored, because it is far too large. The nodes are generated on-the-fly as needed; the challenge is to know whether a particular generated node has been generated before. Basically, the equivalent of a hashtable of previously-visited nodes, but I'm hoping for (1) some kind of compressed representation so that as many nodes as possible can be kept in RAM for speed, and (2) something more efficient for disk I/O, because I'm expecting the number of generated nodes will not fit in RAM and will have to be put on disk (and hashtables don't do very well with disk I/O due to the latency).


> Can you preprocess? I mean, walk all the nodes O(n) to compute a
> good (perfect?) hash function.

Walking the nodes defeats the purpose, because the whole idea is to *not* have to generate every node, just those selected by the algorithm (those are already very unwieldy in number, nevermind generating *all* nodes).


> In general, I think you should either store the flag right in the graph node or mirror the graph structure.
[...]

That doesn't solve the problem when graph nodes are generated on-the-fly, since I wouldn't know where to look up the node structure without first knowing that it has been generated before.


T

-- 
Ph.D. = Permanent head Damage
November 15, 2013
On Fri, Nov 15, 2013 at 08:29:22PM +0100, Vladimir Panteleev wrote:
> On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
[...]
> >I'm trying to implement an algorithm that traverses a very large
> >graph, and I need some kind of data structure to keep track of which
> >nodes have been visited, that (1) allows reasonably fast lookups
> >(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
> >some kind of compression would be nice).
> 
> A while ago I set out to write a solver for a group of problems which can be described as traversing in breath extremely large implicit graphs. Some code here (C++): https://github.com/CyberShadow/DDD. The project uses delayed duplicate detection, to allow the number of nodes to exceed available RAM.
> 
> What we found was that in certain cases, delayed duplicate detection beat hash tables even while filtering duplicates in memory, I think mainly because DDD is more parallelizable than hash tables.
> 
> You mentioned compression - perhaps a bloom filter will fit your needs, as an optimization?

Thanks!! I think delayed duplicate detection is what I'm looking for. The bloom filter idea is an interesting one; I'll definitely consider it as a later optimization once I get DDD working.


T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
November 15, 2013
On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:
> The graph is not explicitly stored, because it is far too large. The
> nodes are generated on-the-fly as needed; the challenge is to know
> whether a particular generated node has been generated before.
> Basically, the equivalent of a hashtable of previously-visited nodes,
> but I'm hoping for (1) some kind of compressed representation so that as
> many nodes as possible can be kept in RAM for speed, and (2) something
> more efficient for disk I/O, because I'm expecting the number of
> generated nodes will not fit in RAM and will have to be put on disk (and
> hashtables don't do very well with disk I/O due to the latency).

Have multiple levels of lookup.

For example, let's say your entire grid G is 1,000,000 x 1,000,000.

Have a top level grid H that is 1000 x 1000. Each H[i, j] corresponds to a 1000 x 1000 region in G. If any cell has been visited in that region then H[i, j] points to a direct representation of that 1000 x 1000 subregion of G, if not, then it points to null. Checking if G[i, j] has been visited is as easy as:

visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000, j%1000];

You can easily page-in and page-out those 1000 x 1000 subregions to disk if needed, just keeping the active ones in memory. H will stay in memory (only 8MB). Obviously, you'd be better off using a power of 2 instead of 1000 for fast div and mod.

And of course, you can have as many levels to this as you like.
November 15, 2013
On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
> On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
>> How is this the same as the edges of an n-dimensional grid?
>
> Basically, adjacent nodes differ only in a single coordinate, and that
> difference can only be 1 or -1. So for the 2D case, if you graph the
> nodes on the plane, the edges would be horizontal/vertical line segments
> of unit length. If you consider the 2D grid's edges to be *all* possible
> such line segments, then in that sense you could think of the graph's
> edges as being a subset of the 2D grid's.
>
> I was hoping that this fact can be taken advantage of, to make a compact
> representation of visited nodes.

How dense is the graph?

For example, if it contains every possible edge described (+-1 to every single coordinate), for a breadth-first search, we can manage to just keep track of a single integer: the farthest distance traveled.

For very dense graphs, you can perhaps apply a similar idea: represent large visited areas as "diamonds" with center and radius, then try to "factor" the visited areas into diamonds on-the-fly.  Possibly they will be "diamonds with a [much shorter] list of holes".  For example, we say that all nodes not further than 3 from (0, 0, ..., 0) are visited, and store a single center (origin) and a radius (3) instead of all (dimensions[100] to the power of distance[3]) of them.  The lookup will be at most O (number-of-diamonds) plus a hash table lookup for the individual nodes.  Perhaps for certain graphs, we can find a good compromise between the number of diamonds and the number of individual stored nodes.

The above is just an idea, perhaps it won't be feasible by itself when you get to the details, but can inspire some related optimization.

Also, the size of the border of n-dimensional area is a (n-1)-dimensional object, and for dense enough graphs, you can hope that the number of elements in it is less by an order of magnitude than the total number of visited nodes.  However, for too much dimensions (as in your case, 10^100 -> 10^99), it does not seem to help much.

Another question is, when will the graph traversal end?  For example, if you are certain that you won't need to visit more than say one million nodes, a simple hash table storing the node representations at the hash indices will suffice.  On the other hand, if you plan to visit 10^12 nodes, and the graph is not very sparse or very dense (and not regular in any obvious way besides what is described), perhaps you won't get the required compression level (1/1000) anyway.

Ivan Kazmenko.
November 15, 2013
On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
> On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
>> On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
>> >so the edges of the graph may be thought of as a subset of the
>> >edges of an n-dimensional grid.
>> 
>> Could perhaps elaborate on this point?
>> 
>> As far as I can see, the possible values of the nodes are all the
>> points in N^n (or Z^n, dunno whether you have negatives) and the
>> edges are all in the set of i*e_j for i in Z and j in N, where e_j
>> is the unit vector along one axis, i.e. the edges are the euclidean
>> basis vectors and their negatives.
>
> Actually, i can only be 1 or -1.
>
>
>> How is this the same as the edges of an n-dimensional grid?
>
> Basically, adjacent nodes differ only in a single coordinate, and that
> difference can only be 1 or -1.

Ah, ok.

> So for the 2D case, if you graph the
> nodes on the plane, the edges would be horizontal/vertical line segments
> of unit length. If you consider the 2D grid's edges to be *all* possible
> such line segments, then in that sense you could think of the graph's
> edges as being a subset of the 2D grid's.

If you're going to chuck away the orthogonal location of the edges and only consider parallel position then you might as well go all the way:

In the 2-D example, the edges are all +/-(0,1) and +/-(1,0). There's no difference between the edge connecting (x, p) to (x, p+1) and the edge connecting (x, q) to (x, q+1). In the general case there's only 2N possible unique edges.