Cantor tuple function, BigInt, and integer overflow

December 20, 2012
Re: Cantor tuple function, BigInt, and integer overflow
Posted by H. S. Teoh
Permalink
H. S. Teoh
Permalink
On Thu, Dec 20, 2012 at 07:46:39PM +0100, Joseph Rushton Wakeling wrote:
> Hello all,
> 
> I've been playing around trying to implement the Cantor tuple function (see attached code), which is essentially a way to map a sequence x1, x2, ..., xN of non-negative integers to a single unique number.  See: https://en.wikipedia.org/wiki/Pairing_function#Cantor_pairing_function
> 
> The inspiration came from a need to check pairs of numbers for uniqueness.

This is an interesting problem. It is related to the computation of the Cartesian product of infinite ranges: your unique number is simply the index of the given pair in the range over the N-ary Cartesian product of the natural numbers.  Of course, it's very inefficient to compute the Cartesian product just to find the index of a particular sequence, so a direct mapping is desirable.

The simplest method, mathematically speaking, is to make use of the Prime Factorization theorem, that every non-negative integer is the product of a unique set of primes powers. So to find the index of a sequence, simply assign consecutive primes to them, say 2, 3, 5, 7, 11, 13, ... up to the N'th prime. Then the index is given by:

	I = 2^x1 * 3^x2 * 5^x3 * ... * pN^xN

This index is mathematically guaranteed to be unique for every sequence. However, the problem with this is that it tends to produce very, VERY large numbers, and is rather expensive to compute to boot.

Given that we're working on computers where the bit width of x1...xN is fixed (currently 32-bit or 64-bit), we can do better. If we concatenate the elements of the sequence to form a 32*N (or 64*N) bit BigInt, that also gives us a unique index for the sequence. Assuming that the numbers in your sequence cover a good part of the range of a 32-bit (or 64-bit) integer, this encoding produces much smaller numbers than the prime product method.

But then, concatenation just means that you might as well just store the sequence as-is (since that's what concatenation amounts to!), and make comparisons based on that.

However, you probably want to have a faster means of comparison. And so this is where hashtables come in: if you only have a relatively small number of sequences (compared with the potentially wide range of values of elements in the sequence and wide variance of sequence length), then computing huge unique indices for them is overkill. A more efficient method is to compute a hash for the sequence that has a low collision rate, and for colliding entries just keep a simple list of some sort containing the actual sequences, so that you can fallback to bitwise comparison.

IOW, bool[Sequence] may be all you need to solve your problem, depending on what you're trying to do.


T

-- 
I see that you JS got Bach.
Forums