Thread overview
[OT] Algorithm question
May 01, 2017
H. S. Teoh
May 01, 2017
Jack Stouffer
May 01, 2017
John Colvin
May 01, 2017
Moritz Maxeiner
May 01, 2017
Timon Gehr
May 01, 2017
Moritz Maxeiner
May 01, 2017
Mike B Johnson
May 01, 2017
Andrea Fontana
May 01, 2017
Andrea Fontana
May 01, 2017
Ivan Kazmenko
May 01, 2017
MysticZach
May 01, 2017
MysticZach
May 01, 2017
MysticZach
May 02, 2017
Ivan Kazmenko
May 02, 2017
Ivan Kazmenko
May 02, 2017
MysticZach
May 02, 2017
MysticZach
May 02, 2017
Timon Gehr
May 02, 2017
MysticZach
May 03, 2017
MysticZach
May 04, 2017
Timon Gehr
May 04, 2017
MysticZach
May 04, 2017
MysticZach
May 12, 2017
H. S. Teoh
April 30, 2017
I've been thinking about the following problem, and thought I'd pick the brains of the bright people around these parts...

Given a set A of n elements (let's say it's a random-access range of
size n, where n is relatively large), and a predicate P(x) that
specifies some subset of A of elements that we're interested in, what's
the best algorithm (in terms of big-O time complexity) for selecting a
random element x satisfying P(x), such that elements that satisfy P(x)
have equal probability of being chosen? (Elements that do not satisfy
P(x) are disregarded.)

Which elements of A satisfy P(x) is not known ahead of time, nor is the
relative proportion of elements that satisfy P(x) or not.

Note that A is considered to be immutable (swapping elements or
otherwise changing A is not allowed).

So far, I have come up with the following algorithms:

1) "Random shooting":

	auto r = ... /* elements of A */
	for (;;) {
		auto i = uniform(0, r.length);
		if (P(r[i])) return r[i];
	}

   Advantages: If a large percentage of elements in A satisfy P(x), then
   the loop will terminate within a small number of iterations; if the
   majority of elements satisfy P(x), this will terminate well below n
   iterations.

   Disadvantages: If only a small percentage of elements satisfy P(x),
   then this loop could take arbitrarily long to terminate, and it will
   not terminate at all if no elements satisfy P(x).
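
   In expectation, if k out of the n elements satisfy P(x), each shot
   succeeds with probability k/n, so the loop takes about n/k iterations
   on average. A quick Python check of that claim (a sketch, names made
   up, not the D code above):

```python
import random

def random_shoot(r, P):
    """Keep sampling uniformly at random until we hit an element
    satisfying P; return the element and the number of shots taken.
    Loops forever if nothing satisfies P."""
    shots = 0
    while True:
        shots += 1
        e = r[random.randrange(len(r))]
        if P(e):
            return e, shots
```

With n = 100 and k = 10 satisfying elements, the average shot count over many runs comes out near n/k = 10.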

2) "Turtle walk":

	auto r = ... /* elements of A */
	int nSatisfy = 0;
	ElementType!A currentChoice;
	bool found = false;
	foreach (e; r) {
		if (P(e)) {
			nSatisfy++; // must increment first: uniform(0, 0) is an empty interval
			// Replace the current choice with probability 1/nSatisfy.
			if (uniform(0, nSatisfy) == 0)
			{
				currentChoice = e;
				found = true;
			}
		}
	}
	if (found) return currentChoice;
	else ... /* no elements satisfy P(x) */

   Advantages: If there is any element at all in A that satisfies P(x),
   it will be found. The loop always terminates after n iterations.

   Disadvantages: Even if 99% of elements in A satisfy P(x), we still
   have to traverse the entire data set before we terminate. (We cannot
   terminate before that, otherwise the probability of elements
   satisfying P(x) being selected will not be even.)
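
   For reference, this loop is the classic "reservoir sampling" pattern,
   where the i-th satisfying element replaces the current choice with
   probability 1/i. The same thing as a Python sketch (illustrative
   only):

```python
import random

def reservoir_select(r, P):
    """Select a uniformly random element of r satisfying P, or None
    if no element satisfies P. Single pass, O(1) extra space."""
    choice = None
    n_satisfy = 0
    for e in r:
        if P(e):
            n_satisfy += 1
            # Replace the current choice with probability 1/n_satisfy.
            if random.randrange(n_satisfy) == 0:
                choice = e
    return choice
```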

3) Permutation walk:

	auto r = ... /* elements of A */
	foreach (i; iota(0 .. r.length).randomPermutation) {
		if (P(r[i])) return r[i];
	}
	/* no elements satisfy P(x) */

   Advantages: if an element that satisfies P(x) is found early, the
   loop will terminate before n iterations. This seems like the best of
   both worlds of (1) and (2), except:

   Disadvantages: AFAIK, generating a random permutation of indices from
   0 .. n requires at least O(n) time, so any advantage we may have had
   seems to be negated.

Is there an algorithm for *incrementally* generating a random permutation of indices? If there is, we could use that in (3) and thus achieve the best of both worlds: early termination if an element satisfying P(x) is found, and guaranteed termination after n iterations if no element satisfying P(x) exists.


T

-- 
The early bird gets the worm. Moral: ewww...
May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> 2) "Turtle walk":

I'd just like to point out that this algorithm doesn't satisfy

>such that elements that satisfy P(x) have equal probability of being chosen

as the first satisfying element found in A is always chosen initially, and each subsequent satisfying element then only replaces it with a decreasing probability P = 1/nSatisfy.
May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> Which elements of A satisfy P(x) is not known ahead of time, nor is the
> relative proportion of elements that satisfy P(x) or not.

O(N) given P(x)===false for all x...

What you want is probably average case analysis, not worst case?

May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> I've been thinking about the following problem, and thought I'd pick the brains of the bright people around these parts...
>
> Given a set A of n elements (let's say it's a random-access range of
> size n, where n is relatively large), and a predicate P(x) that
> specifies some subset of A of elements that we're interested in, what's
> the best algorithm (in terms of big-O time complexity) for selecting a
> random element x satisfying P(x), such that elements that satisfy P(x)
> have equal probability of being chosen? (Elements that do not satisfy
> P(x) are disregarded.)
>
> Which elements of A satisfy P(x) is not known ahead of time, nor is the
> relative proportion of elements that satisfy P(x) or not.
>
> Note that A is considered to be immutable (swapping elements or
> otherwise changing A is not allowed).
>
> So far, I have come up with the following algorithms:
>
> 1) "Random shooting":
>
> 	auto r = ... /* elements of A */
> 	for (;;) {
> 		auto i = uniform(0, r.length);
> 		if (P(r[i])) return r[i];
> 	}
>
>    Advantages: If a large percentage of elements in A satisfy P(x), then
>    the loop will terminate within a small number of iterations; if the
>    majority of elements satisfy P(x), this will terminate well below n
>    iterations.
>
>    Disadvantages: If only a small percentage of elements satisfy P(x),
>    then this loop could take arbitrarily long to terminate, and it will
>    not terminate at all if no elements satisfy P(x).

You can recover O(n) calls to P by caching the results. You can recover O(n) for both P and rng by progressively removing elements from r or any other method that results in an easily samplable set of all elements of r not yet visited (and therefore possible candidates).

I like this one (warning, just thought of, untested):

auto r = ... /* elements of A */
auto nRemaining = r.length;
while (nRemaining) {
	auto i = uniform(0, nRemaining);
	if (P(r[i])) return r[i];
	else swap(r[i], r[--nRemaining]);
}

> 3) Permutation walk:
>
> 	auto r = ... /* elements of A */
> 	foreach (i; iota(0 .. r.length).randomPermutation) {
> 		if (P(r[i])) return r[i];
> 	}
> 	/* no elements satisfy P(x) */
>
>    Advantages: if an element that satisfies P(x) is found early, the
>    loop will terminate before n iterations. This seems like the best of
>    both worlds of (1) and (2), except:
>
>    Disadvantages: AFAIK, generating a random permutation of indices from
>    0 .. n requires at least O(n) time, so any advantage we may have had
>    seems to be negated.
>
> Is there an algorithm for *incrementally* generating a random permutation of indices? If there is, we could use that in (3) and thus achieve the best of both worlds between early termination if an element satisfying P(x) is found, and guaranteeing termination after n iterations if no elements satisfying P(x) exists.
>
>
> T

Yes. As a matter of fact it would be the logical extension of my algorithm above for when you can't or don't want to change r (again, only just thought of this, untested...):

auto r = ... /* elements of A */
auto indices = iota(r.length).array;
auto nRemaining = r.length;
while (nRemaining) {
	auto i = uniform(0, nRemaining);
	if (P(r[indices[i]])) return r[indices[i]];
	else swap(indices[i], indices[--nRemaining]);
}

You could also do the same thing but with an array of pointers to elements of r instead of indices.
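
If even the O(n) initialization of `indices` is unwanted, the swaps can be kept in a hash table instead (a lazy Fisher-Yates). A Python sketch of that variant (again, only just thought through, treat as a sketch):

```python
import random

def select_lazy(r, P):
    """Uniformly pick an element of r satisfying P without mutating r
    and without materializing the index array up front.

    The index-swapping loop above, with swaps kept in a hash table,
    so k iterations cost O(k) expected time and space even for huge r.
    Returns None if nothing satisfies P."""
    displaced = {}  # slot -> index swapped into that slot
    n_remaining = len(r)
    while n_remaining:
        i = random.randrange(n_remaining)
        idx = displaced.get(i, i)  # identity unless a swap landed here
        if P(r[idx]):
            return r[idx]
        # Retire idx: move the index at the end of the candidate
        # prefix into slot i, shrinking the prefix by one.
        n_remaining -= 1
        displaced[i] = displaced.get(n_remaining, n_remaining)
    return None
```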
May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> Is there an algorithm for *incrementally* generating a random permutation of indices?

Does it exist? Yes, because you can build permutation tables for it, but you'll have to find the closed form for it that is efficient... Can you do that without selecting a specific N? It is easy for N=2 and N=3 at least.

E.g.

For array length 2 you get the 2 permutations: 0 1, 1 0
Then you just select one of them at random (you know the number of permutations)

So if it exists, you should probably do a search along the lines of "How do I efficiently generate permutation number N".

Of course for smaller N you could generate a big array of bytes and compress it by having arrays of slices into it (as many permutations will share index sequences).
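
For what it's worth, the usual search term is "permutation unranking": the factorial number system (Lehmer code) decodes permutation number N directly, in lexicographic order. A Python sketch:

```python
import math

def unrank_permutation(rank, n):
    """Return permutation number `rank` (0 <= rank < n!) of 0..n-1,
    decoded digit by digit via the factorial number system (Lehmer
    code); permutations come out in lexicographic order."""
    elems = list(range(n))
    perm = []
    for i in range(n, 0, -1):
        f = math.factorial(i - 1)
        perm.append(elems.pop(rank // f))  # pick the (rank // f)-th remaining element
        rank %= f
    return perm
```

For n = 2 this yields exactly the two permutations above: rank 0 gives `[0, 1]`, rank 1 gives `[1, 0]`.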

May 01, 2017
On Monday, 1 May 2017 at 08:42:05 UTC, John Colvin wrote:
> I like this one (warning, just thought of, untested):
>
> auto r = ... /* elements of A */
> auto nRemaining = r.length;
> while (nRemaining) {
> 	auto i = uniform(0, nRemaining);
> 	if (P(r[i])) return r[i];
> 	else swap(r[i], r[--nRemaining]);
> }


Yes, this is the standard textbook way of doing it, still O(N) of course. The other standard textbook way is to generate an array of indices and mutate that instead, still O(N). If you cache in a heap you get O(N log N).

Anyway, this kind of worst case analysis doesn't really help. And neither does average case, which will be O(1) assuming 50% match P.

So what you need is to specify what kind of average-case analysis you want, for instance expressed as f(N, S), where N is the number of elements and S is the number of elements that satisfy P.


May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
>
> Given a set A of n elements (let's say it's a random-access range of
> size n, where n is relatively large), and a predicate P(x) that
> specifies some subset of A of elements that we're interested in, what's
> the best algorithm (in terms of big-O time complexity) for selecting a
> random element x satisfying P(x), such that elements that satisfy P(x)
> have equal probability of being chosen? (Elements that do not satisfy
> P(x) are disregarded.)
>
> Which elements of A satisfy P(x) is not known ahead of time, nor is the
> relative proportion of elements that satisfy P(x) or not.
>

Works only with a random access range and I haven't done the full analysis (and I'm a bit rusty on probability), but my first thought was random divide and conquer:

ElementType!A* select(A,P)(A r)
{
  // Recursion anchor
  if (r.length == 1) {
    if (P(r[0])) return r[0];
    else return null;
  // Recurse randomly with p=0.5 into either the left, or right half of r
  } else {
    ElementType!A* e;
    ElementType!A[][2] half;
    half[0] = r[0..floor(r.length/2)];
    half[1] = r[ceil(r.length/2)..$];

    ubyte coinFlip = uniform(0,1) > 0;
    // Recurse in one half and terminate if e is found there
    e = select(half[coinFlip]);
    if (e) return e;
    // Otherwise, recurse into other half
    return select(half[1 - coinFlip]);
  }
}

As stated above, I haven't done the full analysis, but intuitively speaking (which can be wrong, of course), the p=0.5 recursion ought to satisfy the condition of elements satisfying P(x) being chosen uniformly; also intuitively, I'd expect the expected runtime for a uniform distribution of elements satisfying P(x) to be around O(log N).
Worst-case would be that it has to inspect every element in r once => O(N)
May 01, 2017
On 01.05.2017 11:51, Moritz Maxeiner wrote:
> On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
>>
>> Given a set A of n elements (let's say it's a random-access range of
>> size n, where n is relatively large), and a predicate P(x) that
>> specifies some subset of A of elements that we're interested in, what's
>> the best algorithm (in terms of big-O time complexity) for selecting a
>> random element x satisfying P(x), such that elements that satisfy P(x)
>> have equal probability of being chosen? (Elements that do not satisfy
>> P(x) are disregarded.)
>>
>> Which elements of A satisfy P(x) is not known ahead of time, nor is the
>> relative proportion of elements that satisfy P(x) or not.
>>
>
> Works only with a random access range and I haven't done the full
> analysis (and I'm a bit rusty on probability), but my first thought was
> random divide and conquer:
>
> ElementType!A* select(A,P)(A r)
> {
>   // Recursion anchor
>   if (r.length == 1) {
>     if (P(r[0])) return r[0];
>     else return null;
>   // Recurse randomly with p=0.5 into either the left, or right half of r
>   } else {
>     ElementType!A* e;
>     ElementType!A[][2] half;
>     half[0] = r[0..floor(r.length/2)];
>     half[1] = r[ceil(r.length/2)..$];
>
>     ubyte coinFlip = uniform(0,1) > 0;

This deterministically chooses 0. (uniform is right-exclusive.) I assume you mean uniform(0,2).

>     // Recurse in one half and terminate if e is found there
>     e = select(half[coinFlip]);
>     if (e) return e;
>     // Otherwise, recurse into other half
>     return select(half[1 - coinFlip]);
>   }
> }
>
> As stated above, I haven't done the full analysis, but intuitively
> speaking (which can be wrong, of course), the p=0.5 recursion ought
> satisfy the condition of elements satisfying P(x) being chosen
> uniformly;

Counterexample: [1,0,1,1].

The first element is chosen with probability 1/2, which is not 1/3.
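
(This can be checked mechanically with exact arithmetic; a Python sketch that computes the selection distribution of the coin-flip recursion:)

```python
from fractions import Fraction

def selection_probs(bits):
    """Exact probability that each index is returned by the p=0.5
    divide-and-conquer selection, run on a 0/1 array `bits`."""
    def probs(lo, hi):
        # Returns (distribution over indices, probability of null).
        if hi - lo == 1:
            if bits[lo]:
                return {lo: Fraction(1)}, Fraction(0)
            return {}, Fraction(1)
        mid = (lo + hi) // 2
        dl, nl = probs(lo, mid)
        dr, nr = probs(mid, hi)
        dist = {}
        # With prob 1/2 recurse into one half first; fall through to
        # the other half only if the first returns null.
        for first, second, pnull in ((dl, dr, nl), (dr, dl, nr)):
            for i, p in first.items():
                dist[i] = dist.get(i, Fraction(0)) + p / 2
            for i, p in second.items():
                dist[i] = dist.get(i, Fraction(0)) + pnull * p / 2
        return dist, nl * nr
    return probs(0, len(bits))[0]
```

On [1,0,1,1] this gives 1/2 for the first element and 1/4 each for the last two, confirming the non-uniformity; on [1,1,1,1] it is uniform.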

> also intuitively, I'd expect the expected runtime for a
> uniform distribution of elements satisfying P(x)  to be around O(log N).

I don't understand this input model.

> Worst-case would be that it has to inspect every element in r once => O(N)


May 01, 2017
On Monday, 1 May 2017 at 09:58:39 UTC, Timon Gehr wrote:
> On 01.05.2017 11:51, Moritz Maxeiner wrote:
>>     [...]
>
> This deterministically chooses 0. (uniform is right-exclusive.) I assume you mean uniform(0,2).

Yes.

>
>> [...]
>
> Counterexample: [1,0,1,1].
>
> The first element is chosen with probability 1/2, which is not 1/3.

Dang. Serves me right for not doing the full analysis. Thanks for the counter example and sorry for wasting your time.

>
>> [...]
>
> I don't understand this input model.
>

Since the idea was nonsense, it doesn't matter, anyway.
May 01, 2017
On Monday, 1 May 2017 at 09:01:33 UTC, Ola Fosheim Grøstad wrote:
> array of indexes and mutate that instead, still O(N). If you cache in a heap you get O(N log N).

To avoid confusion: I didn't mean a literal "heap", but some kind of search structure that tracks and draws from unused index ranges in a way that does not require O(N) initialization and can be implemented as a single array on the stack. I don't remember what it is called, but you draw from each node based on the number of children (kind of like the structures used in Markov-chain type situations). I assume this can be maintained in O(log N) per iteration with O(1) initialization.

E.g.

1. push_indices(0,N-1) //data (0,N-1)
2. i = draw_and_remove_random_index() // data could now be (0,i-1), (i+1,N-1)
3. if( i == -1 ) return
4. dostuff(i)
5. goto 2
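
A minimal sketch of such a structure in Python (plain interval list, so a draw costs O(#intervals) rather than the O(log N) a count-annotated tree would give; names are made up):

```python
import random

class IntervalDrawer:
    """Draw distinct random indices from 0..n-1 with O(1) initialization.

    Unused indices are kept as disjoint half-open intervals; each draw
    removes one index, possibly splitting an interval in two."""
    def __init__(self, n):
        self.intervals = [(0, n)] if n > 0 else []
        self.remaining = n

    def draw(self):
        if self.remaining == 0:
            return -1  # exhausted
        k = random.randrange(self.remaining)  # pick the k-th unused index
        for pos, (lo, hi) in enumerate(self.intervals):
            size = hi - lo
            if k < size:
                i = lo + k
                # Replace (lo, hi) with the non-empty pieces around i.
                pieces = [p for p in ((lo, i), (i + 1, hi)) if p[0] < p[1]]
                self.intervals[pos:pos + 1] = pieces
                self.remaining -= 1
                return i
            k -= size
```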


Even though it is O(N log N) it could outperform other solutions if you only want to draw a small number of elements. And a small array with a linear search (tracking used indices) would outperform that if you only want to draw say 8 elements. If you want to draw close to N elements then you would be better off just randomizing a full array (with elements from 0 to N-1). Or you could do all of these, start with linear search, then convert to a tree like search, then convert to full enumeration.

In general I don't think asymptotic analysis will do much good. I think you need to look at actual costs for typical N, P, and different distributions of satisfied P, e.g.

cost of loop iteration: 10
average cost of P(x)=>false: 1
average cost of P(x)=>true: 1000
probability for P(a[i]) for each i
