April 30, 2017
On Mon, May 01, 2017 at 05:03:17AM +0000, Jack Stouffer via Digitalmars-d wrote:
> On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> > 2) "Turtle walk":
> 
> I'd just like to point out that this algorithm doesn't satisfy
> 
> > such that elements that satisfy P(x) have equal probability of being
> > chosen
> 
> as the first element found in A will be chosen, and then each subsequent element has a decreasing probability, P = 1/nSatisfy, of replacing the first element.

Actually, this is exactly why it *does* satisfy the requirement.

If there is exactly 1 element that satisfies P(x), then the probability
of it being picked is 1.

If there are 2 elements, the first element is picked with probability 1 initially, but the 2nd element has a 1/2 chance of replacing it, meaning the overall probability distribution is 1/2 and 1/2.

If there are 3 elements, then by the time the 2nd element is processed, the first 2 elements are equally likely to be currently chosen (with probability 1/2 for each); when the 3rd element is found, it has 1/3 chance of replacing the previous choice, meaning there is a 2/3 chance the previous choice prevails. In that case, the 2/3 chance is equally distributed between the first two elements (they were picked with 1/2 and 1/2 probability), so the effective probability is 1/3 each after the 3rd element is processed.

In general, by the time you process the n'th element, each of the
previous elements would have been picked with 1/(n-1) chance, and the
n'th element is picked with 1/n chance, meaning there is an (n-1)/n
chance that the previous choice prevails. That (n-1)/n chance is
equally distributed among the previous (n-1) elements, so effectively
each is picked with 1/n chance.

The key here is that the probability that the n'th element is *not* picked is exactly (n-1)/n, which, by inductive hypothesis, must be evenly distributed among the previous (n-1) candidates, thereby making their *eventual* probability of remaining as the chosen element exactly 1/n.
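
In code, the whole turtle walk looks something like this (an untested
sketch; pickSatisfying and nSatisfy are illustrative names, not from
any library):

    import std.random : uniform;

    /* Returns a pointer to a uniformly-chosen element of r
     * satisfying P, or null if no element satisfies P. */
    T* pickSatisfying(alias P, T)(T[] r)
    {
        T* chosen = null;
        size_t nSatisfy = 0;
        foreach (ref x; r)
        {
            if (!P(x)) continue;
            ++nSatisfy;
            /* The n'th satisfying element replaces the
             * current choice with probability 1/n. */
            if (uniform(0, nSatisfy) == 0)
                chosen = &x;
        }
        return chosen;
    }

One pass over A, O(1) space, and by the induction above each satisfying
element ends up chosen with probability exactly 1/nSatisfy.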


T

-- 
Two wrongs don't make a right; but three rights do make a left...
May 01, 2017
On Mon, May 01, 2017 at 08:42:05AM +0000, John Colvin via Digitalmars-d wrote: [...]
> You can recover O(n) calls to P by caching the results.

Sorry, I forgot to specify that P(x) can be different each time.

Also, the input A may be different each time, though it remains the same length. (The expectation is that it will be mostly the same -- not that it matters; I don't think any of the proposed solutions here count on that.)


> You can recover O(n) for both P and rng by progressively removing elements from r or any other method that results in an easily samplable set of all elements of r not yet visited (and therefore possible candidates).
[...]

Yes, I realize that if r was mutable, there are existing simple solutions. The problem is that it's not mutable, or rather, should not be mutated by this selection algorithm.


[...]
> Yes. As a matter of fact it would be the logical extension of my algorithm above for when you can't or don't want to change r (again, only just thought of this, untested...):
> 
> auto r = ... /* elements of A */
> auto indices = iota(r.length).array;
> auto nRemaining = r.length;
> while (nRemaining) {
> 	auto i = uniform(0, nRemaining);
> 	if (P(r[indices[i]])) return r[indices[i]];
> 	else swap(indices[i], indices[--nRemaining]);
> }
> 
> You could also do the same thing but with an array of pointers to elements of r instead of indices.

The trouble with this is that you have to allocate an array of indices, and r.length is rather large, so creating this array is O(n) each time. But I think Ivan Kazmenko has an excellent idea: cache this index array. Since, by symmetry, it doesn't matter whether the starting indices array is the identity permutation or some other permutation, we could initialize the array just once, and on subsequent calls skip directly to the loop, reusing the previously (possibly partially) permuted indices as the starting point. It does require O(n) memory, which is unfortunate, but at least we only pay for that once.
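
Something like this, I suppose (an untested sketch; CachedSelector is
just an illustrative name):

    import std.algorithm : swap;
    import std.array : array;
    import std.random : uniform;
    import std.range : iota;

    struct CachedSelector
    {
        size_t[] indices;   // allocated once, reused across calls

        this(size_t n) { indices = iota(n).array; }

        /* One query; P and r may differ between calls, as
         * long as r.length == indices.length. */
        T* select(alias P, T)(T[] r)
        {
            auto nRemaining = indices.length;
            while (nRemaining)
            {
                auto i = uniform(0, nRemaining);
                auto idx = indices[i];
                if (P(r[idx])) return &r[idx];
                // Exclude the failing index from this query.
                swap(indices[i], indices[--nRemaining]);
            }
            return null;    // nothing satisfies P
        }
    }

Each call leaves indices in a different permutation than it found it,
but by the symmetry argument that's a perfectly good starting point for
the next call.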

I was hoping for something more along the lines of a random permutation algorithm with sublinear (if not O(1)) space complexity that can return elements of the permutation consecutively on-demand, like a lazy range. Something along the lines of:

	auto seed = ... /* get random number */
	auto lazyRange = generatePermutation(seed, n);

where generatePermutation is some incremental algorithm that produces a unique permutation of 0 .. n per seed value. Ideally, this algorithm would only do as much work as is necessary to produce the first k elements of the permutation, so that if the main loop terminates early (we found an element satisfying P(x)) we don't waste time generating the rest of the permutation.
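
For the record, one construction that would fit this bill -- though it
wasn't brought up in this thread, so the sketch below and all its names
are purely illustrative -- is a format-preserving-encryption style
permutation: run a few Feistel rounds over the smallest power-of-two
domain covering 0 .. n, and "cycle-walk" past any output >= n. A fixed
seed then yields a repeatable permutation whose elements can be
computed on demand in O(1) space:

    /* Hypothetical O(1)-space pseudorandom permutation of
     * 0 .. n.  Statistical quality only, not cryptographic;
     * assumes n < 2^62. */
    struct FeistelPerm
    {
        ulong n;
        uint halfBits;
        ulong[4] keys;  // round keys derived from the seed

        this(ulong seed, ulong n)
        {
            this.n = n;
            halfBits = 1;
            while ((1UL << (2 * halfBits)) < n)
                ++halfBits;
            foreach (i, ref k; keys)
                k = (seed + i + 1) * 0x9E3779B97F4A7C15UL;
        }

        private ulong f(ulong x, ulong key)
        {
            x = (x ^ key) * 0xD6E8FEB86659FD93UL;
            x ^= x >> 32;
            return x & ((1UL << halfBits) - 1);
        }

        /* The i'th element of the permutation, on demand. */
        ulong opIndex(ulong i)
        {
            ulong x = i;
            do  // cycle-walk until we land inside 0 .. n
            {
                immutable mask = (1UL << halfBits) - 1;
                ulong l = x >> halfBits, r = x & mask;
                foreach (k; keys)
                {
                    immutable t = r;
                    r = l ^ f(r, k);
                    l = t;
                }
                x = (l << halfBits) | r;
            } while (x >= n);
            return x;
        }
    }

Since the Feistel network is a bijection on its power-of-two domain and
that domain is at most 4n, cycle-walking remains injective and costs
only a few re-encryptions per element on average.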


T

-- 
"No, John.  I want formats that are actually useful, rather than over-featured megaliths that address all questions by piling on ridiculous internal links in forms which are hideously over-complex." -- Simon St. Laurent on xml-dev
May 01, 2017
On Mon, May 01, 2017 at 11:32:19AM +0000, Mike B Johnson via Digitalmars-d wrote: [...]
> Since most real world problems would require selecting elements more
> than once it may be far more efficient to sort by P(x)(filter) then
> simply select a random element.
> 
> e.g.,
> 
> a = A.filter(x->P(x))  // Creates a new set a where only the elements
> of A that satisfy P(x) are added
> 
> ...
> 
> e = a.random;
> 
> 
> this is O(1) if you only have to filter once (you can create a
> container that always "sorts" on P(x), so to speak... like a sorted
> dictionary).
[...]

Sorry, I forgot to specify that P(x) may change each time, as may the input A. So caching would be of little help in this case.


T

-- 
This is a tpyo.
May 01, 2017
On Mon, May 01, 2017 at 12:31:48PM +0000, Andrea Fontana via Digitalmars-d wrote:
> On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> > 1) "Random shooting":
> > 
> > 	auto r = ... /* elements of A */
> > 	for (;;) {
> > 		auto i = uniform(0, r.length);
> > 		if (P(r[i])) return r[i];
> > 	}
> > 
> >    Advantages: If a large percentage of elements in A satisfy P(x), then
> >    the loop will terminate within a small number of iterations; if the
> >    majority of elements satisfy P(x), this will terminate well below n
> >    iterations.
> > 
> >    Disadvantages: If only a small percentage of elements satisfy P(x),
> >    then this loop could take arbitrarily long to terminate, and it will
> >    not terminate at all if no elements satisfy P(x).
> > 
> 
> If you find an element that doesn't satisfy P(x) move it on top of array
> (swap!) and then try again with uniform(1, r.length - 1) and so on.
> 
> It terminates in this way.
[...]

The problem is that swapping elements is not allowed for this particular problem.  If swapping were allowed, this would be a much easier problem to solve. :-)


T

-- 
Chance favours the prepared mind. -- Louis Pasteur
May 01, 2017
On Mon, May 01, 2017 at 02:38:09PM +0000, Ivan Kazmenko via Digitalmars-d wrote:
> On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> > Given a set A of n elements (let's say it's a random-access range of
> > size n, where n is relatively large), and a predicate P(x) that
> > specifies some subset of A of elements that we're interested in,
> > what's the best algorithm (in terms of big-O time complexity) for
> > selecting a random element x satisfying P(x), such that elements
> > that satisfy P(x) have equal probability of being chosen? (Elements
> > that do not satisfy P(x) are disregarded.)
> 
> I'd like to note here that, if you make use of the same P(x) many
> times (instead of different predicates on each call), it makes sense
> to spend O(n) time and memory filtering by that predicate and storing
> the result, and then answer each query in O(1).

Unfortunately, P(x) could be different each time (sorry, should have stated this explicitly), and A may also change in between calls to this random selection algorithm.  So filtering would not be a good approach.


[...]
> There are actually two variations of Fisher-Yates shuffle: (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)
> 
> 1.
>     auto p = n.iota.array;
>     foreach (pos; 0..n) {
>         auto otherPos = uniform (0, pos + 1);
>         swap (p[pos], p[otherPos]);
>     }
> 
> When we look at this after the k-th iteration, the first k elements are randomly and uniformly permuted, and the rest (n-k) are left untouched.
> 
> 2.
>     auto p = n.iota.array;
>     foreach (pos; 0..n) {
>         auto otherPos = uniform (pos, n);
>         swap (p[pos], p[otherPos]);
>     }
> 
> When we look at this after the k-th iteration, the first k elements are a random combination of all n elements, and this combination is randomly and uniformly permuted.  So, the second variation is what we need: each new element is randomly and uniformly selected from all the elements left.  Once we get the first element satisfying the predicate, we can just terminate the loop.  If there are m out of n elements satisfying the predicate, the average number of steps is n/m.

But wouldn't creating the array p already be O(n)? Albeit somewhat faster than n iterations of the main loop, since copying iota ought to be cheaper than generating a random number and swapping two elements.


> Now, the problem is that both of these allocate n "size_t"-s of memory to start with.  And your problem does not allow to shuffle the elements of the original array in place, so we do need an external permutation for these algorithms.  However, there are at least two ways to mitigate that:
> 
> (I)
> We can allocate the permutation once using n time and memory, and
> then, on every call, just reuse it in its current state in n/m time.
> It does not matter if the permutation is not identity permutation: by
> symmetry argument, any starting permutation will do just fine.

I like this idea very much. This lets us pay only once for creating the index array, and reuse it almost "for free" thereafter.

Still, the O(n) space requirement could potentially be onerous, since n is expected to be rather large.


> (II)
> We can store the permutation p in an associative array instead of a
> regular array, actually storing only the elements accessed at least
> once, and assuming other elements to satisfy the identity p[x] = x.
> So, if we finish in n/m steps on average, the time and extra memory
> used will be O(n/m) too.  I can put together an example implementation
> if this best satisfies your requirements.
[...]

This is interesting, and relaxes the O(n) space requirement initially.
After enough queries, though, I'd expect the AA to eventually grow to n
elements (there is a chance some queries will find no elements that
satisfy P(x), and those queries would generate the entire permutation).
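
Something along these lines, I imagine (an untested sketch of your
idea (II); SparsePerm is just an illustrative name):

    import std.random : uniform;

    /* Lazy Fisher-Yates (variation 2) over 0 .. n: only the
     * positions actually touched are stored in the AA; a
     * missing key i implicitly means p[i] == i. */
    struct SparsePerm
    {
        size_t[size_t] p;
        size_t n;

        this(size_t n) { this.n = n; }

        private size_t get(size_t pos)
        {
            if (auto v = pos in p)
                return *v;
            return pos;     // identity by default
        }

        /* The pos'th element of a uniform random permutation;
         * call with pos = 0, 1, 2, ... in order. */
        size_t next(size_t pos)
        {
            auto otherPos = uniform(pos, n);
            auto a = get(pos), b = get(otherPos);
            p[pos] = b;
            p[otherPos] = a;
            return b;
        }
    }

A query then just walks pos = 0, 1, 2, ... and returns r[perm.next(pos)]
for the first index satisfying P(x), so both the time and the AA's size
stay at O(n/m) on average.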

But I like your idea of reusing the index array.  It's definitely a possibility!


T

-- 
Your inconsistency is the only consistent thing about you! -- KD