May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> Given a set A of n elements (let's say it's a random-access range of
> size n, where n is relatively large), and a predicate P(x) that
> specifies some subset of A of elements that we're interested in, what's
> the best algorithm (in terms of big-O time complexity) for selecting a
> random element x satisfying P(x), such that elements that satisfy P(x)
> have equal probability of being chosen? (Elements that do not satisfy
> P(x) are disregarded.)

I'd like to note here that, if you make use of the same P(x) many times (instead of different predicates on each call), it makes sense to spend O(n) time and memory filtering by that predicate and storing the result, and then answer each query in O(1).
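For illustration, a minimal Python sketch of that precompute-then-query approach (the helper name `make_sampler` is invented for this example):

```python
import random

def make_sampler(elements, pred):
    """Spend O(n) time and memory filtering by pred once;
    afterwards each query is answered in O(1)."""
    matching = [x for x in elements if pred(x)]
    if not matching:
        return None  # no element satisfies pred
    return lambda: random.choice(matching)  # uniform over matches, O(1)

# usage: repeatedly pick a uniformly random even number below 100
sample_even = make_sampler(range(100), lambda x: x % 2 == 0)
```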

> 3) Permutation walk:
>
> 	auto r = ... /* elements of A */
> 	foreach (i; iota(0 .. r.length).randomPermutation) {
> 		if (P(r[i])) return r[i];
> 	}
> 	/* no elements satisfy P(x) */
>
>    Advantages: if an element that satisfies P(x) is found early, the
>    loop will terminate before n iterations. This seems like the best of
>    both worlds of (1) and (2), except:
>
>    Disadvantages: AFAIK, generating a random permutation of indices from
>    0 .. n requires at least O(n) time, so any advantage we may have had
>    seems to be negated.
>
> Is there an algorithm for *incrementally* generating a random permutation of indices? If there is, we could use that in (3) and thus achieve the best of both worlds between early termination if an element satisfying P(x) is found, and guaranteeing termination after n iterations if no elements satisfying P(x) exists.

Yes, there is.

There are actually two variations of Fisher-Yates shuffle:
(https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)

1.
    auto p = n.iota.array;
    foreach (pos; 0..n) {
        auto otherPos = uniform (0, pos + 1);
        swap (p[pos], p[otherPos]);
    }

When we look at this after the k-th iteration, the first k elements are randomly and uniformly permuted, and the remaining (n-k) are left untouched.

2.
    auto p = n.iota.array;
    foreach (pos; 0..n) {
        auto otherPos = uniform (pos, n);
        swap (p[pos], p[otherPos]);
    }

When we look at this after the k-th iteration, the first k elements are a uniformly random combination of all n elements, and that combination is itself uniformly randomly permuted.  So, the second variation is what we need: each new element is selected uniformly at random from all the elements left.  Once we get the first element satisfying the predicate, we can just terminate the loop.  If m out of n elements satisfy the predicate, the average number of steps is n/m.
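As a sketch of how the second variation terminates early, here is a Python rendering (the function name is invented for this example):

```python
import random

def random_satisfying(elements, pred):
    """Build a random permutation incrementally (second Fisher-Yates
    variation) and stop at the first element satisfying pred.
    With m matches out of n elements, expect about n/m iterations."""
    n = len(elements)
    p = list(range(n))
    for pos in range(n):
        other = random.randrange(pos, n)     # uniform in [pos, n)
        p[pos], p[other] = p[other], p[pos]  # fixes the pos-th element
        if pred(elements[p[pos]]):
            return elements[p[pos]]
    return None  # no element satisfies pred
```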

Now, the problem is that both of these allocate n "size_t"-s of memory to start with.  And your problem does not allow shuffling the elements of the original array in place, so we do need an external permutation for these algorithms.  However, there are at least two ways to mitigate that:

(I)
We can allocate the permutation once using n time and memory, and then, on every call, just reuse it in its current state in n/m time.  It does not matter that the permutation is no longer the identity permutation: by a symmetry argument, any starting permutation will do just fine.

(II)
We can store the permutation p in an associative array instead of a regular array, actually storing only the elements accessed at least once, and assuming other elements to satisfy the identity p[x] = x.  So, if we finish in n/m steps on average, the time and extra memory used will be O(n/m) too.  I can put together an example implementation if this best satisfies your requirements.
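A rough Python sketch of this variation, with a dict playing the role of the associative array (entries never touched implicitly satisfy p[x] = x):

```python
import random

def random_satisfying_sparse(elements, pred):
    """Incremental Fisher-Yates over a sparse permutation: only entries
    touched by a swap are stored; p[x] defaults to x.  Time and extra
    memory are proportional to the number of steps taken."""
    n = len(elements)
    p = {}  # sparse permutation
    for pos in range(n):
        other = random.randrange(pos, n)
        # swap p[pos] and p[other], materializing entries on demand
        p[pos], p[other] = p.get(other, other), p.get(pos, pos)
        if pred(elements[p[pos]]):
            return elements[p[pos]]
    return None  # no element satisfies pred
```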

Ivan Kazmenko.

May 01, 2017
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
> Given a set A of n elements (let's say it's a random-access range of
> size n, where n is relatively large), and a predicate P(x) that
> specifies some subset of A of elements that we're interested in, what's
> the best algorithm (in terms of big-O time complexity) for selecting a
> random element x satisfying P(x), such that elements that satisfy P(x)
> have equal probability of being chosen? (Elements that do not satisfy
> P(x) are disregarded.)

Here's how I would do it:

// returns a random index of array satisfying P(x), -1 if not found
int randomlySatisfy(A[] array) {
   if (array.length == 0)
      return -1;
   int i = uniform(0, array.length);
   return randomlySatisfyImpl(array, i);
}

// recursive function
private int randomlySatisfyImpl(A[] da, int i) {
   if (P(da[i]))
      return i;
   if (da.length == 1)
      return -1;

   // choose a random partition proportionally
   auto j = uniform(0, da.length - 1);
   if (j < i) {
      // the lower partition
      int a = randomlySatisfyImpl(da[0..i], j);
      if (a != -1) return a;
      else return randomlySatisfyImpl(da[i+1 .. da.length], j - (i + 1));
   }
   else {
      // higher partition, investigate in reverse order
      int a = randomlySatisfyImpl(da[i+1 .. da.length], j - (i + 1));
      if (a != -1) return i + 1 + a;
      else return i + 1 + randomlySatisfyImpl(da[0..i], j);
   }
}

The goal is to have the first hit be the one you return. The method: if a random pick doesn't satisfy P, randomly choose the partition below or above it based on uniform(0 .. array.length - 1), and repeat the same procedure on that partition. The random index is reused to avoid calling uniform twice on each recursion (once to choose a partition and once to choose an index within that partition). If the probability of choosing a partition is precisely proportional to the number of elements in that partition, it seems to me that the result will be truly random, but I'm not sure.

May 01, 2017
On Monday, 1 May 2017 at 16:56:58 UTC, MysticZach wrote:
>    // choose a random partition proportionally
>    auto j = uniform(da.length - 1);
>    if (j < i) {
>       // the lower partition
>       int a = randomlySatisfyImpl(da[0..i], j);
>       if (a != -1) return a;
>       else return randomlySatisfyImpl(da[i+1 .. da.length], j - (i + 1));
>    }
>    else {
>       // higher partition, investigate in reverse order
>       int a = randomlySatisfyImpl(da[i+1 .. da.length], j - (i + 1));
>       if (a != -1) return i +1 + a;
>       else return i + 1 + randomlySatisfyImpl(da[0..i], j);

The line above has a bug. Replace it with:

       else {
          a = randomlySatisfyImpl(da[0..i], j);
          return (a == -1) ? -1 : i + 1 + a;
       }

>    }
> }

But the idea's the same. Hopefully it's clear.

May 01, 2017
On Monday, 1 May 2017 at 16:56:58 UTC, MysticZach wrote:
> The goal is to have the first hit be the one you return. The method: if a random pick doesn't satisfy, randomly choose the partition greater than or less than based on uniform(0..array.length-1), and do the same procedure on that partition, reusing the random index to avoid having to call uniform twice on each recursion (one to choose a partition and one to choose an index within that partition). If the probability of choosing a partition is precisely proportional to the number of elements in that partition, it seems to me that the result will be truly random, but I'm not sure.

Now I'm questioning this, because if the positive cases are unevenly distributed, e.g. [111111000100], the last one has about a 50% chance of getting picked instead of a 1 in 7 chance with my method. I guess you'd need to build a whole new array like the others are saying.
May 02, 2017
On Monday, 1 May 2017 at 21:54:43 UTC, MysticZach wrote:
> On Monday, 1 May 2017 at 16:56:58 UTC, MysticZach wrote:
>> The goal is to have the first hit be the one you return. The method: if a random pick doesn't satisfy, randomly choose the partition greater than or less than based on uniform(0..array.length-1), and do the same procedure on that partition, reusing the random index to avoid having to call uniform twice on each recursion (one to choose a partition and one to choose an index within that partition). If the probability of choosing a partition is precisely proportional to the number of elements in that partition, it seems to me that the result will be truly random, but I'm not sure.
>
> Now I'm questioning this, because if the positive cases are unevenly distributed, i.e., [111111000100], the last one has about 50% chance to get picked instead of a 1 in 7 chance with my method. I guess you'd need to build a whole new array like the others are saying.

A pity; it sounded nice!
But yeah, once we pick the right ~half, we have to completely traverse it before paying any attention to the left ~half.

I hope some part of the idea is still salvageable.
For example, what if we put the intervals in a queue instead of a stack?

Ivan Kazmenko.

May 02, 2017
On Tuesday, 2 May 2017 at 10:35:46 UTC, Ivan Kazmenko wrote:
> I hope some part of the idea is still salvageable.
> For example, what if we put the intervals in a queue instead of a stack?

I tried to implement a similar approach, but instead of a queue or a stack, I used a random-access array of intervals.

Sadly, it is still far from uniform, since it does not take interval lengths into account, and I don't see how to do that without at least O(log n) time per interval insertion or deletion.

Implementation and empiric frequencies for n=5 elements in a permutation: http://ideone.com/3zSxLN

Ivan Kazmenko.

May 02, 2017
On Tuesday, 2 May 2017 at 11:27:17 UTC, Ivan Kazmenko wrote:
> On Tuesday, 2 May 2017 at 10:35:46 UTC, Ivan Kazmenko wrote:
>> I hope some part of the idea is still salvageable.
>> For example, what if we put the intervals in a queue instead of a stack?
>
> I tried to implement a similar approach, but instead of a queue or a stack, I used a random-access array of intervals.
>
> Sadly, it is still far from uniform, since it does not take interval lengths into account, and I don't see how to do that without at least O(log n) time per interval insertion or deletion.
>
> Implementation and empiric frequencies for n=5 elements in a permutation: http://ideone.com/3zSxLN
>
> Ivan Kazmenko.

Well, I thought of another idea that may not be technically random, but which I imagine would get pretty close in real world use cases:

1. Test a random element in the array. If it satisfies P, return it.
2. Choose a hopper interval, h, that is relatively prime to the number of elements in the array, n. You could do this by randomly selecting from a pre-made list of primes of various orders of magnitude larger than n, with h = the prime % n.
3. Hop along the array, testing each element as you go. Increment a counter. If you reach the end of the array, cycle back to the beginning, starting with the remainder of h that you didn't use. I think that h being relatively prime to n means it will thereby hit every element in the array.
4. Return the first hit. If the counter reaches n, return failure.

The way I see it, with a random start position and a *mostly* random interval, the only way to defeat the randomness of the result would be if the elements that satisfy P were dispersed precisely according to the random interval specified for that particular invocation of the function, in which case the first element of the dispersed set would be chosen more often. This would require a rare convergence depending on both n and h, and it would not repeat from one call to the next.


May 02, 2017
On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
> 1. Test a random element in the array. If it satisfies P, return it.
> 2. Choose a hopper interval, h, that is relatively prime to the number of elements in the array, n. You could do this by randomly selecting from a pre-made list of primes of various orders of magnitude larger than n, with h = the prime % n.
> 3. Hop along the array, testing each element as you go. Increment a counter. If you reach the end of the array, cycle back to the beginning starting with the remainder of h that you didn't use. I think that h being relatively prime means it will thereby hit every element in the array.
> 4. Return the first hit. If the counter reaches n, return failure.

Taking this a step further, it occurred to me that you could use *any* hopping interval from 1 to array.length if the length of the array were prime. So artificially extend the array, and just skip a jump whenever you land in the extended area. And since you'd skip lots of jumps if the extension were any bigger than the hopping interval, reduce the extension's length modulo the hopping interval.

// returns a random index of array satisfying P(x), -1 if not found
int randomlySatisfy(A[] array) {
   auto length = array.length;
   if (!length)
      return -1;
   auto i = uniform(0, length);
   auto hop = uniform(1, length);

   // greatest prime < 2^^31, for simplicity
   enum prime = 2147483477;
   assert (length <= prime);

   auto skipLength = ((prime - length) % hop) + length;
   auto count = 0;

   for (;;) {
      if (P(array[i]))
         return i;
      ++count;
      if (count == length)
         return -1;
      i += hop;
      if (i < length)
         continue;
      if (i < skipLength)
         i += hop;
      i -= skipLength;
   }
   return -1;
}

This solution is stupidly simple and I haven't tested it. But I believe it's truly random, since the hopping interval is arbitrary. Who knows, it might work.

May 02, 2017
On 02.05.2017 22:42, MysticZach wrote:
> On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
>> 1. Test a random element in the array. If it satisfies P, return it.
>> 2. Choose a hopper interval, h, that is relatively prime to the number
>> of elements in the array, n. You could do this by randomly selecting
>> from a pre-made list of primes of various orders of magnitude larger
>> than n, with h = the prime % n.
>> 3. Hop along the array, testing each element as you go. Increment a
>> counter. If you reach the end of the array, cycle back to the
>> beginning starting with the remainder of h that you didn't use. I
>> think that h being relatively prime means it will thereby hit every
>> element in the array.
>> 4. Return the first hit. If the counter reaches n, return failure.
>
> Taking this a step further, it occurred to me that you could use *any*
> hopping interval from 1 to array.length, if the length of the array were
> prime. So artificially extend the array and just skip a jump when you
> land in the extended area. And since you'd skip lots of jumps if the
> extension were any bigger that the hopping interval, reduce its length
> to modulo the hopping interval.
>
> // returns a random index of array satisfying P(x), -1 if not found
> int randomlySatisfy(A[] array) {
>    auto length = array.length;
>    if (!length)
>       return -1;
>    auto i = uniform(0, length);
>    auto hop = uniform(1, length);
>
>    // greatest prime < 2^^31, for simplicity
>    enum prime = 2147483477;
>    assert (length <= prime);
>
>    auto skipLength = ((prime - length) % hop) + length;
>    auto count = 0;
>
>    for (;;) {
>       if (P(a[i]))
>          return i;
>       ++count;
>       if (count == length)
>          return -1;
>       i += hop;
>       if (i < length)
>          continue;
>       if (i < skipLength)
>          i += hop;

(I guess this should be 'while'.)

>       i -= skipLength;
>    }
>    return -1;
> }
>
> This solution is stupidly simple and I haven't tested it. But I believe
> it's truly random, since the hopping interval is arbitrary. Who knows,
> it might work.
>

Counterexample: [1,1,1,0,0]

Your algorithm returns 0 with probability 7/20, 1 with probability 6/20 and 2 with probability 7/20.

Note that there is a simple reason why your algorithm cannot work for this case: it picks one of 20 numbers at random. The resulting probability mass cannot be evenly distributed among the three elements, because 20 is not divisible by 3.
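The 7/20, 6/20, 7/20 figures can be verified by enumerating all 20 equally likely (start, hop) pairs. A Python transcription of the hop algorithm above (same prime constant):

```python
def first_hit(bits, start, hop, prime=2147483477):
    """One run of the hop-and-skip walk: return the index of the first
    element with bits[i] == 1, or -1 if none is found."""
    n = len(bits)
    skip = ((prime - n) % hop) + n
    i, count = start, 0
    while True:
        if bits[i]:
            return i
        count += 1
        if count == n:
            return -1
        i += hop
        if i < n:
            continue
        if i < skip:
            i += hop
        i -= skip

bits = [1, 1, 1, 0, 0]
counts = {0: 0, 1: 0, 2: 0}
for start in range(5):
    for hop in range(1, 5):  # all 20 equally likely (start, hop) pairs
        counts[first_hit(bits, start, hop)] += 1
# counts == {0: 7, 1: 6, 2: 7}: probabilities 7/20, 6/20, 7/20 as stated
```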
May 02, 2017
On Tuesday, 2 May 2017 at 21:00:36 UTC, Timon Gehr wrote:
> On 02.05.2017 22:42, MysticZach wrote:
>> On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
>>    for (;;) {
>>       if (P(a[i]))
>>          return i;
>>       ++count;
>>       if (count == length)
>>          return -1;
>>       i += hop;
>>       if (i < length)
>>          continue;
>>       if (i < skipLength)
>>          i += hop;
>
> (I guess this should be 'while'.)

skipLength is determined modulo hop, thus it can't be more than one hop away.

> Counterexample: [1,1,1,0,0]
>
> Your algorithm returns 0 with probability 7/20, 1 with probability 6/20 and 2 with probability 7/20.
>
> Note that there is a simple reason why your algorithm cannot work for this case: it picks one of 20 numbers at random. The resulting probability mass cannot be evenly distributed among the three elements, because 20 is not divisible by 3.

That's true. Two points, though: 1. If the range of error is within 1/(n*(n-1)), with array length n, it may be close enough for a given real-world use case. 2. n is known to be large, which makes 1/(n*(n-1)) even smaller. You'd probably have to trade accuracy for performance. I wonder how much less performant a truly accurate algorithm would be, and whether there's a way to make an algorithm of this style completely accurate.