June 03, 2013
On 06/03/2013 07:00 PM, Diggory wrote:
> Thanks for testing before dismissing completely :P The way it returns results can be improved a lot by pre-allocating a range of the necessary size, or by using a range passed in.

Surely. :-)

> The complexity is O(N²) where N is the number of samples out of M. The algorithm is something I just thought of, although I've possibly used it before, and I'm sure someone else has thought of it too. It's quite simple: it effectively says "pick a value from the range except for this value" recursively, but instead of dividing the range in two, it shifts the top part down first to make a contiguous range, and then shifts the results which are past the split up by one afterward.

It's an interesting little piece of invention.  I'll have to look around and see if there's anything similar documented.  I've not seen anything in the literature which has this particular property of O(N) random variates plus O(N²) calculation.

The fact that it's not sequential might have been a bit of an issue historically, which maybe explains why I haven't come across something like it.
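
To make the scheme concrete, here is a minimal sketch of the classic "draw from the compacted range, then shift past earlier picks" realisation of that idea. This is my reconstruction for illustration only (the function name and details are mine), not Diggory's actual code:

import std.random : uniform;

// Draw uniformly from the M-i values that remain, then re-expand the
// compacted range by shifting the draw past each earlier pick. O(i)
// work per draw gives O(N^2) total, consuming exactly N random variates.
uint[] sampleQuadratic(uint N, uint M)
{
    uint[] picked;  // kept in ascending order
    foreach (i; 0 .. N)
    {
        uint v = uniform(0, M - i);
        // Every earlier pick at or below v pushes v up by one.
        size_t pos = 0;
        while (pos < picked.length && v >= picked[pos])
        {
            ++v;
            ++pos;
        }
        // Insert so the picks stay sorted (O(i) per insert).
        picked = picked[0 .. pos] ~ v ~ picked[pos .. $];
    }
    return picked;  // the sample, in ascending order
}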

> I expect the phobos implementation to be something like O(M) in which case my
> method would be faster whenever N < sqrt(M) (with the optimisations)

No, the Phobos implementation is O(N) -- I wrote it :-)  See:
http://braingam.es/2012/07/sampling-d/

The API is Andrei's -- the original implementation used what Knuth refers to as Algorithm S [which is really simple and easy to implement and check, but is indeed O(M)]; I updated it to use Algorithm D (which is complicated :-).
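
For reference, Algorithm S is short enough to sketch from Knuth's description (the function name is mine, and this is not the actual Phobos code):

import std.random : uniform;

// Knuth's Algorithm S (selection sampling): scan every candidate once,
// accepting each with probability (still needed) / (still remaining).
// Simple and easy to verify, but O(M) regardless of the sample size.
uint[] algorithmS(uint N, uint M)
{
    uint[] result;
    uint needed = N;
    foreach (uint j; 0 .. M)
    {
        // Accept j with probability needed / (M - j).
        if (uniform(0.0, 1.0) * (M - j) < needed)
        {
            result ~= j;
            if (--needed == 0)
                break;
        }
    }
    return result;
}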

So, since randomSample takes kN time for some constant k, while yours takes about N², your algorithm will probably win where N < k.

I found that the sampling times were about equal at N = 50 with your algorithm as it's currently written.
June 03, 2013
On Monday, 3 June 2013 at 17:35:22 UTC, Joseph Rushton Wakeling wrote:
> [...]
>
> So, since randomSample takes kN time for some constant k, while yours takes about N², your algorithm will probably win where N < k.
>
> I found that the sampling times were about equal at N = 50 with your algorithm as it's currently written.

I'd guess that the heavy use of floating-point arithmetic to calculate the step sizes means the algorithm has a fairly large constant overhead, even though its complexity is lower.
June 03, 2013
On 06/03/2013 08:28 PM, Diggory wrote:
> I'd guess that the heavy use of floating-point arithmetic to calculate the step sizes means the algorithm has a fairly large constant overhead, even though its complexity is lower.

Yes, I agree.  There might be some optimizations that could be done there.
June 04, 2013
On Monday, 3 June 2013 at 21:24:50 UTC, Joseph Rushton Wakeling wrote:
> On 06/03/2013 08:28 PM, Diggory wrote:
>> I'd guess that the heavy use of floating-point arithmetic to calculate the step
>> sizes means the algorithm has a fairly large constant overhead, even though its
>> complexity is lower.
>
> Yes, I agree.  There might be some optimizations that could be done there.

Well, I've come up with this new algorithm:


import std.random : rndGen;

uint[] randomSample(uint N, uint M) {
    uint[] result = new uint[N];

    // Open-addressing hash table entry: key is an array position, and
    // index holds (value at that position)+1, with 0 meaning "empty".
    struct hashPair {
        uint key;
        uint index;
    }

    // At most N entries are ever inserted, so 4*N slots keeps the load
    // factor low; never allocate more slots than the population size.
    size_t tableSize = N*4;
    if (tableSize > M)
        tableSize = M;

    hashPair[] table = new hashPair[tableSize];

    for (uint i = 0; i < N; ++i) {
        // Pick a random position v in [i, M), as in a Fisher-Yates shuffle.
        uint v = (rndGen.front % (M-i))+i;
        uint newv = v;
        rndGen.popFront();

        size_t vhash = v%tableSize;

        // Linear probing: if position v was already swapped, take the
        // value it currently holds, and record that it now holds i.
        while (table[vhash].index) {
            if (table[vhash].key == v) {
                newv = table[vhash].index-1;
                table[vhash].index = i+1;
                goto done;
            }

            vhash = (vhash+1)%tableSize;
        }

        // First time v is touched: it still held its own value.
        table[vhash].key = v;
        table[vhash].index = i+1;

done:
        result[i] = newv;
    }

    return result;
}


It's O(N) rather than O(N²). Conceptually, it works by randomly shuffling the first N items of an array of size M, which takes O(N) time but would normally require O(M) storage. To avoid that, it uses a simple hash table to store only the items that have actually been swapped. Since only O(N) items are shuffled, there can be at most O(N) swaps, so the hash table can be O(N) in size while still offering constant-time look-up.
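
For instance, a minimal usage sketch (assuming the function above is in scope; the seeding is just illustrative):

import std.random : rndGen, unpredictableSeed;
import std.stdio : writeln;

void main()
{
    rndGen.seed(unpredictableSeed);
    // 10 indices from 0 .. 1_000_000, using only a 40-slot table.
    auto sample = randomSample(10, 1_000_000);
    writeln(sample);
}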
June 04, 2013
On Tuesday, 4 June 2013 at 08:30:58 UTC, Diggory wrote:
> [...]
>
> It's O(N) rather than O(N²). Conceptually, it works by randomly shuffling the first N items of an array of size M, which takes O(N) time but would normally require O(M) storage. To avoid that, it uses a simple hash table to store only the items that have actually been swapped. Since only O(N) items are shuffled, there can be at most O(N) swaps, so the hash table can be O(N) in size while still offering constant-time look-up.

OK, posted too soon! There's a small bug in this one: it could potentially return the same value twice. It can be fixed by looking up "i" in the hash table as well as "v".
June 04, 2013
Here's the fixed one:

import std.random : rndGen;

uint[] randomSample(uint N, uint M) {
    uint[] result = new uint[N];

    // Open-addressing hash table entry: key is an array position, and
    // index holds (value at that position)+1, with 0 meaning "empty".
    struct hashPair {
        uint key;
        uint index;
    }

    // At most N entries are ever inserted, so 4*N slots keeps the load
    // factor low; never allocate more slots than the population size.
    size_t tableSize = N*4;
    if (tableSize > M)
        tableSize = M;

    hashPair[] table = new hashPair[tableSize];

    for (uint i = 0; i < N; ++i) {
        // Pick a random position v in [i, M), as in a Fisher-Yates shuffle.
        uint v = (rndGen.front % (M-i))+i;
        uint newv = v, newi = i;
        rndGen.popFront();

        // Look up i as well: if position i was already swapped, it no
        // longer holds the value i, so fetch what it currently holds.
        size_t ihash = i%tableSize;
        while (table[ihash].index) {
            if (table[ihash].key == i) {
                newi = table[ihash].index-1;
                break;
            }
            ihash = (ihash+1)%tableSize;
        }

        size_t vhash = v%tableSize;

        // Linear probing: if position v was already swapped, take the
        // value it currently holds, and record that it now holds newi.
        while (table[vhash].index) {
            if (table[vhash].key == v) {
                newv = table[vhash].index-1;
                table[vhash].index = newi+1;
                goto done;
            }

            vhash = (vhash+1)%tableSize;
        }

        // First time v is touched: it still held its own value.
        table[vhash].key = v;
        table[vhash].index = newi+1;

done:
        result[i] = newv;
    }

    return result;
}


There are still a few places to optimise, and I think compiler optimisation should give a decent speed-up as well. I'd be interested to see how it compares in your benchmark!
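
For anyone who wants to reproduce the comparison, a minimal harness might look like the following. This is only a sketch: the population size, sample size, and repetition count are arbitrary choices, std.datetime's StopWatch is the timing API of the era, and std.random.randomSample is renamed on import so it doesn't clash with the function above (assumed to be in the same module):

import std.array : array;
import std.datetime : StopWatch;
import std.random : phobosSample = randomSample;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    enum M = 10_000_000;  // population size (arbitrary)
    enum N = 50;          // sample size (around the observed crossover)
    enum reps = 1_000;

    StopWatch sw;

    sw.start();
    foreach (r; 0 .. reps) {
        auto a = phobosSample(iota(M), N).array;
    }
    sw.stop();
    writefln("std.random.randomSample: %s ms", sw.peek().msecs);

    sw.reset();
    sw.start();
    foreach (r; 0 .. reps) {
        auto b = randomSample(N, M);  // the hash-table version above
    }
    sw.stop();
    writefln("hash-table version:      %s ms", sw.peek().msecs);
}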
June 04, 2013
On 06/04/2013 02:03 PM, Diggory wrote:
> There are still a few places to optimise, and I think compiler optimisation should give a decent speed-up as well. I'd be interested to see how it compares in your benchmark!

I'll try it out later and get back to you. :-)
June 20, 2013
On 06/04/2013 01:03 PM, Diggory wrote:
> There are still a few places to optimise, and I think compiler optimisation should give a decent speed-up as well. I'd be interested to see how it compares in your benchmark!

VERY belated apologies for not getting back to you here.  I was preparing an
email on the nice features of RandomSample -- e.g. that you can use it to sample
lines from a file -- when I realized that there was actually a bug that stopped
you from doing that:
http://d.puremagic.com/issues/show_bug.cgi?id=10265

So I stopped to prepare a fix, and then I got distracted with a bunch of things I realized could be done to improve std.random ... :-P

Anyway, I _will_ do the benchmark.  Just might be a little while.
June 21, 2013
On Thursday, 20 June 2013 at 21:48:49 UTC, Joseph Rushton Wakeling wrote:
> [...]
>
> Anyway, I _will_ do the benchmark.  Just might be a little while.

Haha, no problem!