Shuffle (page 4) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Announce » Shuffle (page 4)

January 27, 2008

Posted by 0ffh
in reply to Walter Bright

0ffh

Posted in reply to Walter Bright

Has it already been proposed to just throw files already
copied out of the list? It's a minimal amendment to the
original (changing one line and adding two):

/* Program to randomly copy music files from source to destination device.
 * Written in the D programming language.
 * Written by Walter Bright, http://www.digitalmars.com
 * Placed into the Public Domain.
 * Minimal amendment by 0ffh to copy any file <= once
 */

import std.file;
import std.stdio;
import std.string;
import std.c.stdlib;
import std.path;
import std.random;

int main(string[] args)
{
    if (args.length != 3)
    {    writefln("Usage: shuffle fromdir todir");
    exit(1);
    }
    auto fromdir = args[1];
    auto todir = args[2];

    /* Recursively search for all the mp3 and wma files in directory fromdir
     * and put them into files[]
     */
    string[] files;
    bool callback(DirEntry *de)
    {
    if (de.isdir)
        listdir(de.name, &callback); // recurse into subdirectories
    else
    {
        // Collect only files with mp3 and wma extensions
        auto ext = getExt(de.name);
        if (fnmatch(ext, "mp3") || fnmatch(ext, "wma"))
        files ~= de.name;
    }
    return true;    // keep going
    }
    std.file.listdir(fromdir, &callback);

    writefln(files.length, " music files");

    /* The loop will normally quit via an exception when the target device
     * is full. But if there are not enough files in the source to fill
     * up the target, the loop ensures it will still eventually quit.
     */
    while (files.length) // 0ffh: changed exit condition and loop type
    {
    auto j = std.random.rand() % files.length;
    auto fromfile = files[j];
    auto tofile = std.path.join(todir, basename(fromfile));
    writefln("%s => %s", fromfile, tofile);
    std.file.copy(fromfile, tofile);
    files[j]=files[$-1]; // 0ffh: copy last list element
    files.length=files.length-1; // 0ffh: decrease length
    }

    writefln("Done");
    return 0;
}

January 27, 2008

Posted by Walter Bright
in reply to Frits van Bommel

Walter Bright

Posted in reply to Frits van Bommel

Frits van Bommel wrote:
> Wrong again. The calculationof "max" should be inside the loop, since it depends on "n" which is different for each iteration.

Criminy!

January 27, 2008

Posted by Kris
in reply to Frits van Bommel

Kris

Posted in reply to Frits van Bommel

I hope tango.core.Array.shuffle got it right?

> Wrong again. The calculationof "max" should be inside the loop, since it depends on "n" which is different for each iteration.

January 27, 2008

Posted by Bill Baxter
in reply to Kris

Bill Baxter

Posted in reply to Kris

Kris wrote:
> I hope tango.core.Array.shuffle got it right?
> 
>> Wrong again. The calculationof "max" should be inside the loop, since it depends on "n" which is different for each iteration. 

It looks like Tango's shuffle doesn't attempt to eliminate the modulo bias in generating a random number.  So it doesn't have attempt to calculate any "n" or "max" to begin with.

--bb

January 28, 2008

Posted by Sean Kelly
in reply to Bill Baxter

Sean Kelly

Posted in reply to Bill Baxter

Bill Baxter wrote:
> Kris wrote:
>> I hope tango.core.Array.shuffle got it right?
>>
>>> Wrong again. The calculationof "max" should be inside the loop, since it depends on "n" which is different for each iteration.
> 
> It looks like Tango's shuffle doesn't attempt to eliminate the modulo bias in generating a random number.  So it doesn't have attempt to calculate any "n" or "max" to begin with.

The Tango shuffle uses the Knuth-Fischer-Yates shuffle algorithm but iterates from 0 .. N rather than N .. 0.  It may be that reversing the direction causes some bias and that it really has to be top-down however--I'm planning on reading up on the algorithm today to be sure. I actually stumbled on the current implementation by accident simply by following the requirements for random_shuffle in the C++ spec.

Sean

January 28, 2008

Posted by Frits van Bommel
in reply to Sean Kelly

Frits van Bommel

Posted in reply to Sean Kelly

Sean Kelly wrote:
> Bill Baxter wrote:
>> It looks like Tango's shuffle doesn't attempt to eliminate the modulo
>> bias in generating a random number.  So it doesn't have attempt to
>> calculate any "n" or "max" to begin with.
> 
> The Tango shuffle uses the Knuth-Fischer-Yates shuffle algorithm but
> iterates from 0 .. N rather than N .. 0.  It may be that reversing the
> direction causes some bias and that it really has to be top-down
> however--I'm planning on reading up on the algorithm today to be sure.
> I actually stumbled on the current implementation by accident simply by
> following the requirements for random_shuffle in the C++ spec.

It really needs to be top-down with that loop body.
The way tango.core.Array.shuffle is implemented[1], position n may be swapped with something else after having received its random replacement if n comes up as a random number afterwards.
Changing
    for( size_t pos = 2, end = buf.length; pos < buf.length; ++pos )
to
    for( size_t pos = buf.length - 1; pos > 0; --pos )
should fix it.

However, it's possible to implement a correct shuffle with an upwards-counting loop. The loop body should then be something like
    buf.swap( n, n + Rand(buf.length - n) );
(assuming 0 <= Rand(N) < N)
so that the random argument is always >= n. This is basically a reversed Knuth-Fischer-Yates shuffle.

Both versions would be better if the randomizer eliminated modulo bias, though :P.


[1]: At least, according to http://www.dsource.org/projects/tango/docs/current/source/tango.core.Array.html

January 28, 2008

Posted by Walter Bright
in reply to Frits van Bommel

Walter Bright

Posted in reply to Frits van Bommel

Frits van Bommel wrote:
> However, it's possible to implement a correct shuffle  [...]

std.random.randomShuffle() in Phobos 2.0 gets it right. The shuffle code can be replaced with:

version (D_Version2)
{
    auto rnd = Random(unpredictableSeed);
    randomShuffle(files, rnd);
}
else
{
    auto n = files.length;
    writefln(n, " music files");
    if (!n)
	return 0;

    /* Shuffle the files[] array using Durstenfeld's algorithm
     * based on the Fisher-Yates method
     */
    while (--n)
    {
	/* Pick random r in range 0..max, discarding others
	 * to eliminate modulo bias
	 */
	auto max = (typeof(std.random.rand()).max / (n + 1)) * (n + 1);
	auto r = max;
	do
	    r = std.random.rand();
	while (r >= max);

	auto j = r % (n + 1);	// pick element to swap with

	// swap [n] and [j]
	auto tmp = files[n];
	files[n] = files[j];
	files[j] = tmp;
    }
}

January 28, 2008

Posted by Sean Kelly
in reply to Frits van Bommel

Sean Kelly

Posted in reply to Frits van Bommel

Frits van Bommel wrote:
> Sean Kelly wrote:
>> Bill Baxter wrote:
>>> It looks like Tango's shuffle doesn't attempt to eliminate the modulo bias in generating a random number.  So it doesn't have attempt to calculate any "n" or "max" to begin with.
>>
>> The Tango shuffle uses the Knuth-Fischer-Yates shuffle algorithm but iterates from 0 .. N rather than N .. 0.  It may be that reversing the direction causes some bias and that it really has to be top-down however--I'm planning on reading up on the algorithm today to be sure. I actually stumbled on the current implementation by accident simply by following the requirements for random_shuffle in the C++ spec.
> 
> It really needs to be top-down with that loop body.
> The way tango.core.Array.shuffle is implemented[1], position n may be
> swapped with something else after having received its random replacement
> if n comes up as a random number afterwards.

Yup.  I just wasn't sure if this would be a problem, but it looks like it will be.  I also had an off-by-one error in the limit value being passed to rand, which actually turned it into Sattolo's algorithm.  Go figure.

> Both versions would be better if the randomizer eliminated modulo bias, though :P.

I just fixed that as well I think.  Let me know if I screwed it up:

http://dsource.org/projects/tango/changeset/3139

(A change to tango.stdc.posix.sys.stat creeped in as well.  Ignore that.)


Sean

January 28, 2008

Posted by Frits van Bommel
in reply to Sean Kelly

Frits van Bommel

Posted in reply to Sean Kelly

Sean Kelly wrote:
> Frits van Bommel wrote:
>> Both versions would be better if the randomizer eliminated modulo bias,
>> though :P.
> 
> I just fixed that as well I think.  Let me know if I screwed it up:
> 
> http://dsource.org/projects/tango/changeset/3139

Looks good, assuming libc's rand() has a uniform distribution over the lower 16 bits.

(This is probably just nitpicking:)
AFAICT that assumption isn't guaranteed by the C standard though, it seems RAND_MAX is only required to be >= 32767 (so rand() may well return 15-bit random numbers :( [1]).
Also, I can't find any mention of what the (lower-bitwise) distribution is supposed to be. For instance, if RAND_MAX happens to be (<some prime> - 1) instead of (2^N - 1) the lower bits won't be uniformly distributed even if the results of rand() are themselves distributed perfectly uniformly in [0 ... RAND_MAX].
For these reasons, it may not be a very good idea to use libc rand(). It's just underspecified AFAICT...

(I don't have the actual C standard though, just some library reference sites from my bookmarks and what Google drags up)

[1] For glibc RAND_MAX is 2^32-1 though, so this shouldn't be a problem on typical *nix systems. I'm not sure about Windows systems though; IIRC mingw (and presumably gdcwin) uses the MS libc and DMD uses its own libc...

January 28, 2008

Posted by Sean Kelly
in reply to Frits van Bommel

Sean Kelly

Posted in reply to Frits van Bommel

Frits van Bommel wrote:
> Sean Kelly wrote:
>> Frits van Bommel wrote:
>>> Both versions would be better if the randomizer eliminated modulo bias, though :P.
>>
>> I just fixed that as well I think.  Let me know if I screwed it up:
>>
>> http://dsource.org/projects/tango/changeset/3139
> 
> Looks good, assuming libc's rand() has a uniform distribution over the
> lower 16 bits.

Cool, thanks.

> (This is probably just nitpicking:)
> AFAICT that assumption isn't guaranteed by the C standard though, it
> seems RAND_MAX is only required to be >= 32767 (so rand() may well
> return 15-bit random numbers :( [1]).
> Also, I can't find any mention of what the (lower-bitwise) distribution
> is supposed to be. For instance, if RAND_MAX happens to be (<some prime>
> - 1) instead of (2^N - 1) the lower bits won't be uniformly distributed
> even if the results of rand() are themselves distributed perfectly
> uniformly in [0 ... RAND_MAX].
> For these reasons, it may not be a very good idea to use libc rand().
> It's just underspecified AFAICT...

Yup.  I really just use C's rand as a default because it's available and doing so avoids creating a dependency on tango.math.Random.  I half considered just requiring the user to supply a randomizer, but I think people would complain.  Still, I'd expect anyone who was really serious about having a uniform distribution would supply one--I intend to add an adapter to the Random module to simplify plugging it into the shuffle routine to simplify this.

> (I don't have the actual C standard though, just some library reference sites from my bookmarks and what Google drags up)

I have it, here's what it says:

"The rand function computes a sequence of pseudo-random integers in the range 0 to RAND_MAX."

Not much there, huh? :-)  It's also worth mentioning that C's rand function isn't required to be threadsafe, so it's not a wonderful choice for a number of reasons.

> [1] For glibc RAND_MAX is 2^32-1 though, so this shouldn't be a problem
> on typical *nix systems. I'm not sure about Windows systems though; IIRC
> mingw (and presumably gdcwin) uses the MS libc and DMD uses its own libc...

The current impl in Tango assumes only 16 bits of randomness, so rand is called twice to produce a 32-bit value.  I thought about switching based on RAND_MAX, but that would require it to be up to date for all OSes, so being conservative just seemed easier.

Sean

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation