Thread overview
[Issue 3738] New: MinstdRand0 and MinstdRand very poor performance
Jan 24, 2010
Witold Baryluk
Jan 24, 2010
Witold Baryluk
Jan 24, 2010
Witold Baryluk
Jan 24, 2010
Witold Baryluk
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738

           Summary: MinstdRand0 and MinstdRand very poor performance
           Product: D
           Version: 2.039
          Platform: x86
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: baryluk@smp.if.uj.edu.pl


--- Comment #0 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-23 17:52:24 PST ---
MinstdRand0 is instantiation of CongruentialGenerator with constants
  a = 16807
  c = 0
  m = 2^^31 - 1

most important method of CongruetialGenerator popFront, doesn't take into the
account special structure of c,
which makes this generator quite slow especially compared to Mersen Twister
Mt19937.

This is important especially on the 32bit machines, because modulo operation (%) is done using external (and notinlined) method __ULDIV__ which computes unsigned long division and reminder. But it is also a problem on 64bit machines, becuase even in hardware modulo operation is slow.

Currently compiler uses known trick for x % c, when c=2^^k is constant. It changes it to binary and: x & (c - 1)

This is done to uint directly. For signed x it compiler emits 2 additional 'sub's to compensate for sign. For smaller than int it also works. If c=2^^k is bigger than x.max it should be noop (and maybe small warning), but it isn't really important. For ulong on 32bit it even works smarter by just anding with only upper register (or lower and zeroing upper). This is all done even without -O switch.

Not many programer know that there is also a trick for constant c = 2^^k - 1.

You can find some information about it (special case c= 2^^32 - 1) in this
article on page 6.

G. Marsaglia, 'On the randomness of Pi and other decimal expansions', Interstat, October 2005, #5, http://interstat.statjournals.net/YEAR/2005/articles/0510005.pdf


What is important for us is that currently MinstdRand{,0} is very slow. Much
slower than MersenTwistter,
which is more complex than MinstdRand, which is quite supprising.

I propose changing current popFront of LinearCongruentialEngine:

     void popFront()
     {
         static if (m)
-            _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
+            static if (is(UIntType == uint) && m == 4294967295uL) {
+                 const ulong x = (cast(ulong) a * _x + c);
+                 const ulong v = x >> 32;
+                 const ulong w = (x & 0xffff_ffffuL);
+                 const uint y = cast(uint)(v+w);
+                 _x = (y < v || y >= 4294967295u) ? (y+1) : y;
+            } else static if (is(UIntType == uint) && m == 2147483647u) {
+                 const ulong x = (cast(ulong) a * _x + c);
+                 const ulong v = x >> 31;
+                 const ulong w = (x & 0x7fff_ffffuL);
+                 const uint y = cast(uint)(v+w);
+                 _x = (y < v || y >= 2147483647u) ? ((y+1) & 0x7fff_ffffu) :
y;
+            } else {
+                _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
+            }
         else
             _x = a * _x + c;
     }

According to my test first 10_000_000_000 random numbers (few periods) are exactly the same with and without this change. I provided also 2^^32-1 version, I tested it in different place, but there are also some generator with such modulus.

It can by also easly changed to 2^^63-1 and 2^^64-1, but it will need ucent
support,
which we lack currently. But it will allow implementing one very good
generators
(read bellow).

I hope it can be made general and puted directly into backends of DVD and LLVM,
GCC to make it simpler
for programer. Currently non of D or C compiler which I tested uses this trick.

Time mesurments for original code summarizes table.
All test performed under Debian Gnu/Linux, unstable, dmd 2.039, Linux
2.6.33-rc4-sredniczarny-00399-g24bc734-dirty.
Cpu:
model name    : Intel(R) Pentium(R) M processor 1.73GHz
stepping    : 8
cpu MHz        : 1733.000
cache size    : 2048 KB

user root, realtime RR scheduler, priority 50 (this gives deviations smaller
than 1%).

compiled in "-O -inline -release" mode, with all generators in the same file as main function.

1_000_000_000 sample points.

MinstdRand0, realtime rr 50:
   standard: 44.62 s
   inline: 41.04
   standard with tricks: 17.03 s
   inline with tricks: 17.00 s

MinstdRand, realtime rr 50
  standard: 44.88 s
  inline: 41.06
  standard with tricks: 17.05 s
  inline with tricks: 17.02 s

Mt19937, realtime rr 50:
  standard: 33.93 s



"Inline" is manually inlined version (interesingly it isn't always faster), probably some problems with registers, because dmd2 isn't inlining popFront and front functions unfortunetly (even that they are very simple). Tricks is version with patch above.

As you can see after changes this method is about 2.6 times faster. And it is faster than Mersen Twister.

This is quite interesting, becuase patched code have 2 branches, 3 binary ands, and 2 additions. But as maybe few know, DIV/MOD operation is done using quite big external (notinlined) assembler routing __ULDIV__, which is just slow for us. So it can be usefull to also create utility function for our special mod operation.

Unittests still passes.






MinstdRand0 and MinstdRand are pretty good generators, for this speed.
But unfortunetly they fail lots of tests (see TestU01 documentation). Mosly
becuase of small period.

I know that they are in Phobos primarly becuase of simplicity (which maybe my patch destroys, if so please incorporate it directly into compiler, so also others can be beneficial of this).


There are only few which are better, so simple and very fast.

One is m=2^^61-1,a=2^^30-2^^19. But unfortunetly it is slow on 32bit machines
(128bit intermidiates). It is fast on 64bit machines (about 10% slower than
MinstdRand0), but have much better statistical properties (see Wu 1997).

One of very interesting and extra simple generator is Marsa-LFIB4 by Marsaglia.
It is just:
   x_i = (x_{i-55} + x_{i-119} + x_{i-179} + x_{i-256}) mod 2^^32
no multiplications and no complicated modulo operations. It is very fast
(especially on 64-bit machines), have period of 2^^287 and passes practically
all tests (actually all). 2 lines of code.

Other is KISS99 also by Marsaglia. it passes also all tests. Period 2^^123. It is combination of 3 very simple generators. 4 Lines of code.

There also few LaggedFibbonaci generators which are ultra fast. But only on 64bit machines. They passes all tests. One example is LFib(2^^64, 1279, 861, *) with period 2^^1340.

I also know one fast random generator derived from cryptographic primitieves, notably from Rijandle/AES cipher. It is known as ISAAC by Jenkins and is fvery ast, and passes all tests. Code have about 130 lines.

There is also Matlab randn function which is simple (xor combination of simple LCG with m=2^^32 and simple shifting generator. both by Marsaglia) and have period 2^^64, is very fast and passes most tests. It is used there for Normal varietes, but as integer generator it is very good.


All above presented generators are faster and better (according to tests) than
Mt19937.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738



--- Comment #1 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-23 18:29:30 PST ---
Ok, sorry. Not all are better than Mt19937. But definitly better than MinstdRand{,0}. And few are acutally better (empirically).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei@metalanguage.com


--- Comment #2 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-01-23 23:09:46 PST ---
Thanks for the fix. I'll incorporate the patch into Phobos. Walter will decide whether he wants to pursue a compiler change.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738


Witold Baryluk <baryluk@smp.if.uj.edu.pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW


--- Comment #3 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-24 08:35:08 PST ---
One can observe that ?: line (branch2)
        _x = (y < v || y >= 2147483647) ? ((y+1) & 0x7fff_ffffu) : y;
can be changed , first to not contain not needed y<v, to  (branch1)
        _x = (y >= 2147483647) ? ((y+1) & 0x7fff_ffffu) : y;
or even remove branch completly (branchlessA)
        _x = (y & 0x7fff_ffff) + (y >> 31);
and possibly rearange terms (branchlessB)
        _x = (y + (y >> 31)) & 0x7fff_ffff;


MinstdRnd0
  standard: 44.65 s
  inline: 41.11 s
  standard with tricks (2 branches): 17.08 s
  inline with tricks (2 branches): 17.04 s
  standard with tricks (1 branch): 15.24 s
  inline with tricks (1 branch): 13.59 s
  standard with tricks (branchlesA): 16.50 s
  inline with tricks (branchelesA): 15.82 s
  standard with tricks (branchlesB):  17.74 s
  inline with tricks (branchelesB): 15.30 s

MinstdRnd
  standard: 44.95 s
  inline: 41.17 s
  standard with tricks (2 branch): 17.04 s
  inline with tricks (2 branche): 17.03 s
  standard with tricks (1 branch): 15.29 s
  inline with tricks (1 branche): 13.54 s
  standard with tricks (branchlesA): 16.46 s
  inline with tricks (branchelessA): 15.88 s
  standard with tricks (branchlesB): 18.10 s
  inline with tricks (branchelessB): 15.31 s

So fastest is currently branch1 version.

Updated patch:

     void popFront()
     {
         static if (m)
-            _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
+            static if (is(UIntType == uint) && m == 4294967295uL) {
+                 const ulong x = (cast(ulong) a * _x + c);
+                 const ulong v = x >> 32;
+                 const ulong w = (x & 0xffff_ffffuL);
+                 const uint y = cast(uint)(v+w);
+                 _x = (y < v || y >= 4294967295u) ? (y+1) : y;
+            } else static if (is(UIntType == uint) && m == 2147483647u) {
+                 const ulong x = (cast(ulong) a * _x + c);
+                 const ulong v = x >> 31;
+                 const ulong w = (x & 0x7fff_ffffuL);
+                 const uint y = cast(uint)(v+w);
+                 _x = (y >= 2147483647u) ? ((y+1) & 0x7fff_ffffu) : y;
+            } else {
+                _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
+            }
         else
             _x = a * _x + c;
     }

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
         AssignedTo|nobody@puremagic.com        |andrei@metalanguage.com


--- Comment #4 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-01-24 09:19:04 PST ---
Here's the modified version as I plan to commit:

    void popFront()
    {
        static if (m)
        {
            static if (is(UIntType == uint) && m == uint.max)
            {
                immutable ulong
                    x = (cast(ulong) a * _x + c),
                    v = x >> 32,
                    w = x & uint.max;
                immutable y = cast(uint)(v + w);
                _x = (y < v || y == uint.max) ? (y + 1) : y;
            }
            else static if (is(UIntType == uint) && m == int.max)
            {
                immutable ulong
                    x = (cast(ulong) a * _x + c),
                    v = x >> 31,
                    w = x & int.max;
                const uint y = cast(uint)(v + w);
                _x = (y >= int.max) ? ((y + 1) & int.max) : y;
            }
            else
            {
                _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
            }
        }
        else
        {
            _x = a * _x + c;
        }
    }

I changed formatting a bit and literals with symbolic constants. Unfortunately unittests don't fail when I slightly change some constants :o|.

Looks good?

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738



--- Comment #5 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-24 12:03:20 PST ---
All my tests pases on your version.
10_000_000_000 (more than all possibilities) passes (for MinstdRand0 and
MinstdRand, it is possible that for other a it will fail, but i tested and done
some calculation.  and it should work for all a,c<m.
I also tested versions with slightly changed constants (like 30 instad of 31,
or int.max-1, int.max-2, int.max-3, int.max+1 in multiple comibinations an
allone), and all them fail.


Shouldn't const uint y by immucatable y, just like in first static if?


I think that our pretty optimised line, can be further optimalised: (1 branch
sub)

-             _x = (y >= int.max) ? ((y + 1) & int.max) : y;
+             _x = (y >= int.max) ? (y - int.max) : y;

:)

This gives:
MinstdRand0:
  standard with tricks (1 branch sub): 15.25
  inline with tricks (1 branch sub): 13.53 s
MinstdRand:
  standard with tricks (1 branch sub): 15.26
  inline with tricks (1 branch sub): 13.52 s

It doesn't give much difference (maybe 2%). But i thinkg it is better to have one substraction, than one add and one binary and.



I found few interesting materials about this m=2^^31-1 operation. One is http://www.firstpr.com.au/dsp/rand31/

I tested this Carta's versions: But they are only relevant for a=16807 (out Park-Miller's StdminRand0 generator), so pretty limited:

Carta1 (original):
      uint lo = 16807*(_x & 0xFFFFu);
      immutable uint hi = 16807*(_x >> 16);
      lo += (hi & 0x7FFF) << 16;
      lo += (hi >> 15);
      _x = (lo > int.max) ? (lo - int.max) : lo;

Carta3:
      uint lo = 16807*(_x & 0xFFFFu);
      immutable uint hi = 16807*(_x >> 16);
      lo += ((hi & 0x7FFF) << 16) + (hi >> 15);
      _x = (lo > int.max) ? (lo - int.max) : lo;

Carta2:
        immutable uint
           hi = 16807*(_x >> 16),
           lo 16807*(_x & 0xFFFFu) + ((hi & 0x7FFF) << 16) + (hi >> 15);
           _x = (lo > int.max) ? (lo - int.max) : lo;

Last lines can also be changed to branchles:
           _x = (lo > int.max) ? (lo - int.max) : lo;

Timings:
Carta code1: 22.30 s
Carta code1 branchless: 18.42
Carta code3: 21.84 s
Carta code3 branchless: 18.19 s
Carta code2: 23.84 s
Carta code2 branchless: 19.84 s

So it is slower in all cases, and it is only for speciall cases (c=0, a < 2^^15). Of course answers are still correct.


Performing uint*uint+uint -> ulong is just faster than their tricks, and more general, and can be reused in many other places.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=3738


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #6 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-01-24 15:32:46 PST ---
(In reply to comment #5)
> All my tests pases on your version.
[snip]

Thanks! Committed with revision 1406.

Any chance you send us your unittests so I can integrate them within the module?

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------