Thread overview | |||||||||
---|---|---|---|---|---|---|---|---|---|
|
January 24, 2010 [Issue 3738] New: MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
http://d.puremagic.com/issues/show_bug.cgi?id=3738 Summary: MinstdRand0 and MinstdRand very poor performance Product: D Version: 2.039 Platform: x86 OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: Phobos AssignedTo: nobody@puremagic.com ReportedBy: baryluk@smp.if.uj.edu.pl --- Comment #0 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-23 17:52:24 PST --- MinstdRand0 is instantiation of CongruentialGenerator with constants a = 16807 c = 0 m = 2^^31 - 1 most important method of CongruetialGenerator popFront, doesn't take into the account special structure of c, which makes this generator quite slow especially compared to Mersen Twister Mt19937. This is important especially on the 32bit machines, because modulo operation (%) is done using external (and notinlined) method __ULDIV__ which computes unsigned long division and reminder. But it is also a problem on 64bit machines, becuase even in hardware modulo operation is slow. Currently compiler uses known trick for x % c, when c=2^^k is constant. It changes it to binary and: x & (c - 1) This is done to uint directly. For signed x it compiler emits 2 additional 'sub's to compensate for sign. For smaller than int it also works. If c=2^^k is bigger than x.max it should be noop (and maybe small warning), but it isn't really important. For ulong on 32bit it even works smarter by just anding with only upper register (or lower and zeroing upper). This is all done even without -O switch. Not many programer know that there is also a trick for constant c = 2^^k - 1. You can find some information about it (special case c= 2^^32 - 1) in this article on page 6. G. Marsaglia, 'On the randomness of Pi and other decimal expansions', Interstat, October 2005, #5, http://interstat.statjournals.net/YEAR/2005/articles/0510005.pdf What is important for us is that currently MinstdRand{,0} is very slow. Much slower than MersenTwistter, which is more complex than MinstdRand, which is quite supprising. I propose changing current popFront of LinearCongruentialEngine: void popFront() { static if (m) - _x = cast(UIntType) ((cast(ulong) a * _x + c) % m); + static if (is(UIntType == uint) && m == 4294967295uL) { + const ulong x = (cast(ulong) a * _x + c); + const ulong v = x >> 32; + const ulong w = (x & 0xffff_ffffuL); + const uint y = cast(uint)(v+w); + _x = (y < v || y >= 4294967295u) ? (y+1) : y; + } else static if (is(UIntType == uint) && m == 2147483647u) { + const ulong x = (cast(ulong) a * _x + c); + const ulong v = x >> 31; + const ulong w = (x & 0x7fff_ffffuL); + const uint y = cast(uint)(v+w); + _x = (y < v || y >= 2147483647u) ? ((y+1) & 0x7fff_ffffu) : y; + } else { + _x = cast(UIntType) ((cast(ulong) a * _x + c) % m); + } else _x = a * _x + c; } According to my test first 10_000_000_000 random numbers (few periods) are exactly the same with and without this change. I provided also 2^^32-1 version, I tested it in different place, but there are also some generator with such modulus. It can by also easly changed to 2^^63-1 and 2^^64-1, but it will need ucent support, which we lack currently. But it will allow implementing one very good generators (read bellow). I hope it can be made general and puted directly into backends of DVD and LLVM, GCC to make it simpler for programer. Currently non of D or C compiler which I tested uses this trick. Time mesurments for original code summarizes table. All test performed under Debian Gnu/Linux, unstable, dmd 2.039, Linux 2.6.33-rc4-sredniczarny-00399-g24bc734-dirty. Cpu: model name : Intel(R) Pentium(R) M processor 1.73GHz stepping : 8 cpu MHz : 1733.000 cache size : 2048 KB user root, realtime RR scheduler, priority 50 (this gives deviations smaller than 1%). compiled in "-O -inline -release" mode, with all generators in the same file as main function. 1_000_000_000 sample points. MinstdRand0, realtime rr 50: standard: 44.62 s inline: 41.04 standard with tricks: 17.03 s inline with tricks: 17.00 s MinstdRand, realtime rr 50 standard: 44.88 s inline: 41.06 standard with tricks: 17.05 s inline with tricks: 17.02 s Mt19937, realtime rr 50: standard: 33.93 s "Inline" is manually inlined version (interesingly it isn't always faster), probably some problems with registers, because dmd2 isn't inlining popFront and front functions unfortunetly (even that they are very simple). Tricks is version with patch above. As you can see after changes this method is about 2.6 times faster. And it is faster than Mersen Twister. This is quite interesting, becuase patched code have 2 branches, 3 binary ands, and 2 additions. But as maybe few know, DIV/MOD operation is done using quite big external (notinlined) assembler routing __ULDIV__, which is just slow for us. So it can be usefull to also create utility function for our special mod operation. Unittests still passes. MinstdRand0 and MinstdRand are pretty good generators, for this speed. But unfortunetly they fail lots of tests (see TestU01 documentation). Mosly becuase of small period. I know that they are in Phobos primarly becuase of simplicity (which maybe my patch destroys, if so please incorporate it directly into compiler, so also others can be beneficial of this). There are only few which are better, so simple and very fast. One is m=2^^61-1,a=2^^30-2^^19. But unfortunetly it is slow on 32bit machines (128bit intermidiates). It is fast on 64bit machines (about 10% slower than MinstdRand0), but have much better statistical properties (see Wu 1997). One of very interesting and extra simple generator is Marsa-LFIB4 by Marsaglia. It is just: x_i = (x_{i-55} + x_{i-119} + x_{i-179} + x_{i-256}) mod 2^^32 no multiplications and no complicated modulo operations. It is very fast (especially on 64-bit machines), have period of 2^^287 and passes practically all tests (actually all). 2 lines of code. Other is KISS99 also by Marsaglia. it passes also all tests. Period 2^^123. It is combination of 3 very simple generators. 4 Lines of code. There also few LaggedFibbonaci generators which are ultra fast. But only on 64bit machines. They passes all tests. One example is LFib(2^^64, 1279, 861, *) with period 2^^1340. I also know one fast random generator derived from cryptographic primitieves, notably from Rijandle/AES cipher. It is known as ISAAC by Jenkins and is fvery ast, and passes all tests. Code have about 130 lines. There is also Matlab randn function which is simple (xor combination of simple LCG with m=2^^32 and simple shifting generator. both by Marsaglia) and have period 2^^64, is very fast and passes most tests. It is used there for Normal varietes, but as integer generator it is very good. All above presented generators are faster and better (according to tests) than Mt19937. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
January 24, 2010 [Issue 3738] MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
Posted in reply to Witold Baryluk | http://d.puremagic.com/issues/show_bug.cgi?id=3738 --- Comment #1 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-23 18:29:30 PST --- Ok, sorry. Not all are better than Mt19937. But definitly better than MinstdRand{,0}. And few are acutally better (empirically). -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
January 24, 2010 [Issue 3738] MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
Posted in reply to Witold Baryluk | http://d.puremagic.com/issues/show_bug.cgi?id=3738 Andrei Alexandrescu <andrei@metalanguage.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED CC| |andrei@metalanguage.com --- Comment #2 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-01-23 23:09:46 PST --- Thanks for the fix. I'll incorporate the patch into Phobos. Walter will decide whether he wants to pursue a compiler change. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
January 24, 2010 [Issue 3738] MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
Posted in reply to Witold Baryluk | http://d.puremagic.com/issues/show_bug.cgi?id=3738 Witold Baryluk <baryluk@smp.if.uj.edu.pl> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEW --- Comment #3 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-24 08:35:08 PST --- One can observe that ?: line (branch2) _x = (y < v || y >= 2147483647) ? ((y+1) & 0x7fff_ffffu) : y; can be changed , first to not contain not needed y<v, to (branch1) _x = (y >= 2147483647) ? ((y+1) & 0x7fff_ffffu) : y; or even remove branch completly (branchlessA) _x = (y & 0x7fff_ffff) + (y >> 31); and possibly rearange terms (branchlessB) _x = (y + (y >> 31)) & 0x7fff_ffff; MinstdRnd0 standard: 44.65 s inline: 41.11 s standard with tricks (2 branches): 17.08 s inline with tricks (2 branches): 17.04 s standard with tricks (1 branch): 15.24 s inline with tricks (1 branch): 13.59 s standard with tricks (branchlesA): 16.50 s inline with tricks (branchelesA): 15.82 s standard with tricks (branchlesB): 17.74 s inline with tricks (branchelesB): 15.30 s MinstdRnd standard: 44.95 s inline: 41.17 s standard with tricks (2 branch): 17.04 s inline with tricks (2 branche): 17.03 s standard with tricks (1 branch): 15.29 s inline with tricks (1 branche): 13.54 s standard with tricks (branchlesA): 16.46 s inline with tricks (branchelessA): 15.88 s standard with tricks (branchlesB): 18.10 s inline with tricks (branchelessB): 15.31 s So fastest is currently branch1 version. Updated patch: void popFront() { static if (m) - _x = cast(UIntType) ((cast(ulong) a * _x + c) % m); + static if (is(UIntType == uint) && m == 4294967295uL) { + const ulong x = (cast(ulong) a * _x + c); + const ulong v = x >> 32; + const ulong w = (x & 0xffff_ffffuL); + const uint y = cast(uint)(v+w); + _x = (y < v || y >= 4294967295u) ? (y+1) : y; + } else static if (is(UIntType == uint) && m == 2147483647u) { + const ulong x = (cast(ulong) a * _x + c); + const ulong v = x >> 31; + const ulong w = (x & 0x7fff_ffffuL); + const uint y = cast(uint)(v+w); + _x = (y >= 2147483647u) ? ((y+1) & 0x7fff_ffffu) : y; + } else { + _x = cast(UIntType) ((cast(ulong) a * _x + c) % m); + } else _x = a * _x + c; } -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
January 24, 2010 [Issue 3738] MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
Posted in reply to Witold Baryluk | http://d.puremagic.com/issues/show_bug.cgi?id=3738 Andrei Alexandrescu <andrei@metalanguage.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED AssignedTo|nobody@puremagic.com |andrei@metalanguage.com --- Comment #4 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-01-24 09:19:04 PST --- Here's the modified version as I plan to commit: void popFront() { static if (m) { static if (is(UIntType == uint) && m == uint.max) { immutable ulong x = (cast(ulong) a * _x + c), v = x >> 32, w = x & uint.max; immutable y = cast(uint)(v + w); _x = (y < v || y == uint.max) ? (y + 1) : y; } else static if (is(UIntType == uint) && m == int.max) { immutable ulong x = (cast(ulong) a * _x + c), v = x >> 31, w = x & int.max; const uint y = cast(uint)(v + w); _x = (y >= int.max) ? ((y + 1) & int.max) : y; } else { _x = cast(UIntType) ((cast(ulong) a * _x + c) % m); } } else { _x = a * _x + c; } } I changed formatting a bit and literals with symbolic constants. Unfortunately unittests don't fail when I slightly change some constants :o|. Looks good? -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
January 24, 2010 [Issue 3738] MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
Posted in reply to Witold Baryluk | http://d.puremagic.com/issues/show_bug.cgi?id=3738 --- Comment #5 from Witold Baryluk <baryluk@smp.if.uj.edu.pl> 2010-01-24 12:03:20 PST --- All my tests pases on your version. 10_000_000_000 (more than all possibilities) passes (for MinstdRand0 and MinstdRand, it is possible that for other a it will fail, but i tested and done some calculation. and it should work for all a,c<m. I also tested versions with slightly changed constants (like 30 instad of 31, or int.max-1, int.max-2, int.max-3, int.max+1 in multiple comibinations an allone), and all them fail. Shouldn't const uint y by immucatable y, just like in first static if? I think that our pretty optimised line, can be further optimalised: (1 branch sub) - _x = (y >= int.max) ? ((y + 1) & int.max) : y; + _x = (y >= int.max) ? (y - int.max) : y; :) This gives: MinstdRand0: standard with tricks (1 branch sub): 15.25 inline with tricks (1 branch sub): 13.53 s MinstdRand: standard with tricks (1 branch sub): 15.26 inline with tricks (1 branch sub): 13.52 s It doesn't give much difference (maybe 2%). But i thinkg it is better to have one substraction, than one add and one binary and. I found few interesting materials about this m=2^^31-1 operation. One is http://www.firstpr.com.au/dsp/rand31/ I tested this Carta's versions: But they are only relevant for a=16807 (out Park-Miller's StdminRand0 generator), so pretty limited: Carta1 (original): uint lo = 16807*(_x & 0xFFFFu); immutable uint hi = 16807*(_x >> 16); lo += (hi & 0x7FFF) << 16; lo += (hi >> 15); _x = (lo > int.max) ? (lo - int.max) : lo; Carta3: uint lo = 16807*(_x & 0xFFFFu); immutable uint hi = 16807*(_x >> 16); lo += ((hi & 0x7FFF) << 16) + (hi >> 15); _x = (lo > int.max) ? (lo - int.max) : lo; Carta2: immutable uint hi = 16807*(_x >> 16), lo 16807*(_x & 0xFFFFu) + ((hi & 0x7FFF) << 16) + (hi >> 15); _x = (lo > int.max) ? (lo - int.max) : lo; Last lines can also be changed to branchles: _x = (lo > int.max) ? (lo - int.max) : lo; Timings: Carta code1: 22.30 s Carta code1 branchless: 18.42 Carta code3: 21.84 s Carta code3 branchless: 18.19 s Carta code2: 23.84 s Carta code2 branchless: 19.84 s So it is slower in all cases, and it is only for speciall cases (c=0, a < 2^^15). Of course answers are still correct. Performing uint*uint+uint -> ulong is just faster than their tricks, and more general, and can be reused in many other places. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
January 24, 2010 [Issue 3738] MinstdRand0 and MinstdRand very poor performance | ||||
---|---|---|---|---|
| ||||
Posted in reply to Witold Baryluk | http://d.puremagic.com/issues/show_bug.cgi?id=3738 Andrei Alexandrescu <andrei@metalanguage.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED --- Comment #6 from Andrei Alexandrescu <andrei@metalanguage.com> 2010-01-24 15:32:46 PST --- (In reply to comment #5) > All my tests pases on your version. [snip] Thanks! Committed with revision 1406. Any chance you send us your unittests so I can integrate them within the module? -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
Copyright © 1999-2021 by the D Language Foundation