[Bench!][Mir] +54%..+185% performance boost for Mersenne Twister. (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister. (page 2)

November 27, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by jmh530
in reply to Ilya Yaroshenko

jmh530

Posted in reply to Ilya Yaroshenko

On Sunday, 27 November 2016 at 06:46:20 UTC, Ilya Yaroshenko wrote:
>
> LAPACK API will be a part of Mir GLAS
> https://github.com/libmir/mir-glas --Ilya

Great. I might have asked before, apologies if so.

November 28, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Jacob Carlborg
in reply to Guillaume Piolat

Jacob Carlborg

Posted in reply to Guillaume Piolat

On 2016-11-27 14:28, Guillaume Piolat wrote:

> Without druntime, global ctor/dtor and TLS can't be used too.

Isn't TLS initialized by the OS?

-- 
/Jacob Carlborg

November 28, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Steven Schveighoffer
in reply to Guillaume Piolat

Steven Schveighoffer

Posted in reply to Guillaume Piolat

On 11/27/16 8:58 AM, Guillaume Piolat wrote:
> On Sunday, 27 November 2016 at 13:35:48 UTC, Andrei Alexandrescu wrote:
>>
>>> A useful intermediate step is to have these "[shared] static this" ctor
>>> call a function instead, so that programs without druntime can call them
>>> too.
>>
>> That would be progress.
>>
>
> Same story for core.cpuid which is initialized by a shared static
> constructor.
> https://github.com/dlang/druntime/blob/master/src/core/cpuid.d
>
> It seems the "shared static this" is allowed to break immutable but a
> function wouldn't. So maybe the variables here should be private
> __gshared?.

How we do a similar thing for core.time:

https://github.com/dlang/druntime/blob/master/src/core/time.d#L2388

This function is called from the druntime startup (before any static ctors).

Not sure if it helps or not.

Note that we can cheat the type system here because we know that nothing else will access the data before it's initialized. There is also a check in the function to ensure it's only done once.

-Steve

November 29, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Nordlöw
in reply to Ilya Yaroshenko

Nordlöw

Posted in reply to Ilya Yaroshenko

On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko wrote:
> Bench results:
> ----
> mir.random 32-bit Mt19937:
>     6.80851 Gb/s

Does Gb mean Gigabytes or Gigabits?

November 29, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Ilya Yaroshenko
in reply to Nordlöw

Ilya Yaroshenko

Posted in reply to Nordlöw

On Tuesday, 29 November 2016 at 16:54:55 UTC, Nordlöw wrote:
> On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko wrote:
>> Bench results:
>> ----
>> mir.random 32-bit Mt19937:
>>     6.80851 Gb/s
>
> Does Gb mean Gigabytes or Gigabits?

Gigabits

December 13, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Joseph Rushton Wakeling
in reply to Andrei Alexandrescu

Joseph Rushton Wakeling

Posted in reply to Andrei Alexandrescu

On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei Alexandrescu wrote:
> Congrats! Also thanks for using the Boost license which would allow backporting the improvements to Phobos. Who'd be up for it?

I've finally found a moment to look into this (I'm at home recovering from a seasonal virus, which ironically puts some time and mental space in my hands).

It makes for an interesting comparison.  The traditional Mersenne Twister algorithm generates N random words in one go, and then iterates over them, before generating another N random words, and so on.  (Word == unsigned integer type.)

By contrast Ilya's implementation just updates the single next entry in the static array that stores the state of the engine.  This would seem to work out cheaper on average: perhaps because the update-a-variate code stays hotter, perhaps because it involves less branching and looping.  It certainly makes for a simpler, easier to follow update.

However, it might be a good idea to benchmark this in a context where there are a lot of random variates being generated, but where other things are also happening that might render the update code less "hot" than a straightforward `foreach` loop benchmark.  (Maybe try it out with a Monte Carlo simulation ...?)  The tradeoff between many cheap updates and one expensive one, versus all updates being a little more expensive but less on average, might not be viable in some practical use-cases.

The Phobos implementation also includes a check on whether the generator is properly seeded, which is performed in every `popFront()`, which has a reasonably significant impact.  Ilya's implementation can get away with not performing that check because it uses `@disable this();` to remove the possibility of initializing the generator without it being properly seeded.  The impact of that check can possibly be lessened by making it a boolean comparison rather than the existing `size_t` equality comparison.

  -- action item here: could we deprecate `this()` in Phobos as a precursor to removing
     the opportunity to instantiate a generator without properly initializing it?

I'm going to try to put together a range-based version to see if this also makes any difference.  I'll post some benchmarks of my own once that's done, and if all looks good I'll try to put a Phobos PR together.

A few questions -- not blockers, but nice to understand:

  * @Ilya, is this implementation your own design, or do you have a reference
    for the rationale behind this revised implementation of MT?

  * Is there a particular reason why the `index` variable is a `Uint` (i.e.
    the word type of the generator) rather than a `size_t`?  I presume there's
    a motivation, since the casting back and forth in `opCall` would otherwise
    seem inconvenient.

  * Is there a motivation for the reverse iteration through the RNG state array?

December 13, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Joseph Rushton Wakeling
in reply to Joseph Rushton Wakeling

Joseph Rushton Wakeling

Posted in reply to Joseph Rushton Wakeling

On Tuesday, 13 December 2016 at 18:15:25 UTC, Joseph Rushton Wakeling wrote:
> I'm going to try to put together a range-based version to see if this also makes any difference.  I'll post some benchmarks of my own once that's done, and if all looks good I'll try to put a Phobos PR together.

Benchmark is here:
https://github.com/WebDrake/mersenne-twister-range

Overall these benchmarks seem to have reasonably wide variance.  I note that the original benchmark's attempt to make the code "hot" had quite a strong impact in terms of the ordering of individual benchmarks; I've tried to work around that by doing the "make it hot" loop immediately before the benchmark.

This is still just a plain old foreach, so it may be worth exploring some benchmarks where other stuff is done with the random variates.  As a first step I've added some benchmarks using `uniform`.

Sample output from 2 different runs of these benchmarks:

-----
Running ./mir-mt-range
opCall Mir 32-bit Mt19937: 6.68058 Gb/s
check sum = 429472035921730457

range Mir 32-bit Mt19937: 6.639 Gb/s
check sum = 429472031909629874

opCall Mir 64-bit Mt19937: 14.0351 Gb/s
check sum = 5606740663277085587

range Mir 64-bit Mt19937: 13.8229 Gb/s
check sum = 3636092631268073456

Phobos 32-bit Mt19937: 4.22164 Gb/s
check sum = 429472035921730457

Phobos 64-bit Mt19937: wrong initialization and tempering

range Mir 32-bit Mt19937: 95.4198 U(0.0, 1.0) variates/s

range Mir 64-bit Mt19937: 87.184 U(0.0, 1.0) variates/s

Phobos 32-bit Mt19937: 80.8407 U(0.0, 1.0) variates/s
----

----
Running ./mir-mt-range
opCall Mir 32-bit Mt19937: 6.36183 Gb/s
check sum = 429472035921730457

range Mir 32-bit Mt19937: 6.54397 Gb/s
check sum = 429472031909629874

opCall Mir 64-bit Mt19937: 13.7339 Gb/s
check sum = 5606740663277085587

range Mir 64-bit Mt19937: 13.8229 Gb/s
check sum = 3636092631268073456

Phobos 32-bit Mt19937: 4.18848 Gb/s
check sum = 429472035921730457

Phobos 64-bit Mt19937: wrong initialization and tempering

range Mir 32-bit Mt19937: 95.2381 U(0.0, 1.0) variates/s

range Mir 64-bit Mt19937: 97.0874 U(0.0, 1.0) variates/s

Phobos 32-bit Mt19937: 88.1057 U(0.0, 1.0) variates/s
----

December 13, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Ilya Yaroshenko
in reply to Joseph Rushton Wakeling

Ilya Yaroshenko

Posted in reply to Joseph Rushton Wakeling

On Tuesday, 13 December 2016 at 18:15:25 UTC, Joseph Rushton Wakeling wrote:
> On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei Alexandrescu wrote:
>> Congrats! Also thanks for using the Boost license which would allow backporting the improvements to Phobos. Who'd be up for it?
>
> I've finally found a moment to look into this (I'm at home recovering from a seasonal virus, which ironically puts some time and mental space in my hands).
>
> It makes for an interesting comparison.  The traditional Mersenne Twister algorithm generates N random words in one go, and then iterates over them, before generating another N random words, and so on.  (Word == unsigned integer type.)
>
> By contrast Ilya's implementation just updates the single next entry in the static array that stores the state of the engine.  This would seem to work out cheaper on average: perhaps because the update-a-variate code stays hotter, perhaps because it involves less branching and looping.  It certainly makes for a simpler, easier to follow update.
>
> However, it might be a good idea to benchmark this in a context where there are a lot of random variates being generated, but where other things are also happening that might render the update code less "hot" than a straightforward `foreach` loop benchmark.  (Maybe try it out with a Monte Carlo simulation ...?)
>  The tradeoff between many cheap updates and one expensive one, versus all updates being a little more expensive but less on average, might not be viable in some practical use-cases.
>
> The Phobos implementation also includes a check on whether the generator is properly seeded, which is performed in every `popFront()`, which has a reasonably significant impact.  Ilya's implementation can get away with not performing that check because it uses `@disable this();` to remove the possibility of initializing the generator without it being properly seeded.  The impact of that check can possibly be lessened by making it a boolean comparison rather than the existing `size_t` equality comparison.
>
>   -- action item here: could we deprecate `this()` in Phobos as a precursor to removing
>      the opportunity to instantiate a generator without properly initializing it?
>
> I'm going to try to put together a range-based version to see if this also makes any difference.  I'll post some benchmarks of my own once that's done, and if all looks good I'll try to put a Phobos PR together.
>
> A few questions -- not blockers, but nice to understand:
>
>   * @Ilya, is this implementation your own design, or do you have a reference
>     for the rationale behind this revised implementation of MT?

My own.

>   * Is there a particular reason why the `index` variable is a `Uint` (i.e.
>     the word type of the generator) rather than a `size_t`?  I presume there's
>     a motivation, since the casting back and forth in `opCall` would otherwise
>     seem inconvenient.

If I remember correctly it is used with Using, so I use the same type.

>   * Is there a motivation for the reverse iteration through the RNG state array?

Yes, it composes add/sub with cmp operations.

Ilya

December 13, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Ilya Yaroshenko
in reply to Ilya Yaroshenko

Ilya Yaroshenko

Posted in reply to Ilya Yaroshenko

>
> If I remember correctly it is used with Using, so I use the same type.

Using -> Uint
Sorry, it is phone keyboard

December 14, 2016

Re: [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Posted by Joseph Rushton Wakeling
in reply to Ilya Yaroshenko

Joseph Rushton Wakeling

Posted in reply to Ilya Yaroshenko

On Tuesday, 13 December 2016 at 23:18:26 UTC, Ilya Yaroshenko wrote:
>>   * @Ilya, is this implementation your own design, or do you have a reference
>>     for the rationale behind this revised implementation of MT?
>
> My own.

Congratulations, then -- I think this is a very interesting rewrite of the algorithm and I would encourage you to write up a paper on your choice of optimizations, if that is of interest to you.

>>   * Is there a particular reason why the `index` variable is a `Uint` (i.e.
>>     the word type of the generator) rather than a `size_t`?  I presume there's
>>     a motivation, since the casting back and forth in `opCall` would otherwise
>>     seem inconvenient.
>
> If I remember correctly it is used with Uint, so I use the same type.

Not anywhere that I can find.  `n`, the template parameter from which `this.index` is initialized, is a `size_t`, while the value used inside `opCall` is a `sizediff_t`.

TBH I assumed it was related to the bit-packing of the struct, rather than any question of usage.

>>   * Is there a motivation for the reverse iteration through the RNG state array?
>
> Yes, it composes add/sub with cmp operations.

Cool.  Last question: IIUC you use the private `_z` parameter as a cache for the most recent `data[index]` value (and AFAICT that's the only use it has).  Is there a good reason for doing this, rather than just setting `z = data[index]` inside `opCall` (where `z` is the local parameter inside the method)?

I ask because it doesn't seem to make any practical difference to the benchmarking outcomes, and you have to look up `data[index]` anyway when you write it, so it's not obvious to me that it really saves anything in practice.  (I confess I haven't looked at the assembly, so I might be missing something.)

Thanks for the answers, and apologies for my slow follow-up on some of the things we discussed earlier related to mir.random -- it's been a bit of a busy patch ... :-\

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation