February 27, 2020
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal wrote:
> big snip
TL;DR for the snipped portion: Unsurprisingly, different inputs will lead to different timing results.  The equi-probable values supplied by a standard PRNG differ significantly from an equi-probable-digit input.  In particular, the PRNG input is much more predictable.

Note also that a uint PRNG will mostly produce numbers that need 10 decimal digits.  If my arithmetic is correct, you'll see "9 or better" out of a uint PRNG about 97.67% of the time.  The predictability of the full set of inputs is, secondarily but importantly, also influenced by the ability of the branch predictor to ferret out combinations.

I lack confidence in my ability to reason from first principles when presented with a modern CPU branch predictor and a complex input.  Much easier for me to just run tests against various distributions that I think span the input and reason from the results.

February 27, 2020
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal wrote:
> On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
> I will post my code if there is any meaningful difference in your subsequent results.

Give me something I can compile and verify. I'm not here to steal; if you found something, you can still propose it to the repos that would benefit from the optimization.
February 27, 2020
On Thursday, 27 February 2020 at 17:11:48 UTC, Basile B. wrote:
> On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal wrote:
>> On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
>> I will post my code if there is any meaningful difference in your subsequent results.
>
> Give me something I can compile and verify. I'm not here to steal; if you found something, you can still propose it to the repos that would benefit from the optimization.

I'm not at all concerned about theft of trivial code.  I am concerned that a simple error in my code will live on in a copy/paste environment.

Regardless, I'll post the code once I get home.  It may be the only way to communicate the central problem as I see it: imprecision in the test specification (the input specification).

February 27, 2020
On Thursday, 27 February 2020 at 17:17:32 UTC, Bruce Carneal wrote:
> On Thursday, 27 February 2020 at 17:11:48 UTC, Basile B. wrote:
>> On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal wrote:
>>> On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
>>> I will post my code if there is any meaningful difference in your subsequent results.
>>
>> Give me something I can compile and verify. I'm not here to steal; if you found something, you can still propose it to the repos that would benefit from the optimization.
>
> I'm not at all concerned about theft of trivial code.  I am concerned that a simple error in my code will live on in a copy/paste environment.
>
> Regardless, I'll post the code once I get home.  It may be the only way to communicate the central problem as I see it: imprecision in the test specification (the input specification).

Yes please, post the benchmark method. You see, the benchmarks I run with your version are always the slowest. I'm aware that rndGen (and generally any uniform random function) is subject to a bias, but I don't think this bias matters much in the case we're talking about.
February 27, 2020
On Thursday, 27 February 2020 at 19:46:23 UTC, Basile B. wrote:
> Yes please, post the benchmark method. You see, the benchmarks I run with your version are always the slowest. I'm aware that rndGen (and generally any uniform random function) is subject to a bias, but I don't think this bias matters much in the case we're talking about.

The code below is the test jig that I'm using currently.  It is adapted from yours but adds a -d=distribution command line option.

I have omitted the function bodies for the entries in the funcs[] table.  You can cut and paste there at will but I wanted to spare any other readers the bulk.

The timings I get from the below are entirely consistent with my earlier reporting and, I believe, with yours.

I urge you to investigate the difference in timings between the original, and currently defaulted, distribution and the -d=edig distribution.  If you don't understand what's going on there, or if you still don't believe input distributions matter, please get in touch.  After all, it's possible that I'm the one with the misunderstanding.

Have fun!

<code>
import core.memory : GC;
import std.getopt : config, getopt;
import std.random : randomShuffle, rndGen;
import std.stdio : writeln;

const funcs = [
    &decimalLength9_0, &decimalLength9_1, &decimalLength9_2,
    &decimalLength9_3, &decimalLength9_4, &decimalLength9_5
];

ulong runOriginalFuncCount(ubyte funcIdx, ulong count)
{
    GC.disable;

    uint sum;
    foreach (const ulong i; 0 .. count)
    {
        sum += funcs[funcIdx](rndGen.front());
        rndGen.popFront();
    }
    return sum;
}

ulong runFuncCountDist(ubyte funcIdx, ulong count, string dist)
{
    enum perDigit = 1000;
    // NOTE: this is a gross distortion given uint.max but it is "the spec"
    enum maxDigits = 9;
    uint[perDigit * maxDigits] samples = void;
    const chunks = count / samples.length;
    const leftOvers = count % samples.length;

    if (dist == "prng")
    {
        foreach (ref val; samples[])
        {
            val = rndGen.front();
            rndGen.popFront();
        }
    }
    else if (dist == "edig" || dist == "edigsorted")
    {
        // for reference, this is the "6 LOC IIRC"
        uint value = 1;
        foreach (dig; 0 .. maxDigits)
        {
            samples[dig * perDigit .. (dig + 1) * perDigit] = value;
            value *= 10;
        }

        if (dist != "edigsorted")
            randomShuffle(samples[]);
    }
    else
    {
        assert(0, "dist option should be in {oprng, prng, edig, edigsorted}");
    }
    const func = funcs[funcIdx];

    if (count < 1) // late check gives us a handle on setup time
        return 0;
    // OK, enough with the setup, time to roll
    ulong sum = 0;
    foreach (chunk; 0 .. chunks)
    {
        foreach (sample; samples[])
            sum += func(sample);
    }
    foreach (sample; samples[$ - leftOvers .. $])
        sum += func(sample);
    return sum;
}

// This is a timing test jig for functions that accept a uint
// and return the number of decimal digits needed to represent
// that uint (except that 10 digit values are lumped in with
// 9 digit values currently).
//
// The command line options here are:
//   -c=count (the number of values to present to the function selected)
//   -s=seed (the value supplied to rndGen.seed)
//   -f=funcIndex (identifies the function under test, see source funcs[])
//   -d=distribution (one of: oprng, prng, edig, edigsorted)
//
// Distributions explained:
//   oprng: original prng distribution, a rndGen.popFront per function call
//   prng:  prng distribution, also rndGen but coming from a large, pre-filled, array
//   edig:  same number of 1-digit, 2-digit, ... numbers, shuffled
//   edigsorted: edig values but sorted to highlight the branch predictor impact
//
// This test jig could evolve in at least six ways:
//  1) the full digit range could (arguably should) be accounted for
//  2) additional distributions, like "favor small numbers", could be added
//  3) additional implementation ideas can be explored
//  4) the input range could be made variable
//  5) the number base could be templatized
//  6) inlined performance could be tested by aliasing Func (runtime switch to template)
//
//  Final note: the main benefit of working on this problem is better understanding, not
//  shaving nanoseconds on this particular function.

void main(string[] args)
{

    ulong count;
    uint seed;
    ubyte func;
    // NOTE: oprng writeln output will usually not equal that of -d=prng
    enum defaultDistribution = "oprng";
    string dist = defaultDistribution;

    GC.disable;
    getopt(args, config.passThrough, "c|count", &count, "f|func", &func,
            "s|seed", &seed, "d|distribution", &dist);
    rndGen.seed(seed);
    const original = dist == "oprng";
    const sum = original ? runOriginalFuncCount(func, count) : runFuncCountDist(func, count, dist);
    writeln(sum);
}

</code>

February 28, 2020
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>
> [...]

bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get code with inlining.
February 28, 2020
On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>>
>> [...]
>
> bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get code with inlining.

That's surprising.  I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native. (not sure what the target CPU is)

Under what conditions should I be guarding against an inlining failure?


February 28, 2020
On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:
> On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
>> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>>> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>>>
>>> [...]
>>
>> bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get code with inlining.
>
> That's surprising.  I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native. (not sure what the target CPU is)
>
> Under what conditions should I be guarding against an inlining failure?

Here's the code I used:

int main(string[] args)
{
    import core.bitop : bsr;
    return bsr(cast(uint)args.length);
}

BTW, I'm a huge fan of your performance work.


February 28, 2020
On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:
> On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
>> bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get code with inlining.
>
> That's surprising.  I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native.

These tiny core.bitop wrappers around LLVM intrinsics are always inlined (`pragma(inline, true)`), i.e., you don't even need -O.


February 28, 2020
On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:
> On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
>> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>>> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>>>
>>> [...]
>>
>> bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get code with inlining.
>
> That's surprising.  I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native. (not sure what the target CPU is)

Ah, my bad. It fails to inline with LDC <= 1.14:
https://d.godbolt.org/z/iz9p-6

> Under what conditions should I be guarding against an inlining failure?

Mark it with `pragma(inline, true)`. LDC also has cross-module inlining for non-templated functions.