April 13, 2010
Steven Schveighoffer wrote:
> Assuming you may ask questions about this, it's not an exact description of what happens.  The append code and GC are smarter about it than this, I just wanted to explain it in an easy-to-understand way :)  The real code uses algorithms to determine optimal grow sizes and can extend into consecutive blocks if possible.

I realised that while running some of the code that bearophile suggested to me, because if you look at 'capacity' in a non-pre-reserved D array that's appended to 5 million times, its capacity is only a little over 5 million -- compared to a C++ vector which is the nearest power of 2 (2^23, over 8 million).

The GC overhead means that in this case actual memory usage is a bit higher than C++, but on a larger scale this could surely make a big difference.

In the case of the piece of code I'm working on I don't think the pre-reserve is really so important as far as performance goes -- the append speed is a bit of a bugger, but the main issue is probably to do with speed of copying and iterating across arrays.

For example, I suspect that the D array's,

        x[] = 0;
        y[] = z[];

is not as efficient as a C++ vector's

        x.assign(x.size(),0);
        y.assign(z.begin(),z.end());

I think I can find some ways round this by taking advantage of some of the D arrays' cleverness, limiting the copying of values and instead just swapping around which memory each array is pointing to.

As for iteration, I don't know to what degree D's foreach() across
arrays compares to for() commands in C++ -- I guess it should be pretty
close in performance, no?

Best wishes,

    -- Joe
April 13, 2010
On Tue, 13 Apr 2010 11:15:15 -0400, Joseph Wakeling <joseph.wakeling@webdrake.net> wrote:

> As for iteration, I don't know to what degree D's foreach() across
> arrays compares to for() commands in C++ -- I guess it should be pretty
> close in performance, no?

D special cases foreach on arrays to basically be a standard array for loop, i.e.:

foreach(x; arr)
{
   ...
}

translates to something like:

auto __len = arr.length;
for(uint __i = 0; __i < __len; ++__i)
{
   auto x = arr[__i];
   ...
}

Although it may be even further optimized, I'm not sure.

-Steve
April 13, 2010
Joseph Wakeling wrote:
> Steven Schveighoffer wrote:
>> Assuming you may ask questions about this, it's not an exact description
>> of what happens.  The append code and GC are smarter about it than this,
>> I just wanted to explain it in an easy-to-understand way :)  The real
>> code uses algorithms to determine optimal grow sizes and can extend into
>> consecutive blocks if possible.
> 
> I realised that while running some of the code that bearophile suggested
> to me, because if you look at 'capacity' in a non-pre-reserved D array
> that's appended to 5 million times, its capacity is only a little over 5
> million -- compared to a C++ vector which is the nearest power of 2
> (2^23, over 8 million).
> 
> The GC overhead means that in this case actual memory usage is a bit
> higher than C++, but on a larger scale this could surely make a big
> difference.
> 
> In the case of the piece of code I'm working on I don't think the
> pre-reserve is really so important as far as performance goes -- the
> append speed is a bit of a bugger, but the main issue is probably to do
> with speed of copying and iterating across arrays.
> 
> For example, I suspect that the D array's,
> 
>         x[] = 0;
>         y[] = z[];
> 
> is not as efficient as a C++ vector's
> 
>         x.assign(x.size(),0);
>         y.assign(z.begin(),z.end());

The D code compiles directly to a memset and memcpy, which are intrinsics. There's no way C++ could beat it.
April 13, 2010
Don wrote:
> The D code compiles directly to a memset and memcpy, which are intrinsics. There's no way C++ could beat it.

OK.  I'm just bemused about why -- having solved the memory leak -- my D code is running significantly slower than equivalent C++ code.

For now I'm just playing with D by doing a rewrite of some C++ code I did for a project a while back -- to see if I can write a neater version of the code in the new language, and learn something about D in the process.  I may have fallen into the trap you described earlier, of just 'translating' C++ style code directly into D, but I don't think that's the issue here.

There are a couple of algorithmic inefficiencies in there with deliberate reason, but removing them for testing purposes doesn't improve matters.  The array append is slow, but it's not the major part of the running time of the program.

For the sake of clarity, I've attached the complete code.  The really important part to look at is the file yzlm.d in the opCall() function and class functions objectReputationUpdate() and userReputationUpdate(). This is a reputation-generating process which takes a set of ratings by users of objects, and uses them to iteratively recalculate user reputation and re-aggregate the ratings taking this into account.

I don't see anything particularly bad about the way these algorithms are written but maybe they are imperfectly D-ish ... :-)

Best wishes,

    -- Joe


April 14, 2010
Joseph Wakeling:
> OK.  I'm just bemused about why -- having solved the memory leak -- my D code is running significantly slower than equivalent C++ code.

I'll try to take a look at your D code. In the meantime you can show the equivalent C++ code too, so I can compare.
Note that dmd has an old backend, so it doesn't optimize as well as g++ or as llvm. LDC is often able to optimize better.

Bye,
bearophile
April 14, 2010
So far I've just given a light reading of the code. Notes:
- 1.0L in D is a real, not a double, it's 10-12-16 bytes long.
- Mt19937 is a Mersenne twister, keep in mind that it's really good but slow. If you need less quality, use a faster generator.
- Compile your code with -w, so it says you that you have missed an "override".
- ReputationAlgorithm can be an abstract class, I think. So you can write it as:
abstract class ReputationAlgorithm {
    this() {}
    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject);
}

- Both Yzlm and AvgArithmetic classes can be marked as final, and in the other classes there are other methods can be marked as final. As in Java in D class methods are virtual unless you mark them as final, but unlike the HotSpot of Java as far as I know no D compiler is currently able to inline virtual methods.
- Dmd has both a built-ihn profiler and code coverage, use them to find hot spots.
- Put a space after commas, generally.
- pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you have to import std.math anyway, because of a bug).
- You can learn to use a little of contract programming, and move in a precondition the asserts, for example of the Yzlm constructor. Then you can add class invariants too, if you want.
- Switching the contents of dynamic arrays do work (it doesn't work well with stack-allocated arrays, the contents get copied).

I'll try to add more comments later.
Bye,
bearophile
April 14, 2010
This is a little cleaned up version of the result of the profiling (no inling, final classes, -O -release):

   Func
   Time

7586124  std.random __uniform std.random __MersenneTwisterEngine
4040384  main
2408110  performance userReputationUpdate performance Rating
2299561  std.math __nextafter nextafter
2176746  std.contracts __enforce
2051276  std.conv __to
2006079  performance.objectReputationUpdate performance Rating
1898177  std.array __Appender performance Rating Appender __put performance Rating put performance Rating

It's a little disaster. As I have thought the Mersenne twister is taking up lot of time. And then the nextafter, the enforce, the to!, the appender...

Bye,
bearophile
April 14, 2010
This is a single module, and on my PC is more than two times faster than the original code. Now I think there are no naive ways left to speed it up.

Compile it with:
dmd -O -release -inline

The low performance was caused by:
- virtual methods that don't need to be virtual;
- ^^0.5 and pow(x,2) doesn't perfectly efficient;
- the two nested loops in the main are more efficient as ref double, this is something dmd will need to fix;
- Mersenne twister is slow. I have used a Kiss rnd generator here, adapted from Tango.
- the uniform of std.random is bad, it contains nextafter that's slow, it can be avoided using a closed range: "[]" instead of the default one "[)".
- Somewhere in the middle of the random generator there was an enforce. I am now not so sure it's good to use a lot of enforce() in the std lib of a system language. Maybe Andrei will need to go back using asserts inside contracts.
- The temporary struct inside the two main loops didn't get optimized away
- the append was messy and slow, I have removed the appends and replaced with simpler and tidier code with an index, that's quite faster.

Now the profiling result is clean:
16394266 main
 8079735 2performance510FastRandom7uniformMFddZd
 2841876 2performance510FastRandom8randUintMFZk
 1923377 2performance54Yzlm20userReputationUpdateMFKAS12performance56RatingKAdKAdZv
 1729333 2performance54Yzlm22objectReputationUpdateMFKAS12performance56RatingKAdKAdZd

Most of the time is spent in the main and generating rnd values, but the timings are clean.

The resulting code can be slower than the C++ code, but now it's mostly a matter of backend. If I port this code to D1 for Tango and I compile it very well with LDC the performance can be about as good as the C++ version or better :-) If necessary I can use an even faster random number generator, but that's for tomorrow, if you want.


import std.stdio: printf;
import std.math: sqrt, pow;

struct FastRandom {
    uint kiss_x = 1;
    uint kiss_y = 2;
    uint kiss_z = 4;
    uint kiss_w = 8;
    uint kiss_carry = 0;
    uint kiss_k, kiss_m;

    this(uint seed) {
        this.seed(seed);
    }

    void seed(uint seed) {
        kiss_x = seed | 1;
        kiss_y = seed | 2;
        kiss_z = seed | 4;
        kiss_w = seed | 8;
        kiss_carry = 0;
    }

    uint randUint() {
        kiss_x = kiss_x * 69069 + 1;
        kiss_y ^= kiss_y << 13;
        kiss_y ^= kiss_y >> 17;
        kiss_y ^= kiss_y << 5;
        kiss_k = (kiss_z >> 2) + (kiss_w >> 3) + (kiss_carry >> 2);
        kiss_m = kiss_w + kiss_w + kiss_z + kiss_carry;
        kiss_z = kiss_w;
        kiss_w = kiss_m;
        kiss_carry = kiss_k >> 30;
        return kiss_x + kiss_y + kiss_w;
    }

    double random() {
        return this.randUint() / (uint.max + 1.0);
    }

    double uniform(double a, double b) {
        double r = cast(double)this.randUint() / (uint.max + 1.0);
        return a + (b - a) * r;
    }
}


struct Rating {
    uint user, object;
    double value;
}


abstract class ReputationAlgorithm {
    this() {}
}


final class Yzlm : ReputationAlgorithm {
    double beta;
    double convergenceRequirement;
    double errorMin;
//    uint[] userLinks; // commented out because for now it
                        // has a constant value for all users
    double[] weightSum;
    double[] oldReputationObject;

    double objectReputationUpdate(ref Rating[] ratings,
                                  ref double[] reputationUser,
                                  ref double[] reputationObject) {
        double diff = 0;
        double[] temp = oldReputationObject;

        // Original version had:
        //
        //    oldReputationObject[] = reputationObject[]
        //
        // This version is an attempt to save effort
        // by just switching round the memory the two
        // arrays are pointing at -- not sure if it
        // actually does what I'm expecting it to.
        // Doesn't seem to improve speed. :-(
        oldReputationObject = reputationObject;
        reputationObject = temp;
        reputationObject[] = 0;

        weightSum[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user] * r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint object, ref double r; reputationObject) {
            r /= (weightSum[object] > 0) ? weightSum[object] : 1;
            auto aux = (r - oldReputationObject[object]);
            diff += aux * aux;
        }

        return sqrt(diff);
    }

    void userReputationUpdate(ref Rating[] ratings,
                              ref double[] reputationUser,
                              ref double[] reputationObject) {
        reputationUser[] = 0;

        foreach (Rating r; ratings) {
            auto aux = (r.value - reputationObject[r.object]);
            reputationUser[r.user] += aux * aux;
        }

        foreach (uint user, ref double r; reputationUser) {
            //if(userLinks[user]>0)
                r = pow( (r/reputationObject.length/*userLinks[user]*/) + errorMin, -beta);
        }
    }

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
    //    userLinks.length = reputationUser.length;
    //    userLinks[] = 0;

        weightSum.length = reputationObject.length;
        oldReputationObject.length = reputationObject.length;

    //    foreach (Rating r; ratings)
    //        userLinks[r.user]++;

        double diff;
        uint iterations = 0;
        do {
            userReputationUpdate(ratings, reputationUser, reputationObject);
            diff = objectReputationUpdate(ratings, reputationUser, reputationObject);
            ++iterations;
        } while (diff > convergenceRequirement);
        printf("Exited in %u iterations with diff = %g < %g\n",
               iterations, diff, convergenceRequirement);
    }

    this() {}

    this(double b, double c, double e) {
        beta = b;
        convergenceRequirement = c;
        errorMin = e;
        assert(beta >= 0);
        assert(convergenceRequirement > 0);
        assert(errorMin >= 0);
    }

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject,
         double b, double c, double e) {
        this(b,c,e);

        opCall(ratings, reputationUser, reputationObject);
    }
}


class AvgWeighted : ReputationAlgorithm {
    double[] weightSum;

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
        weightSum.length = reputationObject.length;
        weightSum[] = 0;

        reputationObject[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user]*r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint o, ref double r; reputationObject)
            r /= weightSum[o];
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


final class AvgArithmetic : AvgWeighted {
    override void opCall(ref Rating[] ratings,
            ref double[] reputationUser,
            ref double[] reputationObject) {
        reputationUser[] = 1;
        super.opCall(ratings, reputationUser, reputationObject);
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


void main() {
    Rating[] ratings;
    double[] reputationObject, reputationUser;
    double[] objectQuality, userError;
    auto aa = new AvgArithmetic;
    auto yzlm = new Yzlm(0.8, 1e-12, 1e-36);

    auto rnd = FastRandom(1001);

    reputationObject.length = 1_000;
    reputationUser.length = 1_000;

    objectQuality.length = reputationObject.length;
    userError.length = reputationUser.length;

    ratings.length = reputationObject.length * reputationUser.length;

    foreach (uint i; 0 .. 100) {
        foreach (ref double Q; objectQuality)
            Q = rnd.uniform(0.0, 10.0);

        foreach (ref double sigma2; userError)
            sigma2 = rnd.random();

        int pos;
        foreach (uint object, ref double Q; objectQuality)
            foreach (uint user, ref double sigma2; userError)
                ratings[pos++] = Rating(user, object, rnd.uniform(Q - sigma2, Q + sigma2));

        printf("We now have %u ratings.\n", ratings.length);

        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);

        double deltaQ = 0;
        foreach (uint object, double r; reputationObject) {
            auto aux = (r - objectQuality[object]);
            deltaQ += aux * aux;
        }
        deltaQ = sqrt(deltaQ / reputationObject.length);
        printf("[%u] Error in quality estimate: %g\n", i, deltaQ);
    }
}


Bye,
bearophile
April 14, 2010
This compiles for D1. With LDC it runs about two times faster still, about 4.5-5 times faster than the original version. We can compare this to the C++ version of yours.

ldc -O5 -release -inline -enable-unsafe-fp-math -unroll-allow-partial test.d

I have also tried the faster rnd generator (R250-521), but in this program it's not faster than Kiss, so I've kept Kiss. Note that Tango that you can use with LDC has Kiss already.


version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.math.Math: sqrt, pow;
} else {
    import std.stdio: printf;
    import std.math: sqrt, pow;
}

struct FastRandom {
    uint kiss_x = 1;
    uint kiss_y = 2;
    uint kiss_z = 4;
    uint kiss_w = 8;
    uint kiss_carry = 0;
    uint kiss_k, kiss_m;

    void seed(uint seed) {
        kiss_x = seed | 1;
        kiss_y = seed | 2;
        kiss_z = seed | 4;
        kiss_w = seed | 8;
        kiss_carry = 0;
    }

    uint randUint() {
        kiss_x = kiss_x * 69069 + 1;
        kiss_y ^= kiss_y << 13;
        kiss_y ^= kiss_y >> 17;
        kiss_y ^= kiss_y << 5;
        kiss_k = (kiss_z >> 2) + (kiss_w >> 3) + (kiss_carry >> 2);
        kiss_m = kiss_w + kiss_w + kiss_z + kiss_carry;
        kiss_z = kiss_w;
        kiss_w = kiss_m;
        kiss_carry = kiss_k >> 30;
        return kiss_x + kiss_y + kiss_w;
    }

    double random() {
        return this.randUint() / (uint.max + 1.0);
    }

    double uniform(double a, double b) {
        double r = cast(double)this.randUint() / (uint.max + 1.0);
        return a + (b - a) * r;
    }
}


struct Rating {
    uint user, object;
    double value;
}


abstract class ReputationAlgorithm {
    this() {}
}


final class Yzlm : ReputationAlgorithm {
    double beta;
    double convergenceRequirement;
    double errorMin;
//    uint[] userLinks; // commented out because for now it
                        // has a constant value for all users
    double[] weightSum;
    double[] oldReputationObject;

    double objectReputationUpdate(ref Rating[] ratings,
                                  ref double[] reputationUser,
                                  ref double[] reputationObject) {
        double diff = 0;
        double[] temp = oldReputationObject;

        // Original version had:
        //
        //    oldReputationObject[] = reputationObject[]
        //
        // This version is an attempt to save effort
        // by just switching round the memory the two
        // arrays are pointing at -- not sure if it
        // actually does what I'm expecting it to.
        // Doesn't seem to improve speed. :-(
        oldReputationObject = reputationObject;
        reputationObject = temp;
        reputationObject[] = 0;

        weightSum[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user] * r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint object, ref double r; reputationObject) {
            r /= (weightSum[object] > 0) ? weightSum[object] : 1;
            auto aux = (r - oldReputationObject[object]);
            diff += aux * aux;
        }

        return sqrt(diff);
    }

    void userReputationUpdate(ref Rating[] ratings,
                              ref double[] reputationUser,
                              ref double[] reputationObject) {
        reputationUser[] = 0;

        foreach (Rating r; ratings) {
            auto aux = (r.value - reputationObject[r.object]);
            reputationUser[r.user] += aux * aux;
        }

        foreach (uint user, ref double r; reputationUser) {
            //if(userLinks[user]>0)
                r = pow( (r/reputationObject.length/*userLinks[user]*/) + errorMin, -beta);
        }
    }

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
    //    userLinks.length = reputationUser.length;
    //    userLinks[] = 0;

        weightSum.length = reputationObject.length;
        oldReputationObject.length = reputationObject.length;

    //    foreach (Rating r; ratings)
    //        userLinks[r.user]++;

        double diff;
        uint iterations = 0;
        do {
            userReputationUpdate(ratings, reputationUser, reputationObject);
            diff = objectReputationUpdate(ratings, reputationUser, reputationObject);
            ++iterations;
        } while (diff > convergenceRequirement);
        printf("Exited in %u iterations with diff = %g < %g\n",
               iterations, diff, convergenceRequirement);
    }

    this() {}

    this(double b, double c, double e) {
        beta = b;
        convergenceRequirement = c;
        errorMin = e;
        assert(beta >= 0);
        assert(convergenceRequirement > 0);
        assert(errorMin >= 0);
    }

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject,
         double b, double c, double e) {
        this(b,c,e);

        opCall(ratings, reputationUser, reputationObject);
    }
}


class AvgWeighted : ReputationAlgorithm {
    double[] weightSum;

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
        weightSum.length = reputationObject.length;
        weightSum[] = 0;

        reputationObject[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user]*r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint o, ref double r; reputationObject)
            r /= weightSum[o];
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


final class AvgArithmetic : AvgWeighted {
    override void opCall(ref Rating[] ratings,
            ref double[] reputationUser,
            ref double[] reputationObject) {
        reputationUser[] = 1;
        super.opCall(ratings, reputationUser, reputationObject);
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


void main() {
    Rating[] ratings;
    double[] reputationObject, reputationUser;
    double[] objectQuality, userError;
    auto aa = new AvgArithmetic;
    auto yzlm = new Yzlm(0.8, 1e-12, 1e-36);

    FastRandom rnd;
    rnd.seed(1001);

    reputationObject.length = 1_000;
    reputationUser.length = 1_000;

    objectQuality.length = reputationObject.length;
    userError.length = reputationUser.length;

    ratings.length = reputationObject.length * reputationUser.length;

    for (uint i; i < 4; i++) { // 100  4  ***************
        foreach (ref double Q; objectQuality)
            Q = rnd.uniform(0.0, 10.0);

        foreach (ref double sigma2; userError)
            sigma2 = rnd.random();

        int pos;
        foreach (uint object, ref double Q; objectQuality)
            foreach (uint user, ref double sigma2; userError)
                ratings[pos++] = Rating(user, object, rnd.uniform(Q - sigma2, Q + sigma2));

        printf("We now have %u ratings.\n", ratings.length);

        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);

        double deltaQ = 0;
        foreach (uint object, double r; reputationObject) {
            auto aux = (r - objectQuality[object]);
            deltaQ += aux * aux;
        }
        deltaQ = sqrt(deltaQ / reputationObject.length);
        printf("[%u] Error in quality estimate: %g\n", i, deltaQ);
    }
}

Bye,
bearophile
April 14, 2010
bearophile wrote:
> So far I've just given a light reading of the code. Notes:

> - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Note that x^^2 doesn't require import std.math. (The fact that x ^^ 5 requires import std.math is a temporary limitation).