April 14, 2010
On 14/04/10 20:54, Don wrote:
> I have a vague recollection that correctly-rounded pow() will require
> bigint (can't quite remember, though). I'm also concerned about build
> tools -- I don't want them to have to know about the dependency.
> As a bare minimum, the error message will need to improve (with some
> explanation of why std.math is required, to reduce the WTF factor).
> But in any case, it's a very minor issue.
> Personally, I'm finding that having ^^ as standard nomenclature is
> extremely useful, even apart from the use in code.

I've always disliked the requirement to import std.math, but only because of its lack of flexibility... Surely it would be better to require a pow() function to be defined somewhere; this way, if/when Tango is ported to D2, it doesn't have to hack around this to allow users to use ^^.
April 14, 2010
Robert Clipsham Wrote:

> this way, if/when Tango is ported to D2, it doesn't have to hack around this to allow users to use ^^.

I hope Tango2 will be designed to be installed beside Phobos2, and not in place of it.

Bye,
bearophile
April 15, 2010
On Wed, 14 Apr 2010 05:58:55 -0400, bearophile <bearophileHUGS@lycos.com> wrote:
> 
> Don:
> 
> >Raising to a float power is really a niche feature.<
> 
> Using x^^0.5 to compute a square root is not a niche feature; square roots are common enough. And sqrt is an intrinsic, it doesn't need all of std.math.

I think I've only ever used a square root once, and that was to get the distance between two points. I'm not sure that square roots are common enough in general code to justify an implicit import (especially if we think of D in its context as a systems programming language).
April 15, 2010
Hi Bearophile,

Thanks ever so much for the amount of effort you put into this -- it is extremely kind and generous. :-)

I'm not seeing the significant gains that you see with DMD, but for me, too, LDC doubles the speed.

My guess for the difference is that you ran fewer than the full 100 iterations of the main for loop.  As you can see in the code, the Yzlm class launches an iterative procedure which takes more or fewer iterations to converge depending on initial conditions.

The number of iterations is _normally_ relatively small, but can in some circumstances be large (several hundred).

An effect of changing the random number generator is that it changes the initial conditions presented to the algorithm.  For example, in the first 10 runs of the loop, the original code performs 404 iterations overall and your new code performs only 348.

Combine that with other savings like the removal of the appender and the faster random number generation and you get a fairly sizeable saving -- about a 25-30% cut in running time when compiling with dmd.  But those savings become less of an influence on running time as you increase the number of simulations.

The _real_ measure of speed is thus not overall program running time but running time relative to the total number of iterations.  The good news is that while the LDC-compiled version doesn't beat g++, it comes pretty close, with about 26 iterations per second for the D code compared to 32 for the C++. :-)

The C++ code can probably be optimised a bit further, but not by much; the algorithms in the iterative process are pretty much identical between the two versions.  (The C++ doesn't have the 'clever' memory swap-around that I put in the D code, but that doesn't make any difference to D's performance anyway...)  So I think it's probably just compiler differences that are to blame for the speed differences -- which is fine.  After all, D in general and DMD in particular are still in development.

I _am_ surprised that in your profiling the random number generator took up so much time, because if you cut out the lines,

        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);

... and recompile, the program screams through in no time at all -- which would lead me to think that it's the yzlm opCall and the functions it calls that take up the majority of time.  Or is there some lazy evaluation going on here ... ?

I don't think it's the algorithm in any case (although it might be Phobos' implementation), as when working with Tango and LDC, changing between the Twister and Kiss random number generators doesn't make for any meaningful performance difference.

One last thing.  While comparing DMD and LDC I noticed something odd. Take this code, which is an LDC- and DMD-compatible version of the 'array leak' code I posted originally:

/////////////////////////////////////////////////////////////////////////
version(Tango) {
    import tango.stdc.stdio: printf;
} else {
    import std.stdio: printf;
}

void main()
{
    double[] x;

    for(uint i=0;i<100;++i) {
        x.length = 0;

        for(uint j=0;j<5_000;++j) {
            for(uint k=0;k<1_000;++k) {
                x ~= j*k;
            }
        }

        printf("At iteration %u, x has %u elements.\n",i,x.length);
    }
}
/////////////////////////////////////////////////////////////////////////

I noticed that when compiled with LDC the memory usage stays constant, but in DMD the memory use blows up.  Is this a D1/D2 difference (with D2 having the more advanced features that require preservation of memory) or is it a bug in one or the other of the compilers ... ?

Thanks & best wishes,

    -- Joe

April 15, 2010
bearophile wrote:
> Robert Clipsham Wrote:
> 
>> this way, if/when Tango is ported to D2, it doesn't have to hack around this to allow users to use ^^.
> 
> I hope Tango2 will be designed to be installed beside Phobos2, and not in place of it.

I thought that was the aim ... ?  At least according to D's Wikipedia page: http://en.wikipedia.org/wiki/D_%28programming_language%29#Division_concerning_the_standard_library
April 15, 2010
On Wed, 14 Apr 2010 17:19:53 -0400, Joseph Wakeling <joseph.wakeling@webdrake.net> wrote:


> /////////////////////////////////////////////////////////////////////////
> version(Tango) {
>     import tango.stdc.stdio: printf;
> } else {
>     import std.stdio: printf;
> }
>
> void main()
> {
>     double[] x;
>
>     for(uint i=0;i<100;++i) {
>         x.length = 0;
>
>         for(uint j=0;j<5_000;++j) {
>             for(uint k=0;k<1_000;++k) {
>                 x ~= j*k;
>             }
>         }
>
>         printf("At iteration %u, x has %u elements.\n",i,x.length);
>     }
> }
> /////////////////////////////////////////////////////////////////////////
>
> I noticed that when compiled with LDC the memory usage stays constant,
> but in DMD the memory use blows up.  Is this a D1/D2 difference (with D2
> having the more advanced features that require preservation of memory)
> or is it a bug in one or the other of the compilers ... ?

It is because D2 is more advanced in preserving memory.  In D1, setting the length to zero also tells the runtime that it is safe to append over the old data, so the memory block gets reused.

One thing I forgot about: there is currently a design flaw in D2 array appending which keeps around a certain number of arrays even after all references to them have been removed (i.e. the arrays should be collectable, but they are not collected by the GC).  I have a clear path on how to fix that, but I haven't done it yet.

See this bugzilla report: http://d.puremagic.com/issues/show_bug.cgi?id=3929

This might also help explain why the memory blows up.
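
In the meantime, D2 code that genuinely holds no other slices into the array can opt back into the D1-style reuse explicitly.  Here's a minimal sketch of your test, assuming the assumeSafeAppend() primitive from the recent druntime array changes:

void main()
{
    double[] x;

    for(uint i=0;i<100;++i) {
        x.length = 0;
        // Promise the runtime that no other slice references the old
        // data, so the appends below may reuse the existing memory block.
        x.assumeSafeAppend();

        for(uint j=0;j<5_000;++j) {
            for(uint k=0;k<1_000;++k) {
                x ~= j*k;
            }
        }
    }
}

With that call in place, memory usage should stay roughly constant under DMD as well.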

-Steve
April 15, 2010
Joseph Wakeling:

>My guess for the difference is that you ran fewer than the full 100 iterations of the main for loop.<

You are right, sorry.


>So I think it's probably just compiler differences that are to blame for the speed differences<

This can be true, but such differences are not magic: they can be tracked down, and eventually they can even become precise enhancement requests for the LLVM developers, who are open to such requests.


>After all, D in general and DMD in particular are still in development.<

DMD has a quite old back-end that is not in active development; maybe it will eventually become 64-bit.


>I _am_ surprised that in your profiling the random number generator took up so much time, because if you cut out the lines,
        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);
... and recompile, the program screams through in no time at all -- which would lead me to think that it's the yzlm opCall and the functions it calls that take up the majority of time.  Or is there some lazy evaluation going on here ... ?<

I've done a few more tests, and I've seen that my profiling in DMD was wrong: I had disabled inlining (to get a less aggregated result), but this produced misleading results, because in the true optimized run the inlining removes most of the overhead of the Kiss generator.  So the true profile is now like you say:

    Time  Func

17703021  Yzlm.objectReputationUpdate
12709135  Yzlm.userReputationUpdate
 1385194  main


In this program I have seen that a good percentage of the speed difference between LDC and DMD is caused by foreach loops. When you iterate over structs larger than a size_t, use ref (or better, in D2, ref const(...)):
foreach (Rating r; ratings)
==>
foreach (ref const(Rating) r; ratings)
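
As a self-contained illustration (the exact layout of Rating is an assumption here, chosen to match the 16-byte stride visible in the asm below):

struct Rating { uint user; uint object; double value; }

double sumValues(const(Rating)[] ratings) {
    double total = 0.0;
    // By value, each iteration copies a 16-byte struct; with ref const
    // the loop reads through a reference and copies nothing.
    foreach (ref const(Rating) r; ratings)
        total += r.value;
    return total;
}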

There is nothing algorithmically strange going on here, but to be sure you can take a look at the produced asm.


These are the middle parts (the two loops) of Yzlm.objectReputationUpdate compiled with LDC; this is one of the two main hot spots of the program. You can compare this asm with the asm of the same part of the C++ version:

...
.LBB6_2:
    movl    (%eax), %edx
    movl    52(%esp), %edi
    movl    4(%edi), %ebx
    movsd   (%ebx,%edx,8), %xmm0
    mulsd   8(%eax), %xmm0
    movl    4(%eax), %edx
    movl    48(%esp), %ebx
    movl    4(%ebx), %ebx
    addsd   (%ebx,%edx,8), %xmm0
    movsd   %xmm0, (%ebx,%edx,8)
    movl    4(%eax), %edx
    movl    36(%esi), %ebx
    movsd   (%ebx,%edx,8), %xmm0
    movl    (%eax), %ebp
    movl    4(%edi), %edi
    addsd   (%edi,%ebp,8), %xmm0
    movsd   %xmm0, (%ebx,%edx,8)
    addl    $16, %eax
    decl    %ecx
    jne .LBB6_2
.LBB6_3:
    movl    48(%esp), %eax
    movl    (%eax), %eax
    testl   %eax, %eax
    je  .LBB6_9
    movl    48(%esp), %ecx
    movl    4(%ecx), %ecx
    pxor    %xmm0, %xmm0
    xorl    %edx, %edx
    pxor    %xmm1, %xmm1
    .align  16
.LBB6_5:
    movl    36(%esi), %edi
    movsd   (%edi,%edx,8), %xmm2
    ucomisd %xmm1, %xmm2
    ja  .LBB6_7
    movsd   .LCPI6_0, %xmm2
.LBB6_7:
    movsd   (%ecx,%edx,8), %xmm3
    divsd   %xmm2, %xmm3
    movsd   %xmm3, (%ecx,%edx,8)
    movl    44(%esi), %edi
    subsd   (%edi,%edx,8), %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm3, %xmm0
    incl    %edx
    cmpl    %eax, %edx
    jne .LBB6_5
...



I've also seen that the program gets a little faster with some unrolling, done like this:

        const int UNROLL = 5;
        assert(!(ratings.length % UNROLL));
        assert(!(reputationUser.length % UNROLL));

        for (size_t i = 0; i < ratings.length; i += UNROLL) {
            foreach (k; Range!(UNROLL)) {
                auto r = &(ratings[i + k]);
                auto aux = (r.value - reputationObject[r.object]);
                reputationUser[r.user] += aux * aux;
            }
        }


Where:

template Tuple(T...) {
    alias T Tuple;
}

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}


But to make it work with arrays of arbitrary length you need a little more code, using a strategy similar to:

double sum = 0;
size_t i;
// The condition i + 4 <= n (rather than i < n - 3) avoids unsigned
// wrap-around when n < 4.
for (i = 0; i + 4 <= n; i += 4)
  sum += A[i] + A[i+1] + A[i+2] + A[i+3];
for (; i < n; i++)  // remainder loop for the last few elements
  sum += A[i];
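

Combined with the Range template above, that strategy might look like this in the D code (a sketch reusing the loop body from the unrolled version, so the length asserts are no longer needed):

const int UNROLL = 5;
size_t i = 0;

// Unrolled main loop: processes as many full groups of UNROLL as fit.
for (; i + UNROLL <= ratings.length; i += UNROLL) {
    foreach (k; Range!(UNROLL)) {
        auto r = &(ratings[i + k]);
        auto aux = r.value - reputationObject[r.object];
        reputationUser[r.user] += aux * aux;
    }
}

// Remainder loop: the last few elements, one at a time.
for (; i < ratings.length; ++i) {
    auto r = &(ratings[i]);
    auto aux = r.value - reputationObject[r.object];
    reputationUser[r.user] += aux * aux;
}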


>I don't think it's the algorithm in any case (although it might be Phobos' implementation), as when working with Tango and LDC, changing between the Twister and Kiss random number generators doesn't make for any meaningful performance difference.<

The Phobos and Tango implementations behind uniform() are probably different: the Phobos version has enforce() and nextafter() calls that are probably not present in the Tango code.


>I noticed that when compiled with LDC the memory usage stays constant, but in DMD the memory use blows up.  Is this a D1/D2 difference (with D2 having the more advanced features that require preservation of memory) or is it a bug in one or the other of the compilers ... ?<

D2 dynamic arrays have been changed recently; D1 dynamic arrays have not changed, and LDC is still D1. So this is just a D1/D2 difference. The purpose of this change between D1 and D2 was to avoid bugs caused by "memory stomping" of array slices.
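
A minimal sketch of the kind of stomping bug this change prevents (exact behaviour depends on the runtime version, but this is the general idea):

void main() {
    int[] a = [1, 2, 3, 4];
    int[] b = a;        // b is a slice sharing a's memory
    a.length = 2;       // shrink a; b still refers to all four elements
    a ~= 99;            // D1: appends in place, overwriting b[2];
                        // D2: sees the block extends past a's end and reallocates
    assert(b[2] == 3);  // holds in D2; in D1, b[2] has silently become 99
}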

Bye,
bearophile
April 15, 2010
Don wrote:
> Lars T. Kyllingstad wrote:
>> Don wrote:
>>> bearophile wrote:
>>>> So far I've just given a light reading of the code. Notes:
>>>
>>>> - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you have to import std.math anyway, because of a bug).
>>>
>>> That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.
>>
>> Really?  Why is that?  I find that kind of disappointing, I always believed it to be a temporary solution.
>>
>> I think the inconsistency with the other operators will make this a major WTF for people new to the language.  Why should a^^b require an explicit import while a*b doesn't?
> 
> Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently.
> To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Is there a better way to do pow() for floating point numbers without importing std.math?

I see this:

1. You want to do x ^^ fp.
2. The compiler complains saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module".

Alternative 1 for user:
User says "Ok, I import std.math"

Alternative 2 for user:
User says "No way I'm importing std.math just to make a pow. But... I still *need* to make that pow... what is your advice, mr. compiler?"
April 15, 2010
> These are the middle parts (the two loops) of Yzlm.objectReputationUpdate compiled with LDC; this is one of the two main hot spots of the program. You can compare this asm with the asm of the same part of the C++ version:

To find out why the C++ code is faster, you can show me the equivalent C++ code so I can compile it myself, or you can compile your C++ code and show me the asm of the two functions that are the hot spots.

Bye,
bearophile
April 15, 2010
Ary Borenszweig wrote:
> Don wrote:
>> Lars T. Kyllingstad wrote:
>>> Don wrote:
>>>> bearophile wrote:
>>>>> So far I've just given a light reading of the code. Notes:
>>>>
>>>>> - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you have to import std.math anyway, because of a bug).
>>>>
>>>> That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.
>>>
>>> Really?  Why is that?  I find that kind of disappointing, I always believed it to be a temporary solution.
>>>
>>> I think the inconsistency with the other operators will make this a major WTF for people new to the language.  Why should a^^b require an explicit import while a*b doesn't?
>>
>> Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently.
>> To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.
> 
> Is there a better way to do pow() for floating point numbers without importing std.math?
> 
> I see this:
> 
> 1. You want to do x ^^ fp.
> 2. The compiler complains saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module".
> 
> Alternative 1 for user:
> User says "Ok, I import std.math"
> 
> Alternative 2 for user:
> User says "No way I'm importing std.math just to make a pow. But... I still *need* to make that pow... what is your advice, mr. compiler?"

That's a good point; it should be possible to use a static import as well. I do think it's pretty odd to be doing floating point without importing std.math, though. I mean, abs() is absolutely fundamental.
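
If a static import does end up being accepted here, usage might look like the following sketch (whether ^^ currently compiles with only a static import is exactly the open question):

static import std.math;  // makes the module available without polluting the scope

void main() {
    double x = 2.0;
    double y = x ^^ 0.5;  // the compiler lowers this onto std.math's pow machinery
}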