Floating Point + Threads?
April 16, 2011
I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism.  The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .

Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel.  It uses taskPool.reduce, which divides the summation into work units to be executed in parallel.  When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true:

1.  The work is divided into work units in a deterministic fashion.

2.  Within each work unit, the summation happens in a deterministic order.

3.  The final summation of the results of all the work units is done in a deterministic order.

4.  The smallest term in the summation is about 5e-10.  This means the difference across runs is about two orders of magnitude smaller than the smallest term.  It can't be a concurrency bug where some terms sometimes get skipped.

5.  The results for the individual tasks, not just the final summation, differ in the low-order bits.  Each task is executed in a single thread.

6.  The rounding mode is apparently the same in all of the threads.

7.  The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to >0.  Since it's a single core machine, it can't be a low level memory model issue.

What could possibly cause such small, non-deterministic differences in floating point results, given everything above?  I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this.
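
For reference, here is a self-contained sketch of roughly the shape of the summation being discussed: taskPool.reduce over a lazy std.algorithm.map range. The constants and the mapped expression are illustrative (borrowed from the reduced test case later in this thread), not the actual example from the issue:

import std.algorithm, std.parallelism, std.range, std.stdio;

void main() {
    // On a single-core machine the pool defaults to zero workers; setting
    // defaultPoolThreads = 1 before the first use of taskPool forces one.
    immutable n = 1_000_000;
    immutable delta = 1.0L / n;

    // Lazily computed terms; taskPool.reduce splits the range into work
    // units, sums each on a worker thread, then combines the partial sums.
    auto terms = map!((int i) {
        immutable x = (i - 0.5L) * delta;
        return delta / (1.0L + x * x);
    })(iota(n));

    writefln("%.19f", taskPool.reduce!"a + b"(terms));  // approximately pi / 4
}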
April 16, 2011
On 4/15/11 10:22 PM, dsimcha wrote:
> I'm trying to debug an extremely strange bug whose symptoms appear in a
> std.parallelism example, though I'm not at all sure the root cause is in
> std.parallelism. The bug report is at
> https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .

Does the scheduling affect the summation order?

Andrei
April 16, 2011
On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha@yahoo.com> wrote:

> I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism.  The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
>
> Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel.  It uses taskPool.reduce, which divides the summation into work units to be executed in parallel.  When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true:
>
> 1.  The work is divided into work units in a deterministic fashion.
>
> 2.  Within each work unit, the summation happens in a deterministic order.
>
> 3.  The final summation of the results of all the work units is done in a deterministic order.
>
> 4.  The smallest term in the summation is about 5e-10.  This means the difference across runs is about two orders of magnitude smaller than the smallest term.  It can't be a concurrency bug where some terms sometimes get skipped.
>
> 5.  The results for the individual tasks, not just the final summation, differ in the low-order bits.  Each task is executed in a single thread.
>
> 6.  The rounding mode is apparently the same in all of the threads.
>
> 7.  The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to >0.  Since it's a single core machine, it can't be a low level memory model issue.
>
> What could possibly cause such small, non-deterministic differences in floating point results, given everything above?  I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this.

Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation). On the other hand, any differences should at least be repeatable from run to run.
As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation.
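
As an aside, a minimal sketch of the Kahan (compensated) summation mentioned above, in D; the function name and the use of real are illustrative choices, not anything prescribed in the thread:

import std.stdio;

real kahanSum(const(real)[] terms) {
    real sum = 0.0;
    real c = 0.0;               // running compensation for lost low-order bits
    foreach (t; terms) {
        immutable y = t - c;    // apply the carried correction to the next term
        immutable s = sum + y;  // low-order bits of y may be lost in this addition
        c = (s - sum) - y;      // recover what was actually added and subtract y
        sum = s;
    }
    return sum;
}

void main() {
    auto terms = new real[1_000_000];
    foreach (i, ref t; terms) t = 1.0L / (i + 1);
    writefln("%.19f", kahanSum(terms));
}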
April 16, 2011
On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
> On 4/15/11 10:22 PM, dsimcha wrote:
>> I'm trying to debug an extremely strange bug whose symptoms appear in a
>> std.parallelism example, though I'm not at all sure the root cause is in
>> std.parallelism. The bug report is at
>> https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
>
> Does the scheduling affect the summation order?

That's a good thought. FP addition results can differ dramatically depending on associativity.
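
A tiny illustration of that non-associativity (the values are arbitrary; with DMD, intermediates may be held at 80-bit real precision, so the exact digits printed can vary by platform; the point is only that the two groupings need not agree):

import std.stdio;

void main() {
    double a = 0.1, b = 0.2, c = 0.3;
    // Each intermediate sum is rounded separately, so regrouping the
    // same three terms can change the low-order bits of the result.
    writefln("%.17f", (a + b) + c);
    writefln("%.17f", a + (b + c));
}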
April 16, 2011
== Quote from Walter Bright (newshound2@digitalmars.com)'s article
> On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
> > On 4/15/11 10:22 PM, dsimcha wrote:
> >> I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
> >
> > Does the scheduling affect the summation order?
> That's a good thought. FP addition results can differ dramatically depending on associativity.

And not to forget optimisations too. ;)
April 16, 2011
On 4/15/2011 11:40 PM, Andrei Alexandrescu wrote:
> On 4/15/11 10:22 PM, dsimcha wrote:
>> I'm trying to debug an extremely strange bug whose symptoms appear in a
>> std.parallelism example, though I'm not at all sure the root cause is in
>> std.parallelism. The bug report is at
>> https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
>> .
>
> Does the scheduling affect the summation order?
>
> Andrei

No.  I realize floating point addition isn't associative, but unless there's some detail I'm forgetting about, the ordering is deterministic within each work unit and the ordering of the final summation is deterministic.
April 16, 2011
On 4/16/2011 2:09 AM, Robert Jacques wrote:
> On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha@yahoo.com> wrote:
>
>> I'm trying to debug an extremely strange bug whose symptoms appear in
>> a std.parallelism example, though I'm not at all sure the root cause
>> is in std.parallelism. The bug report is at
>> https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
>> .
>>
>> Basically, the example in question sums up all the elements of a lazy
>> range (actually, std.algorithm.map) in parallel. It uses
>> taskPool.reduce, which divides the summation into work units to be
>> executed in parallel. When executed in parallel, the results of the
>> summation are non-deterministic after about the 12th decimal place,
>> even though all of the following properties are true:
>>
>> 1. The work is divided into work units in a deterministic fashion.
>>
>> 2. Within each work unit, the summation happens in a deterministic order.
>>
>> 3. The final summation of the results of all the work units is done in
>> a deterministic order.
>>
>> 4. The smallest term in the summation is about 5e-10. This means the
>> difference across runs is about two orders of magnitude smaller than
>> the smallest term. It can't be a concurrency bug where some terms
>> sometimes get skipped.
>>
>> 5. The results for the individual tasks, not just the final summation,
>> differ in the low-order bits. Each task is executed in a single thread.
>>
>> 6. The rounding mode is apparently the same in all of the threads.
>>
>> 7. The bug appears even on machines with only one core, as long as the
>> number of task pool threads is manually set to >0. Since it's a single
>> core machine, it can't be a low level memory model issue.
>>
>> What could possibly cause such small, non-deterministic differences in
>> floating point results, given everything above? I'm just looking for
>> suggestions here, as I don't even know where to start hunting for a
>> bug like this.
>
> Well, on one hand floating point math is not associative, and running
> sums have many known issues (I'd recommend looking up Kahan summation).
> On the other hand, any differences should at least be repeatable from
> run to run.
> As for suggestions? First and foremost, you should always add small to
> large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should
> the answer be better, but if your error rate goes down, you have a good
> idea of where the problem is. I'd also try isolating your
> implementation's numerics from the underlying concurrency, i.e. use a
> task pool of 1 and don't let the host thread join it, so the entire job
> is done by one worker. The other thing to try is isolating/removing map
> and iota from the equation.

Right.  For this example, though, assuming floating point math behaves like regular math is a good enough approximation.  The issue isn't that the results aren't reasonably accurate.  Furthermore, the results will always change slightly depending on how many work units you have.  (I even warn in the documentation that floating point addition is not associative, though it is approximately associative in the well-behaved cases.)

My only concern is whether this non-determinism represents some deep underlying bug.  For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all.  If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz.  Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
April 16, 2011
On 4/16/2011 10:11 AM, dsimcha wrote:
> My only concern is whether this non-determinism represents some deep
> underlying bug. For a given work unit allocation (work unit allocations
> are deterministic and only change when the number of threads changes or
> they're changed explicitly), I can't figure out how scheduling could
> change the results at all. If I could be sure that it wasn't a symptom
> of an underlying bug in std.parallelism, I wouldn't care about this
> small amount of numerical fuzz. Floating point math is always inexact
> and parallel summation by its nature can't be made to give the exact
> same results as serial summation.

Ok, it's definitely **NOT** a bug in std.parallelism.  Here's a reduced test case that only uses core.thread, not std.parallelism.  All it does is sum an array using std.algorithm.reduce from the main thread, then start a new thread to do the same thing and compare answers.  At the beginning of the summation function the rounding mode is printed to verify that it's the same for both threads.  The two threads get slightly different answers.

Just to thoroughly rule out a concurrency bug, I didn't even let the two threads execute concurrently.  The main thread produces its result, then starts and immediately joins the second thread.

import std.algorithm, core.thread, std.stdio, core.stdc.fenv;

real sumRange(const(real)[] range) {
    writeln("Rounding mode:  ", fegetround);  // 0 from both threads.
    return reduce!"a + b"(range);
}

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    auto terms = new real[1_000_000];
    foreach(i, ref term; terms) {
        immutable x = ( i - 0.5 ) * delta;
        term = delta / ( 1.0 + x * x ) * 1;
    }

    // Sum once on the main thread.
    immutable res1 = sumRange(terms);
    writefln("%.19f", res1);

    // Sum the identical data again on a freshly started thread, joining it
    // immediately so the two summations never run concurrently.
    real res2;
    auto t = new Thread( { res2 = sumRange(terms); } );
    t.start();
    t.join();
    writefln("%.19f", res2);
}


Output:
Rounding mode:  0
0.7853986633972191094
Rounding mode:  0
0.7853986633972437348
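
As a sketch of how close those two printouts actually are (the literals below are simply the two values printed above), std.math.feqrel counts the mantissa bits on which two values agree:

import std.math, std.stdio;

void main() {
    immutable real res1 = 0.7853986633972191094L;
    immutable real res2 = 0.7853986633972437348L;
    writeln("mantissa bits in agreement: ", feqrel(res1, res2));
    writefln("absolute difference: %.3e", fabs(res1 - res2));  // on the order of 1e-14
}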
April 16, 2011
On 4/16/11 9:52 AM, dsimcha wrote:
> Output:
> Rounding mode: 0
> 0.7853986633972191094
> Rounding mode: 0
> 0.7853986633972437348

I think at this precision the difference may be in random bits. Probably you don't need to worry about it.

Andrei
April 16, 2011
On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:
> On 4/16/11 9:52 AM, dsimcha wrote:
>> Output:
>> Rounding mode: 0
>> 0.7853986633972191094
>> Rounding mode: 0
>> 0.7853986633972437348
>
> I think at this precision the difference may be in random bits. Probably
> you don't need to worry about it.
>
> Andrei

"random bits"?  I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective.  I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless.  Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.