Thread overview
[phobos] FreeBSD segfaults with std.parallelism
Apr 29, 2011
David Simcha
Apr 29, 2011
David Simcha
Apr 29, 2011
Don Clugston
May 01, 2011
David Simcha
Apr 29, 2011
Rainer Schuetze
April 29, 2011
I've spent some serious time looking into the FreeBSD std.parallelism segfaults.  I'm at a complete loss as to what could be causing them or how to fix them.  Here are some observations.  Someone please offer any suggestions you have.

1.  I'm able to reproduce these, though much more sporadically, on Windows and Linux, by executing the unit test in a loop.

2.  On FreeBSD running GDB on the core dump shows stack traces that should be impossible.  Every time the program crashes, the function at the top of the stack should be unreachable from the second function from the top.  (It shouldn't even be indirectly reachable, i.e. inlining couldn't explain it.)  On both Linux and FreeBSD, the program counter ends up at illegal places in between instructions.  Even more weirdly, the address that the program counter ends up at when the segfault happens seems deterministic for any given platform and compiler settings.  Is there a good debugger for Windows that will give me stack traces and stuff like GDB?

3.  The triggering test is:

     auto lmchain = poolInstance.map!"a * a"(
         poolInstance.map!sqrt(
             poolInstance.asyncBuf(
                 iota(3_000_000)
             )
         )
     );
     foreach(i, elem; parallel(lmchain)) {
         assert(approxEqual(elem, i));
     }

In other words, it's the test that uses everything together (including Task and amap() under the hood), the hardest one to debug.

IIUC, the instruction stream can't be overwritten by a buggy program because the code pages are marked read-only.  The only other explanation I can think of for how the program counter could be corrupted is if some race condition corrupts either a function pointer or a return address on the stack.  However, in this case the address that the program counter ends up at when the segfault happens should be less deterministic.

April 29, 2011
I'll mention that I just was debugging some stuff in Linux for dcollections, and the line numbers for the unit tests given by gdb were off by a couple of lines (I had the same feelings you are having, how is this possible). That was just inside the unit tests. In the actual code, the line numbers were correct. What's more, when I would step over lines in gdb, the program would jump *back* some lines when there was no loop in the code.

I wouldn't trust that GDB is giving you accurate info.


My recommendation -- try using writeln debugging if possible.

-Steve




April 29, 2011
On Fri, Apr 29, 2011 at 9:41 AM, Steve Schveighoffer <schveiguy at yahoo.com> wrote:

> I'll mention that I just was debugging some stuff in Linux for dcollections, and the line numbers for the unit tests given by gdb were off by a couple of lines (I had the same feelings you are having, how is this possible).  That was just inside the unit tests.  In the actual code, the line numbers were correct.  What's more, when I would step over lines in gdb, the program would jump *back* some lines when there was no loop in the code.
>
> I wouldn't trust that GDB is giving you accurate info.
>
> My recommendation -- try using writeln debugging if possible.
>

Thanks for the info.  The problem with writeln is that, when dealing with nondeterministic bugs-from-Hell, you don't want to change timings or have screens and screens of stuff printed.  Now that I think of it, though, scope(failure) might be useful here.  For example:

import std.stdio;

void fun() {
    // Runs only if fun() is left because something was thrown.
    scope(failure) stderr.writeln("Failed in fun().");
    // Do stuff.
}

Also, I'll try running the code in a main block instead of a unittest block.  If it works, then it's a compiler bug w.r.t. unit tests.
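
For reference, a minimal sketch of what moving the triggering test out of the unittest block and into main() might look like. This is not the actual Phobos test. It assumes the global taskPool in place of the test's poolInstance, a Phobos recent enough to provide isClose (used here instead of approxEqual), and two hypothetical helpers, root() and runChain(). The repeat count of 20 is arbitrary.

```d
import std.math : isClose, sqrt;
import std.parallelism : parallel, taskPool;
import std.range : iota;

// Hypothetical module-level helper, so it can be passed to map as an alias
// without depending on how sqrt's overloads resolve for an int argument.
double root(int x)
{
    return sqrt(cast(double) x);
}

// Hypothetical helper: runs the chained pipeline once over [0, n).
void runChain(int n)
{
    auto lmchain = taskPool.map!"a * a"(
        taskPool.map!root(
            taskPool.asyncBuf(iota(n))
        )
    );
    foreach (i, elem; parallel(lmchain))
    {
        // sqrt(i) squared should give back i, up to rounding.
        assert(isClose(elem, cast(double) i));
    }
}

void main()
{
    // Repeat the run, since the crash is sporadic.
    foreach (attempt; 0 .. 20)
        runChain(3_000_000);
}
```

If the crash still shows up here, the unittest block itself is probably not the culprit.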
April 29, 2011
David Simcha wrote:
> I've spent some serious time looking into the FreeBSD std.parallelism segfaults.  I'm at a complete loss as to what could be causing them or how to fix them.  Here are some observations.  Someone please offer any suggestions you have.
[...]
> Is there a good debugger for Windows that will give me stack traces and stuff like GDB?
> 

For the best debugging experience I'd suggest converting the debug info with cv2pdb and using Visual Studio's debugger. You can also use Visual C++ Express/Visual Studio Shell.

mago is a debugger dedicated explicitly to D. It does not need cv2pdb, and it also plugs into Visual Studio.

Both cv2pdb and mago are installed with Visual D (mago and Visual D do not work with VS Express versions).
April 29, 2011
On 29 April 2011 16:08, David Simcha <dsimcha at gmail.com> wrote:
>
>
> On Fri, Apr 29, 2011 at 9:41 AM, Steve Schveighoffer <schveiguy at yahoo.com> wrote:
>>
>> I'll mention that I just was debugging some stuff in Linux for
>> dcollections, and the line numbers for the unit tests given by gdb were off
>> by a couple of lines (I had the same feelings you are having, how is this
>> possible). That was just inside the unit tests. In the actual code, the
>> line numbers were correct. What's more, when I would step over lines in
>> gdb, the program would jump *back* some lines when there was no loop in the
>> code.
>> I wouldn't trust that GDB is giving you accurate info.
>>
>> My recommendation -- try using writeln debugging if possible.
>
> Thanks for the info. The problem with writeln is that, when dealing with nondeterministic bugs-from-Hell, you don't want to change timings or have screens and screens of stuff printed.

On Windows, use OutputDebugString("str").

(signature: extern(Windows) void OutputDebugStringA(const(char)* s); )

Download dbgview from Sysinternals (now owned by Microsoft) to see the results. This is the only way I've ever been able to debug threading problems.
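
A minimal sketch of how the OutputDebugString suggestion might be wrapped in D. The debugTrace() helper is hypothetical, and the fallback to stderr on other platforms is an assumption added only so the example builds everywhere. The extern declaration uses the ANSI variant of the Win32 function with a const pointer parameter.

```d
import std.stdio : stderr;
import std.string : toStringz;

version (Windows)
{
    // Win32 API (ANSI variant); the message shows up in a debugger or dbgview.
    extern (Windows) void OutputDebugStringA(const(char)* s);
}

// Hypothetical helper: sends one message to the debug output channel.
void debugTrace(string msg)
{
    version (Windows)
        OutputDebugStringA(msg.toStringz); // the API expects a NUL-terminated string
    else
        stderr.writeln(msg);               // assumed fallback for non-Windows builds
}

void main()
{
    debugTrace("worker started");
}
```

On Windows the message does not go to the console, so screens of program output are not interleaved with the trace.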
May 01, 2011
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20110501/6301a445/attachment.html>