std.parallelism's unit tests randomly hang on win32

I have a dual core system, if that helps.

Update:  The segfaults on Linux64 are also caused by the low-order-bit corruption bug.  Whenever I follow the chain of pointer dereferences in GDB by viewing the registers and disassembly at the crash, the segfault is always caused by dereferencing a pointer to a memory address that's clearly illegal.  (On x64, user-mode addresses can't have their high-order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .)  This wild pointer is obtained by dereferencing another pointer whose low-order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCDEF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCDEF123401.

Therefore, somehow the low order bits of pointers are getting corrupted with the value of TaskStatus.done in several places.  This is strong evidence that the underlying issue is a codegen bug or a bug in the ASM for the atomic ops, not a concurrency bug.
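
The two symptoms described above (high-order bits set on a user-mode address, and low-order bits carrying a small status value) can each be tested with a few lines of C. This is only a sketch under stated assumptions: it assumes 48-bit virtual addresses and assumes the pointers of interest are at least 8-byte aligned. The function names are made up for illustration and are not part of std.parallelism or druntime.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses, the usual x86-64 layout. */
#define VA_BITS 48

/* An address is canonical when bits 63..47 are all copies of bit 47. */
bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VA_BITS - 1);   /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;
}

/* User-mode addresses live in the lower canonical half: bits 63..47 all zero. */
bool is_user_canonical(uint64_t addr)
{
    return (addr >> (VA_BITS - 1)) == 0;
}

/* Assumption: align is a power of two. For a pointer that should be
   align-byte aligned, any nonzero result is a stray low-order value. */
unsigned low_bits(uint64_t addr, unsigned align)
{
    return (unsigned)(addr & (uint64_t)(align - 1));
}
```

With these, a register value copied out of the debugger can be classified quickly: a nonzero low_bits(addr, 8) equal to the numeric value of TaskStatus.done would match the corruption pattern, and a false result from is_user_canonical would explain the fault on dereference.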

Is core.atomic not saving and restoring a register it should?

On May 1, 2011, at 12:44 PM, David Simcha <dsimcha at gmail.com> wrote:

> Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .)  This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.
> 
> Therefore, somehow the low order bits of pointers are getting corrupted with the value of TaskStatus.done in several places.  This is strong evidence that the underlying issue is a codegen bug or a bug in the ASM for the atomic ops, not a concurrency bug.
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos

On 5/1/2011 8:30 PM, Sean Kelly wrote:
> Is core.atomic not saving and restoring a register it should?
>

Doesn't look like it.  I've reviewed this code to check for that.  I think it could be one of two things:

1.  A DMD codegen bug that's not saving the registers properly.

2.  An extremely subtle concurrency bug that's causing codepaths that "can't happen" to be executed, meaning DMD doesn't save the registers properly because the relevant codepaths "can't happen".

On 05/01/2011 06:08 PM, David Simcha wrote:
> On 5/1/2011 8:30 PM, Sean Kelly wrote:
>> Is core.atomic not saving and restoring a register it should?
>>
>
> Doesn't look like.  I've reviewed this code to check for that.  I think it could be one of two things:
>
> 1.  A DMD codegen bug that's not saving the registers properly.
>
> 2.  An extremely subtle concurrency bug that's causing codepaths that "can't happen" to be executed, meaning DMD doesn't save the registers properly because the relevant codepaths "can't happen".
>

If it is the second case, then the problem might also exist in DMC, which suggests that it might be open to attack via the methods this guy is developing/using:

http://blog.regehr.org/archives/503

(The first half is background, the second half is the relevant part)

On 4/30/2011 3:26 AM, Walter Bright wrote:
> I have a dual core system, if that helps.

The issues with std.parallelism happen on both 32- and 64-bit, and on both DMD and GDC.  DMD and GDC share a frontend but use different backends, and a bug in register management would likely be in the backend, not the frontend.  Does this point to the root cause being a concurrency bug?

On 5/1/2011 12:44 PM, David Simcha wrote:
> Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .)  This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.
>

Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.

On 5/3/2011 1:32 AM, Walter Bright wrote:
>
>
> On 5/1/2011 12:44 PM, David Simcha wrote:
>> Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .) This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.
>>
>
> Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.

Been trying to do that, but I think there are multiple places where this is happening, and the asserts are affecting codegen or timings just enough to prevent some of the failures from showing up.  Similarly, putting a try/catch block in some seemingly unrelated place prevents certain manifestations on Windows.

On 5/3/2011 5:43 AM, David Simcha wrote:
>
>> Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.
>
>
> Been trying to do that, but I think there are multiple places where this is happening and the asserts are affecting codegen or timings just enough to prevent some.

You can also do the simple:

     if (ptr == bad_value) *((char*)0) = 0;

which doesn't perturb timings or code gen much. I use these often. The debugger will tell you which one tripped.
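
A slightly fuller sketch of the same trick in C, for anyone who wants to drop it in at several call sites. The macro and helper names here are hypothetical, and the 8-byte alignment check is an assumption about the pointers being watched rather than anything taken from std.parallelism. The store goes through a volatile pointer so that an optimizing compiler is less likely to remove it.

```c
#include <stdint.h>

/* Assumption: the watched pointers are 8-byte aligned, so any of the
   low three bits being set means the value has been corrupted. */
int low_bits_set(const void *p)
{
    return ((uintptr_t)p & 7u) != 0;
}

/* Deliberately store to address 0 when cond holds. The fault stops the
   debugger at the exact check that tripped, at the cost of one compare
   and branch on the normal path. */
#define TRAP_IF(cond) \
    do { if (cond) *(volatile char *)0 = 0; } while (0)

/* Hypothetical call site: pass the pointer through unchanged unless it
   shows the corruption pattern. */
const void *checked(const void *p)
{
    TRAP_IF(low_bits_set(p));
    return p;
}
```

Wrapping each suspicious load in something like checked() gives one distinct faulting instruction per site, so the backtrace identifies which load first saw the bad value.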
