Thread overview
[phobos] std.parallelism's unit tests randomly hang on win32
Apr 30, 2011
Walter Bright
May 01, 2011
David Simcha
May 01, 2011
David Simcha
May 02, 2011
Sean Kelly
May 02, 2011
David Simcha
May 03, 2011
Walter Bright
May 03, 2011
David Simcha
May 03, 2011
Walter Bright
May 04, 2011
David Simcha
May 04, 2011
Walter Bright
May 04, 2011
David Simcha
May 04, 2011
Walter Bright
May 04, 2011
David Simcha
May 04, 2011
David Simcha
May 04, 2011
Sean Kelly
May 04, 2011
David Simcha
May 03, 2011
David Simcha
April 30, 2011
I have a dual core system, if that helps.
May 01, 2011
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20110501/c39370ef/attachment-0001.html>
May 01, 2011
Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .)  This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.

Therefore, somehow the low order bits of pointers are getting corrupted with the value of TaskStatus.done in several places.  This is strong evidence that the underlying issue is a codegen bug or a bug in the ASM for the atomic ops, not a concurrency bug.
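To make the two symptoms above concrete, here is a minimal C sketch of the checks one could apply to a suspect pointer value. It is only an illustration, not code from std.parallelism: `TASK_STATUS_DONE`, `ASSUMED_ALIGN`, and both function names are hypothetical, and it assumes 48-bit virtual addresses and 8-byte-aligned objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TaskStatus.done (the post mentions 1 or 2). */
enum { TASK_STATUS_DONE = 2 };

/* Assumed alignment of the pointed-to objects; an assumption of this sketch. */
enum { ASSUMED_ALIGN = 8 };

/* True if addr is canonical on x86-64 with 48-bit virtual addresses:
   bits 63..47 must be all zeros or all ones. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;              /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;
}

/* True if the low bits that alignment should force to zero hold the
   status value instead, which is the corruption pattern described above. */
static bool low_bits_look_like_status(uint64_t addr)
{
    uint64_t low = addr & (ASSUMED_ALIGN - 1);
    return low != 0 && low == TASK_STATUS_DONE;
}
```

A pointer that fails `is_canonical48` is the wild pointer that faults on dereference. A pointer that passes it but trips `low_bits_look_like_status` would be the corrupted pointer one step earlier in the chain.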
May 01, 2011
Is core.atomic not saving and restoring a register it should?

On May 1, 2011, at 12:44 PM, David Simcha <dsimcha at gmail.com> wrote:

> Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .)  This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.
> 
> Therefore, somehow the low order bits of pointers are getting corrupted with the value of TaskStatus.done in several places.  This is strong evidence that the underlying issue is a codegen bug or a bug in the ASM for the atomic ops, not a concurrency bug.
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
May 01, 2011
On 5/1/2011 8:30 PM, Sean Kelly wrote:
> Is core.atomic not saving and restoring a register it should?
>

It doesn't look like it.  I've reviewed this code to check for that.  I think it could be one of two things:

1.  A DMD codegen bug that's not saving the registers properly.

2.  An extremely subtle concurrency bug that's causing codepaths that "can't happen" to be executed, meaning DMD doesn't save the registers properly because the relevant codepaths "can't happen".

May 01, 2011
On 05/01/2011 06:08 PM, David Simcha wrote:
> On 5/1/2011 8:30 PM, Sean Kelly wrote:
>> Is core.atomic not saving and restoring a register it should?
>>
>
> Doesn't look like.  I've reviewed this code to check for that.  I think it could be one of two things:
>
> 1.  A DMD codegen bug that's not saving the registers properly.
>
> 2.  An extremely subtle concurrency bug that's causing codepaths that "can't happen" to be executed, meaning DMD doesn't save the registers properly because the relevant codepaths "can't happen".
>

If it is the second case, then the same problem might also exist in DMC, which suggests it might be detectable with the methods this guy is developing and using:

http://blog.regehr.org/archives/503

(The first half is background, the second half is the relevant part)


May 02, 2011
On 4/30/2011 3:26 AM, Walter Bright wrote:
> I have a dual core system, if that helps.

The issues with std.parallelism happen on both 32-bit and 64-bit, and on both DMD and GDC.  DMD and GDC share a frontend but use different backends, and a bug in register management would likely be in the backend, not the frontend.  Does this point to the root cause being a concurrency bug?
May 02, 2011

On 5/1/2011 12:44 PM, David Simcha wrote:
> Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .)  This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.
>

Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.
May 03, 2011
On 5/3/2011 1:32 AM, Walter Bright wrote:
>
>
> On 5/1/2011 12:44 PM, David Simcha wrote:
>> Update:  The segfaults on Linux64 are also being caused by the low order bit corruption bug.  Whenever I look at the chain of pointer dereferences in GDB by viewing the registers and disassembly on crash, the segfault is always caused by dereferencing a pointer to some memory address that's clearly illegal.  (On x64 user mode addresses can't have their high order bits set, and the addresses being dereferenced often do.  See http://en.wikipedia.org/wiki/X64#Virtual_address_space_details .) This wild pointer is obtained by dereferencing another pointer whose low order bits are always equal to TaskStatus.done.  For example, if TaskStatus.done == 2, the pointer might be something like 0x0000ABCD EF123402.  If TaskStatus.done == 1, it will be something like 0x0000ABCD EF123401.
>>
>
> Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.

I've been trying to do that, but I think there are multiple places where this is happening, and the asserts are affecting codegen or timings just enough to prevent some of the failures from showing up.  Similarly, putting a try/catch block in some seemingly unrelated place prevents certain manifestations on Windows.
May 03, 2011

On 5/3/2011 5:43 AM, David Simcha wrote:
>
>> Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.
>
>
> Been trying to do that, but I think there are multiple places where this is happening and the asserts are affecting codegen or timings just enough to prevent some.

You can also do the simple:

     if (ptr == bad_value) *((char*)0) = 0;

which doesn't perturb timings or code gen much. I use these often. The debugger will tell you which one tripped.
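
For anyone wanting to try the trick above, here is a rough C sketch of it wrapped in a helper. The names `trap_if` and `check_ptr` are made up for illustration. One caveat: writing through a null pointer is undefined behavior in C, so an optimizer is allowed to drop or rewrite a plain `*((char*)0)=0`. This sketch goes through a `volatile` pointer variable so the store is actually emitted. It assumes a typical desktop OS where page zero is unmapped, so the write faults and the debugger stops on that line.

```c
#include <stddef.h>

/* Fault on purpose when cond is true. The pointer variable is volatile,
   so the compiler must load it at run time and cannot prove it is null. */
static void trap_if(int cond)
{
    if (cond) {
        char *volatile null_target = NULL;
        *null_target = 0;   /* expected to fault here on typical systems */
    }
}

/* Hypothetical call site: trap if ptr equals a known-bad value,
   otherwise return 1 so the caller can carry on. */
static int check_ptr(const void *ptr, const void *bad_value)
{
    trap_if(ptr == bad_value);
    return 1;
}
```

Because each call site is a separate `trap_if` line, the faulting address in the debugger identifies which check tripped, and the only cost on the normal path is a compare and a branch.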