Thread overview
[dmd-internals] Calling Convention Bugs?
Apr 30, 2011
David Simcha
May 01, 2011
Sean Kelly
May 01, 2011
David Simcha
April 30, 2011
I've spent a whole bunch of time trying to debug my std.parallelism Heisenbug from Hell.  My latest theory is that the root cause is actually a calling convention bug (either in DMD or in some inline ASM; the results below were produced using core.atomic, not any of my own inline ASM) but only manifests in std.parallelism under very specific thread interleavings that lead to infrequently used code paths.

On Windows, I get failed unit tests instead of segfaults, which makes life slightly easier.  (Hopefully these have the same root cause, I'm guessing they do.)  I finally managed to instrument my Map pipelining class to print a whole bunch of data whenever the bug occurs, without changing the timings to prevent the bug from occurring.  The issue appears to be that, about one out of every 10,000 times, the whole source buffer gets corrupted.  However, the funny thing is that there's a very specific pattern to the corruption.  For example, the following buffer is supposed to contain the square roots of all numbers between 1380299 and 1380399 but instead has the following:

[-8.29044e+16, -8.45933e+16, -8.68451e+16, -8.85339e+16, -9.07857e+16, -9.24746e+16, -9.47264e+16, -9.64152e+16, -9.8667e+16, -1.00356e+17, -1.02608e+17, -1.04297e+17, -1.06548e+17, -1.08237e+17, -1.10489e+17, -1.12178e+17, -1.1443e+17, -1.16118e+17, -1.1837e+17, -1.20059e+17, -1.22311e+17, -1.24e+17, -1.26252e+17, -1.2794e+17, -1.30192e+17, -1.31881e+17, -1.34133e+17, -1.35822e+17, -1.38074e+17, -1.39762e+17, -1.42014e+17, -1.43703e+17, -1.47794e+17, -1.51172e+17, -1.55676e+17, -1.59053e+17, -1.63557e+17, -1.66935e+17, -1.70312e+17, -1.74816e+17, -1.78194e+17, -1.82697e+17, -1.86075e+17, -1.90579e+17, -1.93956e+17, -1.9846e+17, -2.01838e+17, -2.06341e+17, -2.09719e+17, -2.14223e+17, -2.176e+17, -2.22104e+17, -2.25482e+17, -2.29985e+17, -2.33363e+17, -2.37866e+17, -2.41244e+17, -2.45748e+17, -2.49125e+17, -2.53629e+17, -2.57007e+17, -2.6151e+17, -2.64888e+17, -2.69392e+17, -2.72769e+17, -2.77273e+17, -2.80651e+17, -2.85154e+17, -2.88834e+17, -2.97841e+17, -3.04596e+17, -3.13603e+17, -3.20359e+17, -3.27114e+17, -3.36121e+17, -3.42877e+17, -3.51884e+17, -3.58639e+17, -3.67647e+17, -3.74402e+17, -3.83409e+17, -3.90165e+17, -3.99172e+17, -4.05927e+17, -4.14934e+17, -4.2169e+17, -4.30697e+17, -4.37452e+17, -4.4646e+17, -4.53215e+17, -4.62222e+17, -4.68978e+17, -4.77985e+17, -4.8474e+17, -4.93747e+17, -5.00503e+17, -5.0951e+17, -5.16265e+17, -5.25273e+17, 2.45984e-41]

The interesting thing is that there's a very regular pattern to these numbers.  They're always sorted like this, except that the last number is  always tiny.  Don't ask me how I thought of this, but I figured the low-order bits of the pointer to this buffer must somehow be getting corrupted, so that the numbers are being viewed and/or written out-of-frame.  Sure enough, when I shift the reading frame of this array by two bytes, I get:

[1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9]

To within rounding error (since the numbers are only printed to two decimal places), 1174.86 ^ 2 == 1380299 and 1149.4 ^ 2 == 1380399.  The other failures display similar patterns, where the source buffer is being read or written to two bytes out of frame.  There's even a significance to the number 2.  This is the enum value that's used to signal TaskStatus.done.  It gets passed to the atomic functions all the time, probably in an 8-bit register (AL, BL, CL, DL, etc.).  Sure enough, if I change TaskStatus.done to be 3 instead of 2, my reading frame ends up 3 bytes out of phase instead of 2.  Since the source buffer is 100 elements long, and each element is 4 bytes, it is allocated in the GC's 512-byte pool and is aligned on 512-byte boundaries.  Therefore, allowing TaskStatus.done to overwrite the low-order bits of a 32- or 64-bit register holding the pointer to the source buffer would shift the reading frame by TaskStatus.done bytes.

Further proving that this is the issue, when I print out the address of the corrupt buffer, it's actually 2 or 3 (depending on the value of TaskStatus.done) bytes offset from a proper 4-byte float alignment.

Does anyone know if there's already a bug report about this?  I barely know where to start trying to create a sane test case that's not dependent on some ridiculously complicated code path and thread interleavings, but I do know that this qualifies as war story debugging and will make for a great answer if I'm ever on a job interview and the interviewer wants to know what the toughest thing I ever debugged is.
May 01, 2011
I don't suppose you're using core atomic with floating-point values?  That won't work, currently.

Sent from my iPhone

On Apr 30, 2011, at 1:21 PM, David Simcha <dsimcha at gmail.com> wrote:

> I've spent a whole bunch of time trying to debug my std.parallelism Heisenbug from Hell.  My latest theory is that the root cause is actually a calling convention bug (either in DMD or in some inline ASM; the results below were produced using core.atomic, not any of my own inline ASM) but only manifests in std.parallelism under very specific thread interleavings that lead to infrequently used code paths.
> 
> On Windows, I get failed unit tests instead of segfaults, which makes life slightly easier.  (Hopefully these have the same root cause, I'm guessing they do.)  I finally managed to instrument my Map pipelining class to print a whole bunch of data whenever the bug occurs, without changing the timings to prevent the bug from occurring.  The issue appears to be that, about one out of every 10,000 times, the whole source buffer gets corrupted.  However, the funny thing is that there's a very specific pattern to the corruption.  For example, the following buffer is supposed to contain the square roots of all numbers between 1380299 and 1380399 but instead has the following:
> 
> [-8.29044e+16, -8.45933e+16, -8.68451e+16, -8.85339e+16, -9.07857e+16, -9.24746e+16, -9.47264e+16, -9.64152e+16, -9.8667e+16, -1.00356e+17, -1.02608e+17, -1.04297e+17, -1.06548e+17, -1.08237e+17, -1.10489e+17, -1.12178e+17, -1.1443e+17, -1.16118e+17, -1.1837e+17, -1.20059e+17, -1.22311e+17, -1.24e+17, -1.26252e+17, -1.2794e+17, -1.30192e+17, -1.31881e+17, -1.34133e+17, -1.35822e+17, -1.38074e+17, -1.39762e+17, -1.42014e+17, -1.43703e+17, -1.47794e+17, -1.51172e+17, -1.55676e+17, -1.59053e+17, -1.63557e+17, -1.66935e+17, -1.70312e+17, -1.74816e+17, -1.78194e+17, -1.82697e+17, -1.86075e+17, -1.90579e+17, -1.93956e+17, -1.9846e+17, -2.01838e+17, -2.06341e+17, -2.09719e+17, -2.14223e+17, -2.176e+17, -2.22104e+17, -2.25482e+17, -2.29985e+17, -2.33363e+17, -2.37866e+17, -2.41244e+17, -2.45748e+17, -2.49125e+17, -2.53629e+17, -2.57007e+17, -2.6151e+17, -2.64888e+17, -2.69392e+17, -2.72769e+17, -2.77273e+17, -2.80651e+17, -2.85154e+17, -2.88834e+17, -2.97841e+17, -3.04596e+17, -3.13603e+17, -3.20359e+17, -3.27114e+17, -3.36121e+17, -3.42877e+17, -3.51884e+17, -3.58639e+17, -3.67647e+17, -3.74402e+17, -3.83409e+17, -3.90165e+17, -3.99172e+17, -4.05927e+17, -4.14934e+17, -4.2169e+17, -4.30697e+17, -4.37452e+17, -4.4646e+17, -4.53215e+17, -4.62222e+17, -4.68978e+17, -4.77985e+17, -4.8474e+17, -4.93747e+17, -5.00503e+17, -5.0951e+17, -5.16265e+17, -5.25273e+17, 2.45984e-41]
> 
> The interesting thing is that there's a very regular pattern to these numbers.  They're always sorted like this, except that the last number is  always tiny.  Don't ask me how I thought of this, but I figured the low-order bits of the pointer to this buffer must somehow be getting corrupted, so that the numbers are being viewed and/or written out-of-frame.  Sure enough, when I shift the reading frame of this array by two bytes, I get:
> 
> [1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9]
> 
> To within rounding error (since the numbers are only printed to two decimal places), 1174.86 ^ 2 == 1380299 and 1149.4 ^ 2 == 1380399.  The other failures display similar patterns, where the source buffer is being read or written to two bytes out of frame.  There's even a significance to the number 2.  This is the enum value that's used to signal TaskStatus.done.  It gets passed to the atomic functions all the time, probably in an 8-bit register (AL, BL, CL, DL, etc.).  Sure enough, if I change TaskStatus.done to be 3 instead of 2, my reading frame ends up 3 bytes out of phase instead of 2.  Since the source buffer is 100 elements long, and each element is 4 bytes, it is allocated in the GC's 512-byte pool and is aligned on 512-byte boundaries.  Therefore, allowing TaskStatus.done to overwrite the low-order bits of a 32- or 64-bit register holding the pointer to the source buffer would shift the reading frame by TaskStatus.done bytes.
> 
> Further proving that this is the issue, when I print out the address of the corrupt buffer, it's actually 2 or 3 (depending on the value of TaskStatus.done) bytes offset from a proper 4-byte float alignment.
> 
> Does anyone know if there's already a bug report about this?  I barely know where to start trying to create a sane test case that's not dependent on some ridiculously complicated code path and thread interleavings, but I do know that this qualifies as war story debugging and will make for a great answer if I'm ever on a job interview and the interviewer wants to know what the toughest thing I ever debugged is.
> _______________________________________________
> dmd-internals mailing list
> dmd-internals at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-internals
May 01, 2011
Nope.  I'm using it with ubytes.  However, pointers to floats are getting corrupted somehow, and since the details of how they get corrupted are related to the value of TaskStatus.done, I'm assuming it's probably related to atomics.

On 5/1/2011 11:08 AM, Sean Kelly wrote:
> I don't suppose you're using core atomic with floating-point values?  That won't work, currently.
>
> Sent from my iPhone
>
> On Apr 30, 2011, at 1:21 PM, David Simcha<dsimcha at gmail.com>  wrote:
>
>> I've spent a whole bunch of time trying to debug my std.parallelism Heisenbug from Hell.  My latest theory is that the root cause is actually a calling convention bug (either in DMD or in some inline ASM; the results below were produced using core.atomic, not any of my own inline ASM) but only manifests in std.parallelism under very specific thread interleavings that lead to infrequently used code paths.
>>
>> On Windows, I get failed unit tests instead of segfaults, which makes life slightly easier.  (Hopefully these have the same root cause, I'm guessing they do.)  I finally managed to instrument my Map pipelining class to print a whole bunch of data whenever the bug occurs, without changing the timings to prevent the bug from occurring.  The issue appears to be that, about one out of every 10,000 times, the whole source buffer gets corrupted.  However, the funny thing is that there's a very specific pattern to the corruption.  For example, the following buffer is supposed to contain the square roots of all numbers between 1380299 and 1380399 but instead has the following:
>>
>> [-8.29044e+16, -8.45933e+16, -8.68451e+16, -8.85339e+16, -9.07857e+16, -9.24746e+16, -9.47264e+16, -9.64152e+16, -9.8667e+16, -1.00356e+17, -1.02608e+17, -1.04297e+17, -1.06548e+17, -1.08237e+17, -1.10489e+17, -1.12178e+17, -1.1443e+17, -1.16118e+17, -1.1837e+17, -1.20059e+17, -1.22311e+17, -1.24e+17, -1.26252e+17, -1.2794e+17, -1.30192e+17, -1.31881e+17, -1.34133e+17, -1.35822e+17, -1.38074e+17, -1.39762e+17, -1.42014e+17, -1.43703e+17, -1.47794e+17, -1.51172e+17, -1.55676e+17, -1.59053e+17, -1.63557e+17, -1.66935e+17, -1.70312e+17, -1.74816e+17, -1.78194e+17, -1.82697e+17, -1.86075e+17, -1.90579e+17, -1.93956e+17, -1.9846e+17, -2.01838e+17, -2.06341e+17, -2.09719e+17, -2.14223e+17, -2.176e+17, -2.22104e+17, -2.25482e+17, -2.29985e+17, -2.33363e+17, -2.37866e+17, -2.41244e+17, -2.45748e+17, -2.49125e+17, -2.53629e+17, -2.57007e+17, -2.6151e+17, -2.64888e+17, -2.69392e+17, -2.72769e+17, -2.77273e+17, -2.80651e+17, -2.85154e+17, -2.88834e+17, -2.97841e+17, -3.04596e+17, -3.!
>   13603e+17, -3.20359e+17, -3.27114e+17, -3.36121e+17, -3.42877e+17, -3.51884e+17, -3.58639e+17, -3.67647e+17, -3.74402e+17, -3.83409e+17, -3.90165e+17, -3.99172e+17, -4.05927e+17, -4.14934e+17, -4.2169e+17, -4.30697e+17, -4.37452e+17, -4.4646e+17, -4.53215e+17, -4.62222e+17, -4.68978e+17, -4.77985e+17, -4.8474e+17, -4.93747e+17, -5.00503e+17, -5.0951e+17, -5.16265e+17, -5.25273e+17, 2.45984e-41]
>> The interesting thing is that there's a very regular pattern to these numbers.  They're always sorted like this, except that the last number is  always tiny.  Don't ask me how I thought of this, but I figured the low-order bits of the pointer to this buffer must somehow be getting corrupted, so that the numbers are being viewed and/or written out-of-frame.  Sure enough, when I shift the reading frame of this array by two bytes, I get:
>>
>> [1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9]
>>
>> To within rounding error (since the numbers are only printed to two decimal places), 1174.86 ^ 2 == 1380299 and 1149.4 ^ 2 == 1380399.  The other failures display similar patterns, where the source buffer is being read or written to two bytes out of frame.  There's even a significance to the number 2.  This is the enum value that's used to signal TaskStatus.done.  It gets passed to the atomic functions all the time, probably in an 8-bit register (AL, BL, CL, DL, etc.).  Sure enough, if I change TaskStatus.done to be 3 instead of 2, my reading frame ends up 3 bytes out of phase instead of 2.  Since the source buffer is 100 elements long, and each element is 4 bytes, it is allocated in the GC's 512-byte pool and is aligned on 512-byte boundaries.  Therefore, allowing TaskStatus.done to overwrite the low-order bits of a 32- or 64-bit register holding the pointer to the source buffer would shift the reading frame by TaskStatus.done bytes.
>>
>> Further proving that this is the issue, when I print out the address of the corrupt buffer, it's actually 2 or 3 (depending on the value of TaskStatus.done) bytes offset from a proper 4-byte float alignment.
>>
>> Does anyone know if there's already a bug report about this?  I barely know where to start trying to create a sane test case that's not dependent on some ridiculously complicated code path and thread interleavings, but I do know that this qualifies as war story debugging and will make for a great answer if I'm ever on a job interview and the interviewer wants to know what the toughest thing I ever debugged is.
>> _______________________________________________
>> dmd-internals mailing list
>> dmd-internals at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/dmd-internals
> _______________________________________________
> dmd-internals mailing list
> dmd-internals at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-internals
>