May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | I went a slightly different route and tried to reduce the problem to as small a test case as possible, like I would normally do for a compiler bug. So far I've managed to reduce it to ~560 lines. I've discovered this one's more unstable (i.e. the results change a lot more in response to slight perturbations) than I thought. Just changing the layout of the Task struct (deleting member variables that are no longer used anywhere) makes it go from unit test failures to access violations. Adding or removing try/catch blocks or empty destructors in some places can completely prevent the bug from manifesting. On Linux, if I perturb things slightly by changing the layout of Task, I get exceptions thrown from core.sync.

This looks like some kind of memory/stack corruption bug but due to its nondeterminism (only a few thread interleavings seem to take the proper codepath and I'm not sure which ones) and its very indirect manifestation (memory corruption; the low order bit overwriting thing was, I think, just a manifestation of a deeper problem), I am somewhat at a loss for how to debug it. I've scrutinized the concurrency related aspects and still can't find any bugs there. However, I can't prove it's not a concurrency bug since running in single threaded mode prevents certain code paths from being taken. Unless I get some advice that changes things, I think my next move is to compare the disassemblies for cases that work to those for cases that don't.

On Tue, May 3, 2011 at 1:00 PM, Walter Bright <walter at digitalmars.com> wrote:
> On 5/3/2011 5:43 AM, David Simcha wrote:
>
>>> Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.
>>
>> Been trying to do that, but I think there are multiple places where this is happening and the asserts are affecting codegen or timings just enough to prevent some.
>
> You can also do the simple:
>
> if (ptr == bad value) *((char*)0)=0;
>
> which doesn't perturb timings or code gen much. I use these often. The debugger will tell you which one tripped.
>
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
|
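A minimal sketch of the two checking styles discussed above, written in C to match the syntax of Walter's one-liner. Everything named `demo_*` and the `TRAP_IF` macro are hypothetical names for illustration, and "bad value" is assumed here to mean a pointer whose low-order bit is set. The `volatile` qualifier is an addition that is not in the original one-liner; it is there because a store through a null pointer is undefined behavior in C and an optimizer might otherwise drop it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type, for illustration only. */
typedef struct demo_node {
    struct demo_node *next;
    int value;
} demo_node;

/* Assumed definition of a "bad value": a pointer with its low-order bit set. */
int demo_ptr_is_bad(const void *p) {
    return ((uintptr_t)p & 1u) != 0;
}

/* The one-liner from the thread as a macro: a compare followed by a store
 * through a null pointer, which faults on typical desktop platforms. */
#define TRAP_IF(cond) do { if (cond) *(volatile char *)0 = 0; } while (0)

/* Style 1: an assert before each link is followed. */
size_t demo_walk_assert(const demo_node *n) {
    size_t count = 0;
    for (; n != NULL; n = n->next) {
        assert(!demo_ptr_is_bad(n->next));
        ++count;
    }
    return count;
}

/* Style 2: the hard trap before each link is followed. */
size_t demo_walk_trap(const demo_node *n) {
    size_t count = 0;
    for (; n != NULL; n = n->next) {
        TRAP_IF(demo_ptr_is_bad(n->next));
        ++count;
    }
    return count;
}
```

Both walkers return the node count when every link is sane. When a check fires, the trap version stops at the faulting store, so a debugger attached to the process reports the source line of whichever `TRAP_IF` tripped.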
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Simcha | Does it work as a single threaded program?
On 5/4/2011 6:51 AM, David Simcha wrote:
> I went a slightly different route and tried to reduce the problem to as small a test case as possible, like I would normally do for a compiler bug. So far I've managed to reduce it to ~560 lines. I've discovered this one's more unstable (i.e. the results change a lot more in response to slight perturbations) than I thought. Just changing the layout of the Task struct (deleting member variables that are no longer used anywhere) makes it go from unit test failures to access violations. Adding or removing try/catch blocks or empty destructors in some places can completely prevent the bug from manifesting. On Linux, if I perturb things slightly by changing the layout of Task, I get exceptions thrown from core.sync.
>
> This looks like some kind of memory/stack corruption bug but due to its nondeterminism (only a few thread interleavings seem to take the proper codepath and I'm not sure which ones) and its very indirect manifestation (memory corruption; the low order bit overwriting thing was, I think, just a manifestation of a deeper problem), I am somewhat at a loss for how to debug it. I've scrutinized the concurrency related aspects and still can't find any bugs there. However, I can't prove it's not a concurrency bug since running in single threaded mode prevents certain code paths from being taken. Unless I get some advice that changes things, I think my next move is to compare the disassemblies for cases that work to those for cases that don't.
>
|
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Yes, it works as a single threaded program, but there are a lot of code paths that are never taken unless a worker thread finishes a job before the submitter thread needs the result (which obviously can't happen in single-threaded mode). Therefore, this does not prove that the issue is a concurrency bug.

On Wed, May 4, 2011 at 1:37 PM, Walter Bright <walter at digitalmars.com> wrote:
> Does it work as a single threaded program?
|
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Simcha | I guess I'm asking if there is a way to execute all those paths in a single threaded manner, in order to flush out any suspected code gen bugs.

On 5/4/2011 10:39 AM, David Simcha wrote:
> Yes, it works as a single threaded program, but there are a lot of code paths that are never taken unless a worker thread finishes a job before the submitter thread needs the result (which obviously can't happen in single-threaded mode). Therefore, this does not prove that the issue is a concurrency bug.
|
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Probably not. The code includes things like waiting on condition variables and expecting to be resumed by other threads.

On Wed, May 4, 2011 at 2:13 PM, Walter Bright <walter at digitalmars.com> wrote:
> I guess I'm asking if there is a way to execute all those paths in a single threaded manner, in order to flush out any suspected code gen bugs.
|
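A minimal sketch, in C with POSIX threads, of why some branches depend on which thread gets there first. This is not std.parallelism's implementation; `demo_task`, `demo_worker`, `demo_task_force` and the path enum are hypothetical names, and the "job" is just doubling an integer. The submitter takes one branch if the worker has already finished and another if it has to block on the condition variable until the worker resumes it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical task record, for illustration only. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  finished;
    bool            done;
    int             input;
    int             result;
} demo_task;

typedef enum { DEMO_ALREADY_DONE, DEMO_HAD_TO_WAIT } demo_path;

void demo_task_init(demo_task *t, int input) {
    int rc = pthread_mutex_init(&t->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&t->finished, NULL);
    assert(rc == 0);
    (void)rc;
    t->done = false;
    t->input = input;
    t->result = 0;
}

/* Worker thread: compute, publish the result under the lock, wake a waiter. */
void *demo_worker(void *arg) {
    demo_task *t = arg;
    int r = t->input * 2;
    pthread_mutex_lock(&t->lock);
    t->result = r;
    t->done = true;
    pthread_cond_signal(&t->finished);
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Submitter: return the result and report which branch was taken. */
int demo_task_force(demo_task *t, demo_path *path) {
    pthread_mutex_lock(&t->lock);
    *path = t->done ? DEMO_ALREADY_DONE : DEMO_HAD_TO_WAIT;
    while (!t->done)                      /* loop guards against spurious wakeups */
        pthread_cond_wait(&t->finished, &t->lock);
    int r = t->result;
    pthread_mutex_unlock(&t->lock);
    return r;
}
```

Joining the worker before calling `demo_task_force` forces the `DEMO_ALREADY_DONE` branch. Without the join, the branch taken depends on the interleaving, which is what makes such paths hard to exercise on demand.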
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Simcha | BTW, just to clarify, I am going to keep working on it, it's just that progress is slow because this is such a nightmarish bug and I'd like to not hold up the release. Therefore, I'm asking for help often to get this thing fixed sooner rather than later.

On Wed, May 4, 2011 at 2:44 PM, David Simcha <dsimcha at gmail.com> wrote:
> Probably not. The code includes things like waiting on condition variables and expecting to be resumed by other threads.
|
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Simcha | Is this with or without -release set? It shouldn't matter, but I'm curious if you're performing atomic ops on unaligned data. The contracts in core.atomic should check for this.
On May 4, 2011, at 6:51 AM, David Simcha wrote:
> I went a slightly different route and tried to reduce the problem to as small a test case as possible, like I would normally do for a compiler bug. So far I've managed to reduce it to ~560 lines. I've discovered this one's more unstable (i.e. the results change a lot more in response to slight perturbations) than I thought. Just changing the layout of the Task struct (deleting member variables that are no longer used anywhere) makes it go from unit test failures to access violations. Adding or removing try/catch blocks or empty destructors in some places can completely prevent the bug from manifesting. On Linux, if I perturb things slightly by changing the layout of Task, I get exceptions thrown from core.sync.
>
> This looks like some kind of memory/stack corruption bug but due to its nondeterminism (only a few thread interleavings seem to take the proper codepath and I'm not sure which ones) and its very indirect manifestation (memory corruption; the low order bit overwriting thing was, I think, just a manifestation of a deeper problem), I am somewhat at a loss for how to debug it. I've scrutinized the concurrency related aspects and still can't find any bugs there. However, I can't prove it's not a concurrency bug since running in single threaded mode prevents certain code paths from being taken. Unless I get some advice that changes things, I think my next move is to compare the disassemblies for cases that work to those for cases that don't.
>
> On Tue, May 3, 2011 at 1:00 PM, Walter Bright <walter at digitalmars.com> wrote:
>
>> On 5/3/2011 5:43 AM, David Simcha wrote:
>>
>>>> Add asserts on that pointer value going out of range, and keep working backwards until the point where the value goes wrong is discovered.
>>>
>>> Been trying to do that, but I think there are multiple places where this is happening and the asserts are affecting codegen or timings just enough to prevent some.
>>
>> You can also do the simple:
>>
>> if (ptr == bad value) *((char*)0)=0;
>>
>> which doesn't perturb timings or code gen much. I use these often. The debugger will tell you which one tripped.
|
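A minimal sketch, in C, of the kind of alignment check Sean is asking about: whether the address handed to an atomic operation is a multiple of the operand size. `demo_is_aligned` is a hypothetical helper and is not the contract in core.atomic; it assumes the operand size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if p is a multiple of size. size must be a nonzero power of
 * two, which holds for the usual 1-, 2-, 4- and 8-byte atomic operands. */
bool demo_is_aligned(const void *p, size_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);
    return ((uintptr_t)p & (size - 1)) == 0;
}
```

With `size` equal to 1 the mask is zero, so every address passes: single-byte operands cannot be misaligned.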
May 04, 2011 [phobos] std.parallelism's unit tests randomly hang on win32 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sean Kelly | Both. It shouldn't matter anyhow b/c I'm only performing atomic ops on bytes.

On Wed, May 4, 2011 at 4:28 PM, Sean Kelly <sean at invisibleduck.org> wrote:
> Is this with or without -release set? It shouldn't matter, but I'm curious if you're performing atomic ops on unaligned data. The contracts in core.atomic should check for this.
|
Copyright © 1999-2021 by the D Language Foundation