Thread overview
[phobos] std.parallelism unit test hang
Jun 05, 2011
Brad Roberts
Jun 05, 2011
David Simcha
Jun 05, 2011
Brad Roberts
Jun 05, 2011
David Simcha
Jun 05, 2011
Brad Roberts
Jun 05, 2011
Robert Clipsham
Jun 05, 2011
Brad Roberts
June 04, 2011
Could be the same as one of the previous threads.  Sorry for being a little lazy (it's that kinda weekend):

I just noticed that the phobos linux 64/32 build had been running for almost 10 hours (I guess I need to add some timeout logic still).  I attached gdb to get a stack trace:

#0  0x55573425 in __kernel_vsyscall ()
#1  0x555b0469 in __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/lowlevellock.S:142
#2  0x555ab659 in _L_lock_835 () from /lib32/libpthread.so.0
#3  0x555ab4eb in __pthread_mutex_lock (mutex=0x8139264) at pthread_mutex_lock.c:82
#4  0x080f4065 in _d_criticalenter ()
#5  0x080d4ea3 in std.parallelism.__unittest2() ()
#6  0x080ed3a5 in std.parallelism.__T19ParallelForeachTaskTS3std5range13__T4iotaTiTiZ4iota6ResultTDFKiZiZ.ParallelForeachTask.impl() ()
#7  0x080d3253 in std.parallelism.AbstractTask.job() ()
#8  0x080d3523 in std.parallelism.TaskPool.tryDeleteExecute() ()
#9  0x080ed1e1 in std.parallelism.__T19ParallelForeachImplTS3std5range13__T4iotaTiTiZ4iota6ResultTDFKiZiZ.ParallelForeachImpl.__T17ResubmittingTasksZ.submitAndExecute() ()
#10 0x080e44a2 in std.parallelism.__T15ParallelForeachTS3std5range13__T4iotaTiTiZ4iota6ResultZ.ParallelForeach.opApply() ()
#11 0x080d49db in std.parallelism.__unittest2() ()
#12 0x080edba9 in std.parallelism.__modtest() ()
#13 0x080ff9dc in core.runtime.runModuleUnitTests() ()
#14 0x080efca2 in object.ModuleInfo.opApply() ()
#15 0x080ff8f7 in runModuleUnitTests ()
#16 0x080f4980 in rt.dmain2.main() ()
#17 0x080f45e8 in rt.dmain2.main() ()
#18 0x080f4594 in main ()

Without having looked at the code, the stack suggests a recursive call resulting in a deadlock.  Notice that frames 11 and 5 are in the same function.

Have fun with it.

Later,
Brad
June 04, 2011
Can you please tell me what hardware the box this failed on is running? Also, if by some chance you still have GDB attached, registers and disassembly would be tremendously useful.

Also, I don't think it's a recursive call leading to a deadlock.  The trace makes perfect sense to me: unittest2 calls into opApply, which calls into the foreach body code of unittest2.  There's something very strange going on with the test where I append to an Appender from multiple threads, manually synchronizing.  It looks like this is related.  I wonder if there's some subtle reason why an Appender can't be appended to from multiple threads even if you manually synchronize, or if synchronizing on a global mutex, i.e. synchronized { /* Do stuff */ }, is subtly broken.
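
Roughly, the pattern in that test is the following; this is a paraphrase of its shape rather than the exact unittest code, and the range bounds and variable names here are made up:

import std.array : appender;
import std.parallelism : parallel;
import std.range : iota;

void main()
{
    auto app = appender!(int[])();

    // Parallel foreach over an iota range, matching the template
    // instantiations in frames #6, #9 and #10 of the trace.
    foreach (i; parallel(iota(0, 1000)))
    {
        // A bare synchronized statement takes a global runtime critical
        // section via _d_criticalenter, which is where frame #4 is blocked.
        synchronized
        {
            app.put(i);
        }
    }
}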

June 04, 2011
The process is long gone, sorry.

The box has one AMD Athlon 64 X2 dual core processor, 1 GHz, 512 KB cache.

The recursion/deadlock was purely a guess based on the stack.  Not shocking it's a bad guess.

June 04, 2011
On 6/4/2011 11:50 PM, Brad Roberts wrote:
> The process is long gone, sorry.
>
> The box has one amd athlon 64 x2 dual core processor, 1ghz, 512kb cache.

I'm pretty sure Athlon 64 X2s don't exist at this slow a clock speed, unless you're underclocking.  Also funny because, except for clock speed, this is my hardware as well, and I can't reproduce this bug after executing the unittests in a loop ~13,000 times.
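
(For reference, a reproduction harness can be as simple as rerunning the unittest binary in a loop; here is a sketch, with the executable name as a placeholder for however the std.parallelism unittests get built.  A hang shows up as one iteration simply never returning.)

import std.process : execute;
import std.stdio : writeln;

void main()
{
    foreach (run; 0 .. 13_000)
    {
        // Placeholder name; point this at the actual unittest executable.
        auto r = execute(["./test_parallelism"]);
        if (r.status != 0)
        {
            writeln("run ", run, " failed:\n", r.output);
            break;
        }
    }
}
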
June 04, 2011
On 6/4/2011 8:55 PM, David Simcha wrote:
> On 6/4/2011 11:50 PM, Brad Roberts wrote:
>> The process is long gone, sorry.
>>
>> The box has one amd athlon 64 x2 dual core processor, 1ghz, 512kb cache.
> ????? I'm pretty sure Athlon 64 X2's don't exist at this slow a clock speed, unless you're underclocking.  Also funny because, except for clock speed, this is my hardware as well and I can't reproduce this bug after executing the unittests in a loop ~13,000 times.

Hrm, looking at the kernel boot logs, they claim:

[    0.000000] Detected 2004.383 MHz processor.
[    0.010011] Calibrating delay loop (skipped), value calculated using timer frequency.. 4008.76 BogoMIPS (lpj=20043830)

From /proc/cpuinfo (where I got the 1 GHz figure), it looks like it halved the two values, which seems wrong:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 75
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping        : 2
cpu MHz         : 1000.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 2004.38
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 75
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping        : 2
cpu MHz         : 1000.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 2004.38
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
June 05, 2011
On 5 June 2011 05:04, Brad Roberts <braddr at puremagic.com> wrote:

> On 6/4/2011 8:55 PM, David Simcha wrote:
> > On 6/4/2011 11:50 PM, Brad Roberts wrote:
> >> The process is long gone, sorry.
> >>
> >> The box has one amd athlon 64 x2 dual core processor, 1ghz, 512kb cache.
> > ????? I'm pretty sure Athlon 64 X2's don't exist at this slow a clock
> > speed, unless you're underclocking.  Also funny
> > because, except for clock speed, this is my hardware as well and I can't
> > reproduce this bug after executing the
> > unittests in a loop ~13,000 times.
>
> Hrm, looking at the kernel boot logs, it claims:
>
> [    0.000000] Detected 2004.383 MHz processor.
> [    0.010011] Calibrating delay loop (skipped), value calculated using
> timer frequency.. 4008.76 BogoMIPS (lpj=20043830)
>
> From /proc/cpuinfo (where I got the 1ghz figure).  Looks like it halved the two values, which seems wrong:
>

Do you have cpufreq installed? Most distros have it by default and underclock your processor when it's not active (whether and when it does so depends on which policy is enabled).
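
A quick way to check is to read the sysfs cpufreq files; a throwaway D program along these lines will do (assuming the usual Linux sysfs layout, and arbitrarily looking only at cpu0; if the files are missing, cpufreq probably isn't active):

import std.file : exists, readText;
import std.stdio : writeln;
import std.string : strip;

void main()
{
    enum base = "/sys/devices/system/cpu/cpu0/cpufreq/";

    // Governor ("ondemand", "powersave", ...) plus current and maximum
    // frequency in kHz, as reported by the kernel.
    foreach (name; ["scaling_governor", "scaling_cur_freq", "cpuinfo_max_freq"])
    {
        auto path = base ~ name;
        writeln(name, ": ", exists(path) ? readText(path).strip() : "not present");
    }
}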

-- 
Robert
http://octarineparrot.com/
June 05, 2011
On 6/5/2011 12:21 PM, Robert Clipsham wrote:
> Do you have cpufreq installed? Most distros have it by default and under-clock your processor when it's not active (if/when it does it depends on which policy is enabled).

Nope.