September 07, 2008
I think race condition exists in tango & phobos gc code
I have a program written in D that runs 24/7. I found that it blocks once or twice a week (without using any CPU). Whenever I use strace to check whether it is blocked in a system call, it continues running (strange?).

I can also resume it with kill -SIGUSR2, so I think this situation may be associated with the GC. But why strace? I checked the strace code and found that it causes SIGSTOP to be sent, and SIGSTOP cannot be blocked by a signal mask.

Then I checked the library, and I think the problem may be caused by the following execution order:
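
The point about SIGSTOP can be checked directly. The following is a minimal C sketch (the function name sigstop_is_blockable is made up for illustration, and SIGUSR1 is used only as a control signal): it asks for SIGSTOP to be blocked and then reads the mask back. POSIX says such a request is silently ignored, so on Linux the function is expected to return 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if SIGSTOP is in the blocked mask after
   we ask for it to be blocked, 0 if the request was ignored, -1 on error. */
int sigstop_is_blockable(void)
{
    sigset_t want, now;

    sigemptyset(&want);
    sigaddset(&want, SIGSTOP);
    sigaddset(&want, SIGUSR1);   /* control: an ordinary signal that can be blocked */

    if (sigprocmask(SIG_BLOCK, &want, NULL) != 0)
        return -1;
    if (sigprocmask(SIG_BLOCK, NULL, &now) != 0)   /* NULL set: only query the mask */
        return -1;

    /* The ordinary signal really is blocked now... */
    assert(sigismember(&now, SIGUSR1) == 1);
    /* ...while SIGSTOP is reported as the system left it. */
    return sigismember(&now, SIGSTOP);
}
```

Calling sigstop_is_blockable() from a small main and printing the result is enough to try it out.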

  thread A:                                              thread B:     
  
  fullcollect 
     thread_suspendAll
         suspend                                 
                                                              thread_suspendHandler
                                                              sem_post( &suspendCount );

              ret from sem_wait( &suspendCount );   
     do collect
     
     thread_resumeAll
              !! this signal would be lost
              pthread_kill( t.m_addr, SIGUSR2 )
                                                             
                                                              sigsuspend( &sigres );         

Thread B would block because SIGUSR2 is lost.

Then I checked the Phobos code, and the code is similar.

Now I am trying to use a semaphore for the resume, and I will check whether my program runs correctly.


Any suggestions?
September 08, 2008
Re: I think race condition exists in tango & phobos gc code
redsea wrote:
> I have a program written in D that runs 24/7. I found that it blocks once or twice a week (without using any CPU). Whenever I use strace to check whether it is blocked in a system call, it continues running (strange?).
> 
> I can also resume it with kill -SIGUSR2, so I think this situation may be associated with the GC. But why strace? I checked the strace code and found that it causes SIGSTOP to be sent, and SIGSTOP cannot be blocked by a signal mask.
> 
> Then I checked the library, and I think the problem may be caused by the following execution order:
> 
>    thread A:                                              thread B:     
>    
>    fullcollect 
>       thread_suspendAll
>           suspend                                 
>                                                                thread_suspendHandler
>                                                                sem_post( &suspendCount );
> 
>                ret from sem_wait( &suspendCount );   
>       do collect
>       
>       thread_resumeAll
>                !! this signal would be lost
>                pthread_kill( t.m_addr, SIGUSR2 )
>                                                               
>                                                                sigsuspend( &sigres );         
> 
> Thread B would block because SIGUSR2 is lost.

SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers 
to tell the OS to block all signals while the handler is processing. 
The call to sigsuspend is supposed to manually change that for the 
signals requested.
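
To illustrate the mechanism described above, here is a minimal single-threaded C sketch. It is not Tango's code: the function names are hypothetical, SIGUSR1 as the suspend signal is an assumption (the thread only names SIGUSR2 for resume), and raise() inside the handler stands in for thread A's pthread_kill arriving before thread B reaches sigsuspend. Because sa_mask blocks every signal while the suspend handler runs, the early SIGUSR2 stays pending until sigsuspend unblocks it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t resumed = 0;

/* Resume handler: only records that the resume signal was delivered. */
static void resume_handler(int sig) { (void)sig; resumed = 1; }

/* Suspend handler: runs with all signals blocked because of sa_mask. */
static void suspend_handler(int sig)
{
    sigset_t wait_mask;
    (void)sig;

    /* Worst case from the thread: the resume signal is sent BEFORE we
       reach sigsuspend. It is blocked here, so it is queued, not lost. */
    raise(SIGUSR2);
    assert(resumed == 0);            /* not delivered yet */

    /* Wait with everything blocked except the resume signal. */
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIGUSR2);
    while (!resumed)
        sigsuspend(&wait_mask);      /* atomically unblocks SIGUSR2 and waits */
}

/* Hypothetical driver: returns 1 if the suspended code was resumed. */
int suspend_then_resume(void)
{
    struct sigaction sa;

    resumed = 0;

    memset(&sa, 0, sizeof sa);
    sigfillset(&sa.sa_mask);         /* block all signals inside the handlers */

    sa.sa_handler = resume_handler;
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;

    sa.sa_handler = suspend_handler;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    raise(SIGUSR1);                  /* "suspend": returns once the handler is done */
    return resumed;
}
```

If sa_mask were left empty instead, SIGUSR2 could be delivered before sigsuspend is entered, and the wait would then depend on a second signal arriving.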

> Then I checked the Phobos code, and the code is similar.
> 
> Now I am trying to use a semaphore for the resume, and I will check whether my program runs correctly.

Thanks, please do.  If it really is a problem I'd be happy to change it.


Sean
September 09, 2008
Re: I think race condition exists in tango & phobos gc code
Sean Kelly Wrote:

> SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers 
> to tell the OS to block all signals while the handler is processing. 
> The call to sigsuspend is supposed to manually change that for the 
> signals requested.
> 
> > Then I checked the Phobos code, and the code is similar.
> > 
> > Now I am trying to use a semaphore for the resume, and I will check whether my program runs correctly.
> 
> Thanks, please do.  If it really is a problem I'd be happy to change it.


I wrote a small program that calls kill and sigsuspend in the order I mentioned before, and the signal is not lost. So the real cause must be hidden deeper.

The version that uses a semaphore is finished, but I have to wait for the administrator to test and deploy the program.

I will do more checking.

Thanks for your opinions.
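
The test program itself was not posted. A check of that kind might look like the following C sketch, where all names are hypothetical and a single process signals itself with kill instead of using two threads. It blocks SIGUSR2 with an explicit mask, sends the signal first, confirms with sigpending that it is queued, and only then calls sigsuspend.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr2 = 0;

static void on_usr2(int sig) { (void)sig; got_usr2 = 1; }

/* Hypothetical check: returns 1 if a SIGUSR2 sent before sigsuspend is
   still delivered by sigsuspend, 0 if not, -1 on a setup error. */
int kill_before_sigsuspend(void)
{
    struct sigaction sa;
    sigset_t block, old, pending, wait_mask;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr2;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;

    /* Block SIGUSR2, as the suspend handler's mask would. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    got_usr2 = 0;
    kill(getpid(), SIGUSR2);         /* "resume" is sent before the wait starts */

    /* The signal is pending, not lost, and the handler has not run yet. */
    sigpending(&pending);
    assert(sigismember(&pending, SIGUSR2) == 1);
    assert(got_usr2 == 0);

    /* Wait with SIGUSR2 unblocked; the pending signal is delivered at once. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR2);
    sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the original mask */
    return got_usr2;
}
```

The order matters only while the signal is blocked at the time it is sent; that is the condition the sa_mask setting is meant to guarantee.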
September 11, 2008
I'm wrong. Re: I think race condition exists in tango & phobos gc code
Sean Kelly Wrote:

> 
> SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers 
> to tell the OS to block all signals while the handler is processing. 
> The call to sigsuspend is supposed to manually change that for the 
> signals requested.

I'm wrong.

In fact the program has two components, a client and a server, and both are multithreaded. It was reported to me that both components have the same problem.

After checking, I found that the client version is correct and runs stably, so the bug must have nothing to do with Tango.

Sorry!