Thread overview
I think race condition exists in tango & phobos gc code
Sep 07, 2008
redsea
Sep 08, 2008
Sean Kelly
Sep 09, 2008
redsea
I'm wrong. Re: I think race condition exists in tango & phobos gc code
Sep 11, 2008
redsea
September 07, 2008
I have a program written in D that runs 24/7. I found that it blocks once or twice a week (without using any CPU), and whenever I use strace to check whether it is blocked in a system call, it continues running (strange?).

I can also resume it with kill -SIGUSR2, so I think this situation may be associated with the GC. But why strace? I checked the strace code and found that it causes SIGSTOP to be sent, and I found that SIGSTOP cannot be blocked by a signal mask.

Then I checked the library, and I think the problem may be caused by the following order of execution:

   thread A:                                              thread B:

   fullcollect
      thread_suspendAll
          suspend
                                                               thread_suspendHandler
                                                               sem_post( &suspendCount );

               ret from sem_wait( &suspendCount );
      do collect

      thread_resumeAll
               !! this signal would be lost
               pthread_kill( t.m_addr, SIGUSR2 )

                                                               sigsuspend( &sigres );

Thread B would block because SIGUSR2 is lost.

Then I checked the Phobos code, and the code is similar.

Now I'm trying to use a semaphore to do the resume, and I will check whether my program runs correctly.


Any suggestions?

September 08, 2008
redsea wrote:
> I have a program written in D that runs 24/7. I found that it blocks once or twice a week (without using any CPU), and whenever I use strace to check whether it is blocked in a system call, it continues running (strange?).
> 
> I can also resume it with kill -SIGUSR2, so I think this situation may be associated with the GC. But why strace? I checked the strace code and found that it causes SIGSTOP to be sent, and I found that SIGSTOP cannot be blocked by a signal mask.
> 
> Then I checked the library, and I think the problem may be caused by the following order of execution:
> 
>    thread A:                                              thread B:
> 
>    fullcollect
>       thread_suspendAll
>           suspend
>                                                                thread_suspendHandler
>                                                                sem_post( &suspendCount );
> 
>                ret from sem_wait( &suspendCount );
>       do collect
> 
>       thread_resumeAll
>                !! this signal would be lost
>                pthread_kill( t.m_addr, SIGUSR2 )
> 
>                                                                sigsuspend( &sigres );
> 
> Thread B would block because SIGUSR2 is lost.

SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers to tell the OS to block all signals while the handler is processing. The call to sigsuspend is supposed to manually change that for the signals requested.
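
To make the mechanism described above concrete, here is a minimal C sketch of the suspend/resume handshake. It is not the Tango or Phobos code. The function name `suspend_and_resume_once`, the helper names, and the use of SIGUSR1 as the suspend signal are assumptions made for illustration; only SIGUSR2, `sem_post`, `sem_wait`, `pthread_kill` and `sigsuspend` come from the thread. The point it tries to show is that a SIGUSR2 sent between `sem_post` and `sigsuspend` stays pending, because `sa_mask` blocks it while the handler runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>

static sem_t suspend_count;               /* posted by the suspended thread */
static volatile sig_atomic_t resumed = 0; /* set once sigsuspend() returns */
static volatile sig_atomic_t stop = 0;    /* tells the worker to exit */

/* Resume handler: its only job is to interrupt sigsuspend(). */
static void resume_handler(int sig) { (void)sig; }

/* Suspend handler (SIGUSR1 here is an assumption of this sketch). */
static void suspend_handler(int sig)
{
    sigset_t sigres;
    (void)sig;
    /* sa_mask blocks every signal while we are in here, so a SIGUSR2
       arriving after sem_post() is held pending, not dropped. */
    sem_post(&suspend_count);
    sigfillset(&sigres);
    sigdelset(&sigres, SIGUSR2);
    sigsuspend(&sigres); /* atomically unblocks SIGUSR2 and waits for it */
    resumed = 1;
}

static void *worker(void *arg)
{
    (void)arg;
    while (!stop)
        sched_yield();
    return NULL;
}

/* Plays the role of thread A. Returns 1 if the worker was resumed. */
int suspend_and_resume_once(void)
{
    struct sigaction sa;
    pthread_t t;
    int rc;

    resumed = 0;
    stop = 0;
    rc = sem_init(&suspend_count, 0, 0); assert(rc == 0);

    memset(&sa, 0, sizeof sa);
    sigfillset(&sa.sa_mask); /* block all signals inside both handlers */
    sa.sa_handler = suspend_handler;
    rc = sigaction(SIGUSR1, &sa, NULL); assert(rc == 0);
    sa.sa_handler = resume_handler;
    rc = sigaction(SIGUSR2, &sa, NULL); assert(rc == 0);

    rc = pthread_create(&t, NULL, worker, NULL); assert(rc == 0);

    rc = pthread_kill(t, SIGUSR1); assert(rc == 0);        /* suspend */
    while (sem_wait(&suspend_count) != 0 && errno == EINTR) {}
    /* "do collect" would happen here */
    rc = pthread_kill(t, SIGUSR2); assert(rc == 0);        /* resume */

    stop = 1;
    rc = pthread_join(t, NULL); assert(rc == 0);
    sem_destroy(&suspend_count);
    return resumed;
}
```

Calling `suspend_and_resume_once()` from `main` (built with `-pthread` on Linux) should return 1 whether the resume signal arrives before or after the worker reaches `sigsuspend`. If the handlers were installed with an empty `sa_mask`, the early SIGUSR2 could be handled before `sigsuspend` and the worker could then wait forever.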

> Then I checked the Phobos code, and the code is similar.
> 
> Now I'm trying to use a semaphore to do the resume, and I will check whether my program runs correctly.

Thanks, please do.  If it really is a problem I'd be happy to change it.


Sean
September 09, 2008
Sean Kelly Wrote:

> SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers to tell the OS to block all signals while the handler is processing. The call to sigsuspend is supposed to manually change that for the signals requested.
> 
> > Then I checked the Phobos code, and the code is similar.
> > 
> > Now I'm trying to use a semaphore to do the resume, and I will check whether my program runs correctly.
> 
> Thanks, please do.  If it really is a problem I'd be happy to change it.


I wrote a small program that calls kill and sigsuspend in the order I mentioned before, and the signal is not lost. So the real cause must lie deeper.
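
The test program itself was not posted, so the following is only a guess at what such a check might look like in C. The function name `early_signal_is_delivered` and the handler name are hypothetical. It blocks SIGUSR2 (standing in for the mask that is in effect inside the suspend handler), sends the signal before calling `sigsuspend`, and then checks that the handler still runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t resume_seen = 0;

static void on_resume(int sig) { (void)sig; resume_seen = 1; }

/* Returns 1 if a SIGUSR2 sent before sigsuspend() is still delivered. */
int early_signal_is_delivered(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_resume;
    sigfillset(&sa.sa_mask);
    rc = sigaction(SIGUSR2, &sa, &old_sa); assert(rc == 0);

    /* Block SIGUSR2, as it would be blocked inside the suspend handler. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    rc = sigprocmask(SIG_BLOCK, &block, &old_mask); assert(rc == 0);

    resume_seen = 0;
    rc = raise(SIGUSR2); assert(rc == 0); /* sent early: stays pending */

    /* Wait with everything blocked except SIGUSR2. The pending signal
       is delivered at once, the handler runs, and sigsuspend() returns. */
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIGUSR2);
    sigsuspend(&wait_mask);

    /* Restore the previous mask and handler. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR2, &old_sa, NULL);
    return resume_seen;
}
```

In a single-threaded process this should return 1 instead of hanging in `sigsuspend`, which matches the observation that the signal is not lost.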

The version that uses a semaphore is finished, but I have to wait for the administrator to test and upload the program.

I will do more checking.

Thanks for your opinions.

September 11, 2008
Sean Kelly Wrote:

> 
> SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers to tell the OS to block all signals while the handler is processing. The call to sigsuspend is supposed to manually change that for the signals requested.

I'm wrong.

In fact the program has two components, a client and a server, and both are multithreaded. It was reported to me that both components had the same problem.

After checking, I found that the client version is correct and runs stably, so the bug must have nothing to do with Tango.

Sorry !