September 29, 2009
== Quote from Sean Kelly (sean@invisibleduck.org)'s article
>
> One thing I'm not entirely sure about is whether the signal handler will always
> have a valid, C-style call stack tracing back into user code.  These errors are
> triggered by hardware, and I really don't know what kind of tricks are common
> at that level of OS code.  longjmp() doesn't have this problem because it doesn't
> care about the call stack--it just swaps some registers and executes a JMP.  I
> don't suppose anyone here knows more about the feasibility of throwing
> exceptions from signal handlers at all?  I'll ask around some OS groups and
> see what people say.

I was right, it is illegal to throw an exception from a signal handler.  And worse,
it's illegal to call malloc from a signal handler, so you can't safely create an
exception object anyway.  Heck, I'm not sure it's even safe to perform IO from
a signal handler, so tracing directly from within the handler won't even work
reliably.  In short, while I'm totally fine with people using this in their own
code, it's too unreliable to make an "official" solution by adding it to Druntime.
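
For anyone who wants to try this in their own code, here is a minimal sketch of a handler that stays within those limits. It assumes a POSIX system; the names `crashMsg`, `onSegv` and `installSegvHandler` are made up for illustration. It calls only write() and _exit(), which POSIX lists as async-signal-safe, and it neither allocates nor throws.

```d
import core.sys.posix.signal;
import core.sys.posix.unistd : write, _exit, STDERR_FILENO;

// The message is a static literal, so nothing is allocated at signal time.
private immutable string crashMsg = "fatal: segmentation fault\n";

// Hypothetical handler: restricts itself to async-signal-safe calls.
extern (C) void onSegv(int sig) nothrow @nogc
{
    // write(2) and _exit(2) are on the POSIX async-signal-safe list;
    // malloc, buffered I/O and throwing are not.
    write(STDERR_FILENO, crashMsg.ptr, crashMsg.length);
    _exit(1);
}

/// Installs the handler for SIGSEGV; returns true on success.
bool installSegvHandler()
{
    sigaction_t sa;
    sa.sa_handler = &onSegv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGSEGV, &sa, null) == 0;
}
```

Calling `installSegvHandler()` once at startup is enough. The obvious limitation is that this prints a fixed message rather than a backtrace.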
September 29, 2009
Sean Kelly wrote:
> == Quote from Jeremie Pelletier (jeremiep@gmail.com)'s article
>> Andrei Alexandrescu wrote:
>>> Jeremie Pelletier wrote:
>>>>> Is this Linux specific? what about other *nix systems, like BSD and
>>>>> solaris?
>>>> Signal handlers are standard on most *nix platforms since they're part
>>>> of the POSIX C standard libraries; maybe some platforms will require
>>>> special handling but nothing impossible to do.
>>> Let me write a message on behalf of Sean Kelly. He wrote that to Walter
>>> and myself this morning, then I suggested him to post it but probably he
>>> is off email for a short while. Hopefully the community will find a
>>> solution to the issue he's raising. Let me post this:
>>>
>>> ===================
>>> Sean Kelly wrote:
>>>
>>> There's one minor problem with his code.  It's not safe to throw an
>>> exception from a signal handler.  Here's a quote from the POSIX spec at
>>> opengroup.org:
>>>
>>> "In order to prevent errors arising from interrupting non-reentrant
>>> function calls, applications should protect calls to these functions
>>> either by blocking the appropriate signals or through the use of some
>>> programmatic semaphore (see semget() , sem_init() , sem_open() , and so
>>> on). Note in particular that even the "safe" functions may modify errno;
>>> the signal-catching function, if not executing as an independent thread,
>>> may want to save and restore its value. Naturally, the same principles
>>> apply to the reentrancy of application routines and asynchronous data
>>> access. Note that longjmp() and siglongjmp() are not in the list of
>>> reentrant functions. This is because the code executing after longjmp()
>>> and siglongjmp() can call any unsafe functions with the same danger as
>>> calling those unsafe functions directly from the signal handler.
>>> Applications that use longjmp() and siglongjmp() from within signal
>>> handlers require rigorous protection in order to be portable."
>>>
>>> If this were an acceptable approach it would have been in druntime ages
>>> ago :-)
>>> ===================
>> Yes but the segfault signal handler is not made to design code that can
>> live with these exceptions, it's just a feature to allow segfaults to be
>> sent to the crash handler to get a backtrace dump. Even on Windows, while
>> you can recover from access violations, it's generally a bad idea to
>> allow for bugs to be turned into features.
> 
> I don't think it's fair to compare Windows to Unix here because, as far as
> I know, Windows (ie. Win32, etc) was built with exceptions in mind (thanks to
> SEH), while Unix was not.  So while the Windows kernel may theoretically be fine
> with an exception being thrown from within kernel code, this isn't true of Unix.
> 
> It's true that as long as only Errors are thrown (and thus that the app intends
> to terminate), things aren't as bad as they could be.  Worst case, some mutex
> in libc is left locked or in some weird state and code executed during stack
> unwinding or when trying to report the error causes the app to hang instead
> of terminate.  And this risk is somewhat mitigated because I'd expect most
> of these errors to occur within user code anyway.
> 
> One thing I'm not entirely sure about is whether the signal handler will always
> have a valid, C-style call stack tracing back into user code.  These errors are
> triggered by hardware, and I really don't know what kind of tricks are common
> at that level of OS code.  longjmp() doesn't have this problem because it doesn't
> care about the call stack--it just swaps some registers and executes a JMP.  I
> don't suppose anyone here knows more about the feasibility of throwing
> exceptions from signal handlers at all?  I'll ask around some OS groups and
> see what people say.

I haven't had any problems so far; the stack trace generated was always valid and similar to what gdb would output. But I agree that trying to recover from these exceptions is a *bad* idea in so many ways.

From what I know, the kernel alters the stack frame of the signal handler to make us believe we called it ourselves. Returning from the signal handler therefore jumps to the routine from which the signal was originally raised, without the kernel being aware of it.

This is a bit different from how SEH is handled, but has a lot in common with it:

From the research I did about SEH internals, it's just built on top of interrupt handlers. The hardware raises an exception (access violation, etc.) and jumps into a kernel handler for the corresponding interrupt. There it looks up, at the base of the stack, a pointer to a struct containing a handler function and a handler table, which is set and restored by try blocks, and calls the exception handler (_d_framehandler in our case) with the appropriate parameters. From there the kernel decides what to do based on the return code of the framehandler.

The signal handler model is therefore quite acceptable to build exception handling on top of. We just may want to also manually generate a core dump before throwing the exception to support postmortem debugging.
September 29, 2009
Sean Kelly wrote:
> == Quote from Sean Kelly (sean@invisibleduck.org)'s article
>> One thing I'm not entirely sure about is whether the signal handler will always
>> have a valid, C-style call stack tracing back into user code.  These errors are
>> triggered by hardware, and I really don't know what kind of tricks are common
>> at that level of OS code.  longjmp() doesn't have this problem because it doesn't
>> care about the call stack--it just swaps some registers and executes a JMP.  I
>> don't suppose anyone here knows more about the feasibility of throwing
>> exceptions from signal handlers at all?  I'll ask around some OS groups and
>> see what people say.
> 
> I was right, it is illegal to throw an exception from a signal handler.  And worse,
> it's illegal to call malloc from a signal handler, so you can't safely create an
> exception object anyway.  Heck, I'm not sure it's even safe to perform IO from
> a signal handler, so tracing directly from within the handler won't even work
> reliably.  In short, while I'm totally fine with people using this in their own
> code, it's too unreliable to make an "official" solution by adding it to Druntime.

Weird, it works just fine for me. Maybe it's because the exception is always caught in the thread's entry point; I never tried to let such an exception unwind past the entry point. I haven't tried malloc or any I/O either.

There still should be a way to grab the backtrace and context data from the hidden ucontext_* parameter and do something with it after returning from the signal handler.

The whole idea of a crash handler is to limit the number of times you need to do postmortem debugging after a crash, or launch the process again within the debugger.
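
A rough sketch of that "record now, process later" idea, assuming a POSIX system. `CrashRecord`, `onSignal`, `installRecorder` and `takeRecord` are hypothetical names. The handler only copies fields from the siginfo_t into preallocated storage; the third argument is the hidden ucontext pointer, which this sketch leaves untouched because its layout is platform-specific.

```d
import core.atomic : atomicLoad, atomicStore;
import core.sys.posix.signal;

// Preallocated storage the handler fills in; nothing is allocated at signal time.
struct CrashRecord
{
    int signo; // signal number reported by the kernel
    int code;  // si_code: why the signal was generated
}

__gshared CrashRecord lastRecord;
shared bool recordReady;

// Hypothetical SA_SIGINFO handler: copies data and returns, nothing else.
extern (C) void onSignal(int sig, siginfo_t* info, void* context) nothrow @nogc
{
    lastRecord.signo = info.si_signo;
    lastRecord.code = info.si_code;
    atomicStore(recordReady, true);
}

/// Installs the recording handler for `sig`; returns true on success.
bool installRecorder(int sig)
{
    sigaction_t sa;
    sa.sa_sigaction = &onSignal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(sig, &sa, null) == 0;
}

/// Called from ordinary code, outside the handler, where allocation and
/// I/O are fine. Returns true if a record was waiting.
bool takeRecord(out CrashRecord rec)
{
    if (!atomicLoad(recordReady))
        return false;
    rec = lastRecord;
    atomicStore(recordReady, false);
    return true;
}
```

This is easy to exercise with raise(), since control returns normally afterwards. For a real hardware fault, returning from the handler re-executes the faulting instruction, so the "after returning" part still needs some way to get out of the faulting code.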
September 29, 2009
bearophile wrote:
> If even Andrei, a quite intelligent person that has written big books
> on C++, may be wrong on such a basic thing, then I think there's a
> problem.

Not everyone is an expert on everything, and how vptrs and vtbl[]s and casting actually work for multiple inheritance is far from being a basic thing.

Furthermore, different compilers implement these things differently. Last I heard, Java did it the way Andrei described.

Don Clugston wrote an article a few years ago on this, and found a wide variety of implementation strategies. The Digital Mars one was the fastest <g>.
September 29, 2009
== Quote from Jeremie Pelletier (jeremiep@gmail.com)'s article
> Sean Kelly wrote:
> > == Quote from Sean Kelly (sean@invisibleduck.org)'s article
> >> One thing I'm not entirely sure about is whether the signal handler will always
> >> have a valid, C-style call stack tracing back into user code.  These errors are
> >> triggered by hardware, and I really don't know what kind of tricks are common
> >> at that level of OS code.  longjmp() doesn't have this problem because it doesn't
> >> care about the call stack--it just swaps some registers and executes a JMP.  I
> >> don't suppose anyone here knows more about the feasibility of throwing
> >> exceptions from signal handlers at all?  I'll ask around some OS groups and
> >> see what people say.
> >
> > I was right, it is illegal to throw an exception from a signal handler.  And worse,
> > it's illegal to call malloc from a signal handler, so you can't safely create an
> > exception object anyway.  Heck, I'm not sure it's even safe to perform IO from
> > a signal handler, so tracing directly from within the handler won't even work
> > reliably.  In short, while I'm totally fine with people using this in their own
> > code, it's too unreliable to make an "official" solution by adding it to Druntime.
> Weird, it works just fine for me. Maybe its because the exception is always caught in the thread's entry point, i never tried to let such an exception unwind past the entry point. I haven't tried malloc or any I/O either.

I think in practice, the issue is simply that malloc and IO routines aren't on
the list of reentrant functions, so if a signal is called from within one of these
routines then the signal handler trying to call the same routine could cause
Bad Things to happen.  This actually comes up in our GC code on Linux
because threads are suspended for the collection via signals.  If one of
these threads is suspended within a non-reentrant library routine and the
GC code calls the same routine it can crash or deadlock on an internal
mutex (the latter actually happened on OSX until I changed how GC works
there).  This is kind of a weird issue, since in this case any thread can screw
with the GC thread, even though the GC thread itself never enters a signal
handler.  This is something that never occurred to me before--it was Fawzi
that figured out why OSX apps were deadlocking for no reason whatsoever
(I *think* this was pre-Druntime, though I can't recall precisely).

In short, you may never actually run into a problem using these functions,
and if they work for you then that's all that matters.  I'm just hesitant to
roll something into Druntime that is "undefined" according to a spec and
has only been verified to work through experimentation by a subset of
D users.  ie. I'd rather Druntime be a tad gimped and always work than
be super fancy and not work for some people.  YMMV.

> There still should be a way to grab the backtrace and context data from the hidden ucontext_* parameter and do something with it after returning from the signal handler.

Yeah, I saw one suggestion that you could have a thread blocked waiting for (in this case) backtrace data.  So another thread could do the trace and no worries about signal handler limitations.  Still, this seems like a pretty heavyweight approach.

If there were some way to cache the trace data and then have the same thread process it I'd love to know how.  I ran into this "can't throw exceptions from a signal handler" issue at a previous job, and finally gave up on the idea in frustration after not being able to come up with a decent workaround.

> The whole idea of a crash handler is to limit the number of times you need to do postmortem debugging after a crash, or launch the process again within the debugger.

Yup.  And as a server programmer, I think getting backtraces within a log file is totally awesome, since dealing with a core dump is difficult at best for such apps.  In fact I'd probably use your approach within my own code, since it seems to work.
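
For reference, the "blocked thread" suggestion is usually done with a pipe. Below is a minimal sketch assuming a POSIX system, with hypothetical names `notifyPipe`, `onNotify`, `installNotifier` and `waitForSignal`. The handler only calls write(), which is async-signal-safe, and the waiting side is free to allocate, trace and log.

```d
import core.sys.posix.signal;
import core.sys.posix.unistd : pipe, read, write;

// notifyPipe[0] is the read end, notifyPipe[1] the write end.
__gshared int[2] notifyPipe = [-1, -1];

// Hypothetical handler: sends the signal number down the pipe and returns.
extern (C) void onNotify(int sig) nothrow @nogc
{
    ubyte b = cast(ubyte) sig;
    write(notifyPipe[1], &b, 1);
}

/// Creates the pipe and installs the handler for `sig`; true on success.
bool installNotifier(int sig)
{
    if (pipe(notifyPipe) != 0)
        return false;
    sigaction_t sa;
    sa.sa_handler = &onNotify;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(sig, &sa, null) == 0;
}

/// Blocks until the handler reports a signal; returns its number,
/// or -1 if the read fails (for example when interrupted).
int waitForSignal()
{
    ubyte b;
    return read(notifyPipe[0], &b, 1) == 1 ? b : -1;
}
```

In practice `waitForSignal()` would run in a dedicated thread. It is still heavyweight, and for a real segfault the faulting thread has to be parked or the process ended once the report is written.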
September 29, 2009
Jeremie Pelletier wrote:
> struct NonNull(C) if(is(C == class)) {
>     C reference;
>     invariant() { assert(reference !is null); }
>     C opDot() { return reference; }
> }

This only catches null errors at runtime.  The whole point of a non-null type is to catch null errors at compile time.


-- 
Rainer Deyke - rainerd@eldwood.com
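
To make the runtime vs. compile-time distinction concrete, here is a sketch of a wrapper that moves one class of mistakes to compile time. It assumes a compiler that supports `@disable this()`; `Widget`, `payload` and `get` are made-up names. A forgotten initialization is rejected by the compiler, while an explicit null passed to the constructor is still only caught at runtime, so this covers only part of what a true non-null type would.

```d
class Widget
{
    int value = 42;
}

struct NonNull(C) if (is(C == class))
{
    private C payload;

    // No default construction: `NonNull!Widget w;` does not compile.
    @disable this();

    this(C obj)
    {
        // An explicit null is still only detected at runtime.
        assert(obj !is null, "NonNull constructed from null");
        payload = obj;
    }

    // Forward member access to the wrapped object.
    @property inout(C) get() inout { return payload; }
    alias get this;
}
```

With this, `auto w = NonNull!Widget(new Widget);` works and `w.value` reads through to the object, while a bare `NonNull!Widget w;` is a compile error.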
September 29, 2009
Sean Kelly wrote:
> == Quote from Jeremie Pelletier (jeremiep@gmail.com)'s article
>> Sean Kelly wrote:
>>> == Quote from Sean Kelly (sean@invisibleduck.org)'s article
>>>> One thing I'm not entirely sure about is whether the signal handler will always
>>>> have a valid, C-style call stack tracing back into user code.  These errors are
>>>> triggered by hardware, and I really don't know what kind of tricks are common
>>>> at that level of OS code.  longjmp() doesn't have this problem because it doesn't
>>>> care about the call stack--it just swaps some registers and executes a JMP.  I
>>>> don't suppose anyone here knows more about the feasibility of throwing
>>>> exceptions from signal handlers at all?  I'll ask around some OS groups and
>>>> see what people say.
>>> I was right, it is illegal to throw an exception from a signal handler.  And worse,
>>> it's illegal to call malloc from a signal handler, so you can't safely create an
>>> exception object anyway.  Heck, I'm not sure it's even safe to perform IO from
>>> a signal handler, so tracing directly from within the handler won't even work
>>> reliably.  In short, while I'm totally fine with people using this in their own
>>> code, it's too unreliable to make an "official" solution by adding it to Druntime.
>> Weird, it works just fine for me. Maybe its because the exception is
>> always caught in the thread's entry point, i never tried to let such an
>> exception unwind past the entry point. I haven't tried malloc or any I/O
>> either.
> 
> I think in practice, the issue is simply that malloc and IO routines aren't on
> the list of reentrant functions, so if a signal is called from within one of these
> routines then the signal handler trying to call the same routine could cause
> Bad Things to happen.  This actually comes up in our GC code on Linux
> because threads are suspended for the collection via signals.  If one of
> these threads is suspended within a non-reentrant library routine and the
> GC code calls the same routine it can crash or deadlock on an internal
> mutex (the latter actually happened on OSX until I changed how GC works
> there).  This is kind of a weird issue, since in this case any thread can screw
> with the GC thread, even though the GC thread itself never enters a signal
> handler.  This is something that never occurred to me before--it was Fawzi
> that figured out why OSX apps were deadlocking for no reason whatsoever
> (I *think* this was pre-Druntime, though I can't recall precisely).
> 
> In short, you may never actually run into a problem using these functions,
> and if they work for you then that's all that matters.  I'm just hesitant to
> roll something into Druntime that is "undefined" according to a spec and
> has only been verified to work through experimentation by a subset of
> D users.  ie. I'd rather Druntime be a tad gimped and always work than
> be super fancy and not work for some people.  YMMV.

I agree. I don't mind occasional crashes within the crash handler itself if it ever comes to that; at this point things are already going pretty badly anyway and the process is going to exit soon enough. It could be confusing as hell to library users if they don't know this might happen in rare cases, so I understand keeping it away from Druntime until a proven solution is found.

>> There still should be a way to grab the backtrace and context data from
>> the hidden ucontext_* parameter and do something with it after returning
>> from the signal handler.
> 
> Yeah, I saw one suggestion that you could have a thread blocked waiting
> for (in this case) backtrace data.  So another thread could do the trace
> and no worries about signal handler limitations.  Still, this seems like a
> pretty heavyweight approach.

Eh, I'm not going that way either :) Maybe spawn another process with some basic info collected by the signal handler (i.e. registers, loaded modules and backtrace) and let that other process deal with generating a crash window while we gracefully shut down with a core dump. That's also a heavyweight idea but it's only happening after a crash, not while waiting for it.

> If there were some way to cache the trace data and then have the same
> thread process it I'd love to know how.  I ran into this "can't throw
> exceptions from a signal handler" issue at a previous job, and finally
> gave up on the idea in frustration after not being able to come up with
> a decent workaround.
> 
>> The whole idea of a crash handler is to limit the number of times you
>> need to do postmortem debugging after a crash, or launch the process
>> again within the debugger.
> 
> Yup.  And as a server programmer, I think getting backtraces within a log
> file is totally awesome, since dealing with a core dump is difficult at best
> for such apps.  In fact I'd probably use your approach within my own code,
> since it seems to work.

Yeah, I'm not much into post-mortem debugging either; I like running within the debugger or having a convenient crash window. It's also a neat thing to use when you distribute your executable, since you can implement an SMTP mailer for the crash reports instead of the crash window.
September 29, 2009
Rainer Deyke wrote:
> Jeremie Pelletier wrote:
>> struct NonNull(C) if(is(C == class)) {
>>     C reference;
>>     invariant() { assert(reference !is null); }
>>     C opDot() { return reference; }
>> }
> 
> This only catches null errors at runtime.  The whole point of a non-null
> type is to catch null errors at compile time.
> 

That's what flow analysis is for, since these are mostly uninitialized variables rather than null ones.

It's dead easy to insert null into a nonnull reference, and since you expect the type to never be null it's the last thing you're gonna check. If variables are properly initialized, you'll never get null where you don't expect it, and those are checked at compile time too, and work on every type.
September 29, 2009
Jeremie Pelletier:

> Its dead easy to insert null into a nonnull reference,

If it's easy to put a null into a nonnull by *mistake*, then that system needs to be designed better.


> and since you expect the type to never be null its the last thing you're gonna check.

I agree, but I think in a well designed system such situations are really uncommon.


> If variables are properly initialized, you'll never get null where you don't expect it, and those are checked at compile time too, and work on every type.

Cyclone is an example of a language that has both flow analysis (in a very C-like language that allows some kinds of gotos too) and optional nonnull references (well, pointers). Maybe someone here could read their source code and adapt it to D. [One of the weirder characteristics of open source programs is that hardly anyone ever reads/copies code/solutions from other open source projects, and I don't think those stupid/idiotic differences in OSS licences are enough to justify such behaviours. I think there's also a strong amount of NIH syndrome. So I'm not holding my breath for the day when D will start working with the Mono C# devs to design a better GC that can be tuned and used for both such open source languages/implementations, which have different but not totally different GC needs.] I think Cyclone shows how to design a safer C-like language. And making D safer is simpler than making C safer, even though D is more complex than C.

Bye,
bearophile
September 30, 2009
Andrei Alexandrescu wrote:
> I seem to recall that interface dispatch in D does a linear search in the interfaces list, so you may want to repeat your tests with a variable number of interfaces, and a variable position of the interface being used.

Such numbers are not interesting to me. On average, each class I write implements one interface. I rarely use inheritance and interfaces in the same class.

But your information is incorrect. Here's what happens:

object of class A
| vtable
|   | classinfo pointer
|   | methods...
| fields...
| interface vtable
|   | struct Interface*
|   | methods

struct Interface
{
   ptrdiff_t this_offset;
   ClassInfo interfaceInfo;
}

There are two ways to implement interface calls with this paradigm. The compiler way:

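interface I
{
   void doStuff(int arg);
}
class A : I
{
   void doStuff(int arg) { writefln("do stuff! %s", arg); }

   // Pseudocode for the thunk the compiler generates; this method
   // actually goes into the interface vtable
   ReturnType!doStuff __I_doStuff(ParameterTypeTuple!doStuff args)
   {
      // the first slot of the interface vtable points at struct Interface
      auto iface = cast(Interface*)this.vtable[0];
      // adjust 'this' from the interface view back to the object
      this = this + iface.this_offset;
      return doStuff(args);
   }
}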


You can also do it with the runtime, but that's a lot harder. It would be effectively the same code.
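
The pointer adjustment can be observed from ordinary D code. This is a small sketch, not the compiler's actual thunk. `interfaceOffset` and `recordedOffset` are hypothetical helpers, and it assumes druntime's `Interface` record, which names the field `offset` rather than `this_offset`.

```d
interface I
{
    int doStuff(int arg);
}

class A : I
{
    int field;
    int doStuff(int arg) { return arg + 1; }
}

/// Byte distance between the object reference and its view as an I.
size_t interfaceOffset(A a)
{
    I view = a; // the implicit conversion adjusts the pointer
    return cast(size_t) cast(void*) view - cast(size_t) cast(void*) a;
}

/// The offset the runtime records for A's first (and only) interface.
size_t recordedOffset()
{
    return typeid(A).interfaces[0].offset;
}
```

For an instance of `A` the two functions should agree, and a call through the interface reference still reaches `A.doStuff`, because the adjustment is undone before the method body runs.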