August 28, 2013

Exceptional Exception Handling

The best way to write a reliable crash reporter on Mac OS X is to make it handle EXC_CRASH, but this will only work if you can handle the crash in another process. You can’t catch EXC_CRASH in-process. The question of why comes up from time to time, and I saw it most recently in Apple radar 14845058.

The short story is that your process is already dead by the time EXC_CRASH is raised. The long story is interesting, though.

Hardware Traps to Mach Exceptions

EXC_CRASH isn’t like the other exception types in that it doesn’t originate as a hardware trap (or fault, interrupt, or exception, depending on terminology). Take EXC_BAD_INSTRUCTION, for example. On x86 (ARM is similar but the iOS kernel source isn’t public), that’s the Mach exception that corresponds to the #UD hardware exception, among others. You can see the genesis of EXC_BAD_INSTRUCTION in the kernel source at 10.8.4 xnu-2050.24.15/osfmk/i386/trap.c user_trap. (T_INVALID_OPCODE is the constant for #UD.) When your code triggers #UD (perhaps via a ud2 instruction, which clang will generate for you if you call __builtin_trap(); on ARM, it gives you a trap instruction), it gets turned into an EXC_BAD_INSTRUCTION Mach exception, which will be delivered to the Mach exception handler registered for the thread, task, or host. The levels are tried in that order, and the first one that’s got a handler registered for the exception type gets to handle the exception. You can see this delivery mechanism in xnu-2050.24.15/osfmk/kern/exception.c exception_triage. The handler can be anything: in-process, out-of-process, or nonexistent.
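The thread, task, host lookup order is easy to model in a few lines of portable C. This is only a toy illustration of the ordering described above, not the kernel’s code: the triage function, the level enum, and the per-level bitmasks are hypothetical, and the exception type numbers are assumed to match <mach/exception_types.h>.

```c
#include <assert.h>

/* Exception type numbers, assumed to match <mach/exception_types.h>. */
enum { EX_BAD_ACCESS = 1, EX_BAD_INSTRUCTION = 2, EX_CRASH = 10 };

/* Hypothetical lookup levels, listed in the order they are tried. */
enum level { LEVEL_THREAD, LEVEL_TASK, LEVEL_HOST, LEVEL_COUNT, LEVEL_NONE = -1 };

/*
 * masks[level] has bit (1u << type) set when that level has a handler
 * registered for the exception type. Returns the first level that does,
 * or LEVEL_NONE when nobody is registered.
 */
int triage(const unsigned int masks[LEVEL_COUNT], int type) {
    assert(type > 0 && type < 32);
    for (int level = LEVEL_THREAD; level < LEVEL_COUNT; level++) {
        if (masks[level] & (1u << type)) {
            return level;
        }
    }
    return LEVEL_NONE;
}
```

With nothing registered at the thread level, a task-level handler for EXC_BAD_INSTRUCTION wins over a host-level one, and a type that nobody registered for falls through to LEVEL_NONE.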

Mach Exceptions to POSIX Signals

For the normal hardware crash types (everything in <mach/exception_types.h> from EXC_BAD_ACCESS through EXC_MACH_SYSCALL), there’s a host-level exception handler, installed by the kernel, that runs in the kernel. The exception port is called ux_exception_port in the kernel, and it’s set up by xnu-2050.24.15/bsd/kern/bsd_init.c bsdinit_task. Note that it uses EXC_MASK_ALL, which does not include EXC_CRASH. The actual handler code is xnu-2050.24.15/bsd/uxkern/ux_exception.c catch_mach_exception_raise (this is probably where things will start to sound familiar if you’ve written your own Mach exception handler before) which, in conjunction with ux_exception and the processor-specific machine_exception, is responsible for mapping the Mach exception to a POSIX signal and sending that to the victim process. The EXC_BAD_INSTRUCTION from the example above will be turned into SIGILL. EXC_BAD_ACCESS, which could have started as #GP (T_GENERAL_PROTECTION) or #PF (T_PAGE_FAULT), maps to SIGSEGV or SIGBUS, depending on the circumstances that caused the trap. Concisely, for Mach exceptions that aren’t EXC_CRASH, unless you’ve got your own exception handler registered at the thread or task level, the in-kernel host-level exception handler will send your process a signal.
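A rough sketch of that mapping, for orientation only. The function name signal_for_exception and the EX_ and KR_ constants are mine; the numeric values are assumed to match <mach/exception_types.h> and <mach/kern_return.h>. The SIGSEGV versus SIGBUS rule shown here (invalid address gives SIGSEGV, anything else gives SIGBUS) is my simplified reading of ux_exception, and the processor-specific machine_exception can override it, so treat the kernel source as the authority.

```c
#include <signal.h>

/* Values assumed to match the Mach headers; this sketch defines its own. */
enum {
    EX_BAD_ACCESS = 1,
    EX_BAD_INSTRUCTION = 2,
    EX_ARITHMETIC = 3,
    EX_BREAKPOINT = 6,
    EX_CRASH = 10,
    KR_INVALID_ADDRESS = 1  /* stands in for KERN_INVALID_ADDRESS */
};

/*
 * Simplified exception-to-signal mapping. Returns 0 when this sketch
 * does not synthesize a signal for the exception type.
 */
int signal_for_exception(int type, long code) {
    switch (type) {
    case EX_BAD_ACCESS:
        /* Invalid address: SIGSEGV. Other access failures: SIGBUS. */
        return code == KR_INVALID_ADDRESS ? SIGSEGV : SIGBUS;
    case EX_BAD_INSTRUCTION:
        return SIGILL;
    case EX_ARITHMETIC:
        return SIGFPE;
    case EX_BREAKPOINT:
        return SIGTRAP;
    default:
        /* EXC_CRASH never reaches this handler, so nothing maps from it. */
        return 0;
    }
}
```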

Lots of people writing Mach exception handlers set them up at the task level, and they use EXC_MASK_ALL or specific EXC_MASK values to pick the exception types they want to handle. This works if your exception handler is out-of-process, and if in-process, it works as well as an in-process handler can work (with the obvious caveat regarding exceptions on the exception handler thread). You’re not going to get a POSIX signal, but that’s probably fine, because if you’re handling exceptions through the Mach interface, you’re probably not trying to catch signals anyway.

Software-Based Termination

If you’re messing with EXC_CRASH, you probably know that a major drawback of this scheme is that it can only respond to crashes that originated as genuine hardware traps. abort() and all of the things that wind up calling abort() are not hardware traps; they’re generated entirely in software. This is important for a crash reporter because lots of interesting crashes arise through this mechanism, such as assertion failures and runtime (C++ and Objective-C) exceptions. abort() is implemented in Libc-825.26/stdlib/FreeBSD/abort.c abort, and it raises SIGABRT all on its own, without ever triggering a hardware trap. That means that your program can catch these crashes in-process via the POSIX signal interface, but because it was never a Mach exception to begin with, there’s no opportunity to catch one.

This is where EXC_CRASH comes in. EXC_CRASH is a new (as of Mac OS X 10.5) exception type that’s only generated in one place: when a process is dying an abnormal death. In xnu-2050.24.15/bsd/kern/kern_exit.c proc_prepareexit, the logic says that if the process is exiting due to a signal that’s considered a crash (one that might generate a “core” file, identified by the presence of SA_CORE in xnu-2050.24.15/bsd/sys/signalvar.h sigprop), an EXC_CRASH Mach exception will be raised for the task. Along with several other signals, the SIGILL, SIGSEGV, SIGBUS, and SIGABRT examples above are all core-generating, so they qualify for this treatment. By the time a process is exiting due to an unhandled signal, it’s a goner. It’s not going to be scheduled any more. That includes any Mach exception handler that was running on a thread in the process. This is why you can’t catch EXC_CRASH exceptions in the process itself: by the time an EXC_CRASH is generated, your process is no longer running. Indeed, in the bug report, you can see the abort() as an “upstream” caller of in-kernel process teardown code, passing through proc_prepareexit, exception_triage, and ultimately getting blocked waiting for a response to mach_exception_raise that will never come.
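The “is this signal a crash?” test can be approximated in user space. The sketch below lists the signals that, as far as I know, carry SA_CORE in the BSD-derived sigprop table; the function name is_crash_signal is mine, and the list should be checked against signalvar.h for the kernel version you care about.

```c
#include <signal.h>
#include <stdbool.h>

/*
 * True if `sig` is one of the signals assumed to be marked SA_CORE,
 * which is the condition under which EXC_CRASH would be raised.
 */
bool is_crash_signal(int sig) {
    switch (sig) {
    case SIGQUIT:
    case SIGILL:
    case SIGTRAP:
    case SIGABRT:
    case SIGFPE:
    case SIGBUS:
    case SIGSEGV:
    case SIGSYS:
#ifdef SIGEMT
    case SIGEMT:  /* not defined on every platform */
#endif
        return true;
    default:
        return false;
    }
}
```

SIGTERM and SIGKILL end a process too, but they aren’t core-generating, so they wouldn’t produce an EXC_CRASH.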

I recommend EXC_CRASH as the best way to handle crashes, but it absolutely requires an out-of-process handler, which is a more robust architecture for other reasons anyway. If your handler needs to be in-process for whatever reason (including being on a platform where you’re not supposed to be able to run more than one process), EXC_CRASH won’t work, but nothing’s stopping you from writing signal handlers, or from writing a Mach exception handler for all of the hardware-based exceptions and adding a SIGABRT handler to cover software-based crashes.

Leading by Example: Apple’s Crash Reporter

Apple’s Crash Reporter (or CrashReporter, or ReportCrash, after the name of its executable) is kind of fibbing when it tells you that the reason for your crash was “EXC_CRASH (SIGABRT)”. Everything that it catches was caught via EXC_CRASH. When it catches an EXC_CRASH that originated as a hardware exception, it recovers the original Mach exception type from the exception codes passed to it, stashed away by proc_prepareexit, and it shows you that instead of EXC_CRASH. This is where most people’s first encounter with EXC_CRASH comes from, and based on preexisting experience with the Mach exception handling interface, it can be misleading.
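An out-of-process handler can do the same unpacking. The sketch below assumes the layout I recall from proc_prepareexit for the first exception code: the terminating signal in the top 8 bits of the low 32, the original Mach exception type in the next 4 bits, and the low 20 bits of the original code below that. The struct and function names are hypothetical, and the layout should be verified against kern_exit.c before relying on it.

```c
#include <stdint.h>

/* Fields assumed to be packed into code[0] of an EXC_CRASH exception. */
struct crash_info {
    int signal;              /* bits 24-31: terminating POSIX signal */
    int original_exception;  /* bits 20-23: original Mach exception, 0 if none */
    uint32_t original_code;  /* bits 0-19: low 20 bits of the original code */
};

struct crash_info decode_exc_crash(int64_t code0) {
    uint64_t bits = (uint64_t)code0;  /* shift as unsigned to avoid sign issues */
    struct crash_info info;
    info.signal = (int)((bits >> 24) & 0xff);
    info.original_exception = (int)((bits >> 20) & 0x0f);
    info.original_code = (uint32_t)(bits & 0xfffff);
    return info;
}
```

Under this layout, an abort() shows up with signal 6 (SIGABRT) and an original exception of 0, which is why the report reads EXC_CRASH (SIGABRT), while a bad memory access carries signal 11 (SIGSEGV) and original exception 1 (EXC_BAD_ACCESS).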

ReportCrash is set as the default exception server for EXC_CRASH by launchd. launchd-442.26.2/src/core.c job_set_exception_port sets an internal variable, the_exception_server, the first time it sees a job that contains a Mach service definition that contains an ExceptionServer key with a boolean value (regardless of the actual value), or any time any Mach service definition has a value attribute that’s a dictionary regardless of the key. (This behavior seems odd to me too.) For a user launchd, the launch agent at /System/Library/LaunchAgents/com.apple.ReportCrash.plist provides the_exception_server in the form of a service named com.apple.ReportCrash. Subsequently, when launching any process (and user launchd is responsible for launching all processes in a user’s graphical login session), launchd-442.26.2/src/core.c job_setup_exception_port will default to the_exception_server as the task-level EXC_CRASH handler if no other exception port was specified. For the system launchd (which runs as the init process), the launch daemon at /System/Library/LaunchDaemons/com.apple.ReportCrash.Root.plist provides the_exception_server in the form of a service named com.apple.ReportCrash.DirectoryService, and once it’s detected, job_setup_exception_port immediately sets it as the host-level EXC_CRASH handler. Because task-level exception ports are inherited from parent processes by their children, this allows ReportCrash to run as the logged-in user for a crash in any process that descends from a user’s session, unless overridden by setting a different task-level EXC_CRASH handler. System-level coverage is provided by a ReportCrash instance that runs as root for any other process on the system descended from the root launchd (init).

To handle user processes not descended from a user’s launchd, processes associated with a specific user can set the user’s com.apple.ReportCrash handler themselves. login does this to provide user-level coverage of terminal logins via SSH, for example. Finally, to avoid the deadlock problem that would otherwise arise, crashes in the user-level com.apple.ReportCrash process itself are addressed by a distinct user-level instance of ReportCrash operating under the com.apple.ReportCrash.Self service name.

In the pre-10.5, pre-EXC_CRASH days, Mach exceptions and POSIX signals were hopelessly conflated. It was possible for a process to handle POSIX signals gracefully and continue running but to still have the Crash Reporter interface appear because it was triggered by Mach exceptions originating from hardware traps, just like the POSIX signals. The introduction of EXC_CRASH provided the necessary separation between hardware exceptions, which may be handled allowing the process to continue on its merry way, and crashes, which need not originate in hardware at all but are considered terminal.

Take It Outside

If you’ve got the luxury of handling your crashes out-of-process, I strongly recommend doing so. Generic in-process crash handling has always been somewhat dangerous, because it involves trying to accomplish something in a process whose state is effectively unknown. Writing an in-process crash handler requires some extremely defensive programming tactics. For example, a crash may have occurred because of an out-of-memory condition, or it may have occurred while an allocator lock was held, so the handler needs to avoid allocating memory (which would be impossible in these cases), and may even need to pre-allocate resources. In practice, this means that you can’t rely on most of the standard library unless you have assurance that it will operate correctly even in an exception handler. Even system calls can’t be expected to behave correctly: if a process is out of file descriptors, it won’t be able to open a new file to save information about the crash. An in-process crash handler is probably one of the harsher environments imaginable. The most defensive programming still won’t provide 100% coverage of all crashes if the handler is in-process.

By contrast, an out-of-process crash handler doesn’t need to be nearly as defensive, because it’s isolated from the victim process. That makes such handlers much easier to write. Since it’s in control of its own resources, it can make ordinary use of the standard library (including allocators) and system calls. Its role is simply to perform a post-mortem on the guaranteed-dead crashed process. The EXC_CRASH design permits much fuller coverage than would ever be possible with an in-process design.
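As a portable stand-in for the idea, here’s a sketch of a supervisor that watches a child from the outside using plain POSIX fork and waitpid. It is not an EXC_CRASH handler: a real one receives a Mach exception message and can inspect the task before it’s reaped, whereas this only learns the terminating signal. The names supervise, describe, victim_abort, and struct crash_report are hypothetical.

```c
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct crash_report {
    bool crashed;  /* true if the child was terminated by a signal */
    int signal;    /* terminating signal, valid when crashed is true */
};

/* Hypothetical victim: dies through abort(), entirely in software. */
void victim_abort(void) {
    abort();
}

/* Runs victim() in a child process and examines how it ended.
 * Returns 0 on success, -1 if the child could not be run or reaped. */
int supervise(void (*victim)(void), struct crash_report *out) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);  /* keep the demo from dumping core */
        victim();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    out->crashed = WIFSIGNALED(status);
    out->signal = out->crashed ? WTERMSIG(status) : 0;
    return 0;
}

/* The supervisor is healthy, so malloc and snprintf are fair game here.
 * The caller frees the result. */
char *describe(const struct crash_report *r) {
    char *text = malloc(64);
    if (text == NULL) return NULL;
    if (r->crashed) {
        snprintf(text, 64, "child crashed with signal %d", r->signal);
    } else {
        snprintf(text, 64, "child exited without a fatal signal");
    }
    return text;
}
```

Nothing in supervise or describe has to worry about a corrupted heap or a held allocator lock, because all of that damage stayed behind in the child.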