August 28, 2013

Exceptional Exception Handling

The best way to write a reliable crash reporter on Mac OS X is to make it handle EXC_CRASH, but this will only work if you can handle the crash in another process. You can’t catch EXC_CRASH in-process. This question arises from time to time, and I saw it most recently in Apple radar 14845058.

The short story is that your process is already dead by the time EXC_CRASH is raised. The long story is interesting, though.

Hardware Traps to Mach Exceptions

EXC_CRASH isn’t like the other exception types in that it doesn’t originate as a hardware trap (or fault, interrupt, or exception, depending on terminology). Take EXC_BAD_INSTRUCTION, for example. On x86 (ARM is similar but the iOS kernel source isn’t public), that’s the Mach exception that corresponds to the #UD hardware exception, among others. You can see the genesis of EXC_BAD_INSTRUCTION in the kernel source at 10.8.4 xnu-2050.24.15/osfmk/i386/trap.c user_trap. (T_INVALID_OPCODE is the constant for #UD.) When your code triggers #UD (perhaps via a ud2 mnemonic, which clang will generate for you if you call __builtin_trap(); on ARM, it gives you a trap mnemonic), it gets turned into an EXC_BAD_INSTRUCTION Mach exception, which will be delivered to the Mach exception handler registered for the thread, task, or host. These three levels are tried in that order, and the first level that has a handler registered for the exception type gets to handle the exception. You can see this delivery mechanism in xnu-2050.24.15/osfmk/kern/exception.c exception_triage. The handler can be anything: in-process, out-of-process, or nonexistent.

Mach Exceptions to POSIX Signals

For the normal hardware crash types (everything in <mach/exception_types.h> from EXC_BAD_ACCESS through EXC_MACH_SYSCALL), there’s a host-level exception handler present, installed by the kernel, and whose handler runs in the kernel. The exception port is called ux_exception_port in the kernel, and it’s set up by xnu-2050.24.15/bsd/kern/bsd_init.c bsdinit_task. Note that it uses EXC_MASK_ALL, which does not include EXC_CRASH. The actual handler code is xnu-2050.24.15/bsd/uxkern/ux_exception.c catch_mach_exception_raise (this is probably where things will start to sound familiar if you’ve written your own Mach exception handler before) which, in conjunction with ux_exception and the processor-specific machine_exception, is responsible for mapping the Mach exception to a POSIX signal and sending that to the victim process. The EXC_BAD_INSTRUCTION example will be turned into SIGILL, for example. EXC_BAD_ACCESS, which could have started as #GP (T_GENERAL_PROTECTION) or #PF (T_PAGE_FAULT), maps to SIGSEGV or SIGBUS, depending on the circumstances that caused the trap. Concisely, for Mach exceptions that aren’t EXC_CRASH, unless you’ve got your own exception handler registered at the thread or task level, the in-kernel host-level exception handler will send your process a signal.
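To make that mapping concrete, here’s a deliberately tiny model of it. This is a sketch, not the kernel’s code: the constants are local stand-ins that I believe mirror <mach/exception_types.h> and <mach/kern_return.h>, the function name is made up, and the SIGSEGV-versus-SIGBUS rule (invalid address means SIGSEGV, anything else means SIGBUS) is my reading of ux_exception, so check it against the source before relying on it.

```cpp
#include <csignal>

// Local stand-ins so the sketch builds on any POSIX system. The values are
// assumed to match <mach/exception_types.h> and <mach/kern_return.h>.
constexpr int kExcBadAccess = 1;        // EXC_BAD_ACCESS
constexpr int kExcBadInstruction = 2;   // EXC_BAD_INSTRUCTION
constexpr int kKernInvalidAddress = 1;  // KERN_INVALID_ADDRESS

// Hypothetical helper: returns the POSIX signal that the in-kernel host-level
// handler would send for a Mach exception, or 0 if this simplified model
// doesn't cover the exception type.
int SignalForMachException(int exception, int code) {
  switch (exception) {
    case kExcBadAccess:
      // Assumption: an invalid address becomes SIGSEGV, other access
      // failures become SIGBUS.
      return code == kKernInvalidAddress ? SIGSEGV : SIGBUS;
    case kExcBadInstruction:
      return SIGILL;
    default:
      return 0;
  }
}
```

Note that there’s no entry for EXC_CRASH in a table like this, which matches the point above: the host-level handler never sees it.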

Lots of people writing Mach exception handlers set them up at the task level, and they use EXC_MASK_ALL or specific EXC_MASK values to pick the exception types they want to handle. This works if your exception handler is out-of-process, and if in-process, it works as well as an in-process handler can work (with the obvious caveat regarding exceptions on the exception handler thread). You’re not going to get a POSIX signal, but that’s probably fine, because if you’re handling exceptions through the Mach interface, you’re probably not trying to catch signals anyway.
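For readers who haven’t done this before, a task-level registration looks roughly like the sketch below. The function name and the kHaveMach flag are mine, the choice of mask is just an example, and the non-Apple branch exists only so the file compiles elsewhere. Registering the port is only half the job: something still has to receive and answer the exception messages that arrive on it, and that part is omitted here.

```cpp
#if defined(__APPLE__)
#include <mach/mach.h>
constexpr bool kHaveMach = true;
#else
constexpr bool kHaveMach = false;
#endif

// Hypothetical helper: allocates a receive right, gives the task a send right
// to it, and registers it as the task-level handler for a few hardware
// exception types. On success, *receive_port names the port that a handler
// (in-process thread or another process) must service.
bool InstallHardwareExceptionPort(unsigned int* receive_port) {
#if defined(__APPLE__)
  mach_port_t port = MACH_PORT_NULL;
  task_t self = mach_task_self();
  if (mach_port_allocate(self, MACH_PORT_RIGHT_RECEIVE, &port) != KERN_SUCCESS)
    return false;
  if (mach_port_insert_right(self, port, port, MACH_MSG_TYPE_MAKE_SEND) !=
      KERN_SUCCESS)
    return false;
  // Specific EXC_MASK values rather than EXC_MASK_ALL, as an example.
  const exception_mask_t mask = EXC_MASK_BAD_ACCESS |
                                EXC_MASK_BAD_INSTRUCTION |
                                EXC_MASK_ARITHMETIC;
  if (task_set_exception_ports(self, mask, port, EXCEPTION_DEFAULT,
                               THREAD_STATE_NONE) != KERN_SUCCESS)
    return false;
  *receive_port = port;
  return true;
#else
  (void)receive_port;
  return false;  // No Mach here; nothing to register.
#endif
}
```

If nobody services the port, a thread that takes one of these exceptions will simply block waiting for a reply, so don’t install this without the other half in place.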

Software-Based Termination

If you’re messing with EXC_CRASH, you probably know that a major drawback of this scheme is that it can only respond to crashes that originated as genuine hardware traps. abort() and all of the things that wind up calling abort() are not hardware traps; they’re generated entirely in software. This is important for a crash reporter because lots of interesting crashes arise through this mechanism, such as assertion failures and runtime (C++ and Objective-C) exceptions. abort() is implemented in Libc-825.26/stdlib/FreeBSD/abort.c abort, and it raises SIGABRT all on its own, without ever triggering a hardware trap. That means that your program can catch these crashes in-process via the POSIX signal interface, but because it was never a Mach exception to begin with, there’s no opportunity to catch one.

This is where EXC_CRASH comes in. EXC_CRASH is a new (as of Mac OS X 10.5) exception type that’s only generated in one place: when a process is dying an abnormal death. In xnu-2050.24.15/bsd/kern/kern_exit.c proc_prepareexit, the logic says that if the process is exiting due to a signal that’s considered a crash (one that might generate a “core” file, identified by the presence of SA_CORE in xnu-2050.24.15/bsd/sys/signalvar.h sigprop), an EXC_CRASH Mach exception will be raised for the task. Along with several other signals, the SIGILL, SIGSEGV, SIGBUS, and SIGABRT examples above are all core-generating, so they qualify for this treatment. By the time a process is exiting due to an unhandled signal, it’s a goner. It’s not going to be scheduled any more. That includes any Mach exception handler that was running on a thread in the process. This is why you can’t catch EXC_CRASH exceptions in the process itself: by the time an EXC_CRASH is generated, your process is no longer running. Indeed, in the bug report, you can see the abort() as an “upstream” caller of in-kernel process teardown code, passing through proc_prepareexit, exception_triage, and ultimately getting blocked waiting for a response to mach_exception_raise that will never come.
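The “is this signal a crash?” test can be pictured as a simple predicate. The sketch below is only a stand-in for the kernel’s SA_CORE lookup: the function name is invented, and it lists just the four signals discussed in this article, whereas the kernel’s table marks several others as core-generating too.

```cpp
#include <csignal>

// Hypothetical stand-in for the SA_CORE check in proc_prepareexit. Only the
// four signals named in the article are listed; the real table is longer.
bool IsCrashSignal(int sig) {
  switch (sig) {
    case SIGILL:
    case SIGSEGV:
    case SIGBUS:
    case SIGABRT:
      return true;
    default:
      return false;
  }
}
```

A process killed by SIGTERM or SIGKILL, for instance, dies without this treatment, because those signals aren’t considered crashes.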

I recommend EXC_CRASH as the best way to handle crashes, but it absolutely requires an out-of-process handler, which is a more robust architecture for other reasons anyway. If your handler needs to be in-process for whatever reason (including being on a platform where you’re not supposed to be able to run more than one process), EXC_CRASH won’t work, but nothing’s stopping you from writing signal handlers, or from writing a Mach exception handler for all of the hardware-based exceptions and adding a SIGABRT handler to cover software-based crashes.
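For the in-process fallback, the SIGABRT half might look something like this. It’s a minimal sketch with made-up names, and it does almost nothing on purpose: it records the signal number and writes a fixed message using write(), which is async-signal-safe. A real handler would gather whatever crash information it can under the same restrictions.

```cpp
#include <signal.h>
#include <unistd.h>

// Set by the handler so the rest of the program (or a test) can observe it.
volatile sig_atomic_t g_last_crash_signal = 0;

// Hypothetical handler: sticks to async-signal-safe operations only.
extern "C" void CrashSignalHandler(int sig) {
  g_last_crash_signal = sig;
  static const char kMessage[] = "crash handler: caught fatal signal\n";
  ssize_t ignored = write(STDERR_FILENO, kMessage, sizeof(kMessage) - 1);
  (void)ignored;
}

// Hypothetical helper: installs the handler for SIGABRT. Returns true on
// success.
bool InstallAbortHandler() {
  struct sigaction action = {};
  action.sa_handler = CrashSignalHandler;
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  return sigaction(SIGABRT, &action, nullptr) == 0;
}
```

When the signal comes from abort(), returning from the handler doesn’t save the process: abort() goes on to terminate it anyway. That’s the behavior a crash reporter wants here.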

Leading by Example: Apple’s Crash Reporter

Apple’s Crash Reporter (or CrashReporter, or ReportCrash, after the name of its executable) is kind of fibbing when it tells you that the reason for your crash was “EXC_CRASH (SIGABRT)”. Everything that it catches was caught via EXC_CRASH. When it catches an EXC_CRASH that originated as a hardware exception, it recovers the original Mach exception type from the exception codes passed to it, stashed away by proc_prepareexit, and it shows you that instead of EXC_CRASH. This is where most people’s first encounter with EXC_CRASH comes from, and based on preexisting experience with the Mach exception handling interface, it can be misleading.
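Recovering the original exception is a matter of unpacking the first exception code. The sketch below assumes the layout as I understand it from proc_prepareexit: the signal in bits 24 through 31, the original Mach exception type in bits 20 through 23, and the low 20 bits of the original code below that. The struct and function names are invented, and the bit layout is an assumption to verify against the kernel source.

```cpp
#include <cstdint>

// What an EXC_CRASH handler can recover from code[0].
struct OriginalException {
  int signal;         // POSIX signal that killed the process
  int exception;      // original Mach exception type, 0 if there was none
  std::int64_t code;  // low 20 bits of the original exception code
};

// Hypothetical decoder for the assumed layout:
//   (signal << 24) | (exception << 20) | (code & 0xfffff)
OriginalException DecodeExcCrashCode(std::int64_t code0) {
  OriginalException original;
  original.signal = static_cast<int>((code0 >> 24) & 0xff);
  original.exception = static_cast<int>((code0 >> 20) & 0xf);
  original.code = code0 & 0xfffff;
  return original;
}
```

Under that layout, a plain abort() decodes to a signal with no original exception, which is how a report ends up saying “EXC_CRASH (SIGABRT)”.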

ReportCrash is set as the default exception server for EXC_CRASH by launchd. launchd-442.26.2/src/core.c job_set_exception_port sets an internal variable, the_exception_server, the first time it sees a job that contains a Mach service definition that contains an ExceptionServer key with a boolean value (regardless of the actual value), or any time any Mach service definition has a value attribute that’s a dictionary regardless of the key. (This behavior seems odd to me too.) For a user launchd, the launch agent at /System/Library/LaunchAgents/com.apple.ReportCrash.plist provides the_exception_server in the form of a service named com.apple.ReportCrash. Subsequently, when launching any process (and user launchd is responsible for launching all processes in a user’s graphical login session), launchd-442.26.2/src/core.c job_setup_exception_port will default to the_exception_server as the task-level EXC_CRASH handler if no other exception port was specified. For the system launchd (which runs as the init process), the launch daemon at /System/Library/LaunchDaemons/com.apple.ReportCrash.Root.plist provides the_exception_server in the form of a service named com.apple.ReportCrash.DirectoryService, and once it’s detected, job_setup_exception_port immediately sets it as the host-level EXC_CRASH handler. Because task-level exception ports are inherited from parent processes by their children, this allows ReportCrash to run as the logged-in user for a crash in any process that descends from a user’s session, unless overridden by setting a different task-level EXC_CRASH handler. System-level coverage is provided by a ReportCrash instance that runs as root for any other process on the system descended from the root launchd (init).

To handle user processes not descended from a user’s launchd, processes associated with a specific user can set the user’s com.apple.ReportCrash handler themselves. login does this to provide user-level coverage of terminal logins via SSH, for example. Finally, to avoid the deadlock problem that would otherwise arise, crashes in the user-level com.apple.ReportCrash process itself are addressed by a distinct user-level instance of ReportCrash operating under the com.apple.ReportCrash.Self service name.

In the pre-10.5, pre-EXC_CRASH days, Mach exceptions and POSIX signals were hopelessly conflated. It was possible for a process to handle POSIX signals gracefully and continue running but to still have the Crash Reporter interface appear because it was triggered by Mach exceptions originating from hardware traps, just like the POSIX signals. The introduction of EXC_CRASH provided the necessary separation between hardware exceptions, which may be handled allowing the process to continue on its merry way, and crashes, which need not originate in hardware at all but are considered terminal.

Take It Outside

If you’ve got the luxury of handling your crashes out-of-process, I strongly recommend doing so. Generic in-process crash handling has always been somewhat dangerous, because it involves trying to accomplish something in a process whose state is effectively unknown. Writing an in-process crash handler requires some extremely defensive programming tactics. For example, a crash may have occurred because of an out-of-memory condition, or it may have occurred while an allocator lock was held, so the handler needs to avoid allocating memory (which would be impossible in these cases), and may even need to pre-allocate resources. In practice, this means that you can’t rely on most of the standard library unless you have assurance that it will operate correctly even in an exception handler. Even system calls can’t be expected to behave correctly: if a process is out of file descriptors, it won’t be able to open a new file to save information about the crash. An in-process crash handler is probably one of the harsher environments imaginable. The most defensive programming still won’t provide 100% coverage of all crashes if the handler is in-process.

By contrast, an out-of-process crash handler doesn’t need to be nearly as defensive, because it’s isolated from the victim process. That makes such handlers much easier to write. Since it’s in control of its own resources, it can make ordinary use of the standard library (including allocators) and system calls. Its role is simply to perform a post-mortem on the guaranteed-dead crashed process. The EXC_CRASH design permits much fuller coverage than would ever be possible with an in-process design.
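To illustrate the isolation, here’s a portable toy that uses nothing but fork() and waitpid(). It is not an EXC_CRASH handler and it gets far less information than one would; the function name is invented. What it does show is the shape of the out-of-process arrangement: the watcher is a separate, healthy process that only starts its work once the victim is guaranteed dead.

```cpp
#include <cerrno>
#include <csignal>
#include <cstdlib>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs victim() in a child process and reports how it
// ended. Returns the signal that killed the child, 0 if it exited on its
// own, or -1 if the child couldn't be created or waited for.
int RunAndReportFatalSignal(void (*victim)()) {
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {
    // Child: suppress core files so the demonstration leaves no litter.
    struct rlimit no_core = {0, 0};
    setrlimit(RLIMIT_CORE, &no_core);
    victim();
    _exit(0);
  }
  // Parent: free to use any library facility it likes while waiting.
  int status = 0;
  pid_t waited;
  do {
    waited = waitpid(pid, &status, 0);
  } while (waited == -1 && errno == EINTR);
  if (waited != pid)
    return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling it with a function that invokes abort() reports SIGABRT, and the parent never had to worry about allocator locks or exhausted file descriptors in the child.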

May 11, 2011

Bilingual Memory Management

I’m a Mac Chrome software engineer. Mac software engineers spend a lot of time reading and writing Objective-C. Chrome software engineers spend a lot of time reading and writing C++. I spend a lot of time reading and writing both.

Both C++ and Objective-C are object-oriented extensions of the classic C programming language. They each take different approaches to achieving their goals at the language level. Each also has its own standard library, and its own set of common idioms used to realize certain behaviors. Although at a conceptual level the languages share similarities, they’re also very different.

Objective-C++ makes up for some of these differences. It’s a language that effectively takes all of C++ and Objective-C and puts them into a blender together. It allows C++ calls to be made from Objective-C code, and vice-versa. Objective-C++ eases the burden of having to deal with the two distinct languages in projects like Chrome that need to use both of them.

Scopers

Chrome, leaning on its C++ roots, tends to make use of C++’s memory-management features. In many cases, Chrome adheres to an object-ownership model that relies on the fact that when an object goes out of scope, it will be destroyed, cleaning up after itself along the way, releasing the memory it occupied to be used for other purposes. To assist in managing this process, Chrome even has template classes that our team informally calls “scopers.” Most widely used is scoped_ptr<>, which is similar to the C++ standard library’s std::auto_ptr<>, and even more similar to tr1::scoped_ptr<>. scoped_ptr<> maintains ownership of a pointer to a C++ object, deleting the object with the C++ delete operator when it goes out of scope.

Clean-up tasks are commonly delegated to scopers: once scoped_ptr<> holds a pointer, the program is guaranteed to delete the object when the scoped_ptr<> goes out of scope, regardless of any return statements or anything else that interferes with the program’s flow. Used in this way, scopers can eliminate cleanup points which, in traditional C, often involve the use of goto. This cleanup code often becomes cumbersome and maintaining it properly is an error-prone process as a program matures. Scopers declare “here’s something that needs to be cleaned up” exactly where the thing that needs to be cleaned up becomes your responsibility, and they automate the cleanup process. This leads to code that’s easier to read and easier to work on.
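Here’s a stripped-down illustration of the idea. ScopedPtr below is a toy I wrote for this example, not Chrome’s scoped_ptr<>, and Tracker exists only so that the cleanup can be observed.

```cpp
// Toy scoper: owns a pointer and deletes it when the scoper goes out of
// scope. Copying is disabled so that ownership stays unambiguous.
template <typename T>
class ScopedPtr {
 public:
  explicit ScopedPtr(T* ptr = nullptr) : ptr_(ptr) {}
  ~ScopedPtr() { delete ptr_; }
  ScopedPtr(const ScopedPtr&) = delete;
  ScopedPtr& operator=(const ScopedPtr&) = delete;

  T* get() const { return ptr_; }
  T* operator->() const { return ptr_; }
  T& operator*() const { return *ptr_; }

 private:
  T* ptr_;
};

// Counts live instances so the effect of the scoper is visible.
struct Tracker {
  static int live;
  Tracker() { ++live; }
  ~Tracker() { --live; }
};
int Tracker::live = 0;

// Both return paths clean up the Tracker; neither needs explicit cleanup
// code or a goto to a shared exit label.
bool DoWork(bool bail_early) {
  ScopedPtr<Tracker> tracker(new Tracker);
  if (bail_early)
    return false;
  return true;
}
```

However DoWork() returns, Tracker::live is back to zero afterward.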

Another scoper that’s especially relevant to Objective-C++ is scoped_nsobject<>. scoped_nsobject<> works similarly to scoped_ptr<>, except it owns an Objective-C object, and will call -release on the object when it goes out of scope. Since scoped_nsobject<> can be used in the C++ portion of Objective-C code, it’s an exceptionally handy way to gain the benefits of scopers in Objective-C++ files. scoped_nsobject<> is a C++ template class dedicated to operating on Objective-C objects.

Chrome code uses scoped_nsobject<> in preference to the standard Objective-C -autorelease method in cases where it makes sense to do so. It also uses scoped_nsobject<> to maintain ownership of Objective-C objects held in instance variables of other objects, whether those other objects are written in Objective-C or C++.

In C++, using a scoper as an instance variable ensures that when the object is destroyed, the scopers it owns will also be destroyed. Because scoped_nsobject<> will -release the Objective-C object it’s responsible for, this provides a way to integrate management of Objective-C objects into C++’s way of doing things. Objective-C itself doesn’t normally offer this feature, but in Chrome, we’ve enabled the -fobjc-call-cxx-cdtors compiler option, which ensures that destructors for C++ objects (like scoped_nsobject<>) held in instance variables will be called when an Objective-C object is deallocated. Using scoped_nsobject<> like this, in conjunction with the Objective-C language extension, means that we no longer need to write -dealloc methods that simply -release all of the objects that the object owned. Instead, we stick each Objective-C object that needs to be maintained as a strong reference in another Objective-C object into its own scoped_nsobject<>. This gives Objective-C objects the ability to automatically destroy other Objective-C objects that they own, C++-style. It’s been a boon for both readability and maintainability, and has almost certainly kept us from making sloppy errors that would cause memory leaks. In fact, the only real readability problem with scoped_nsobject<> used in this way is that you might have had to go over this paragraph multiple times in order to take it all in.
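If the paragraph above did take a few passes, a pure C++ analogy might help. Everything here is a stand-in: FakeNSObject imitates a retain-counted Objective-C object, ScopedReleaser imitates what scoped_nsobject<> does, and Owner plays the part of the object holding it as an instance variable. None of this is Chrome’s code.

```cpp
// Stand-in for a retain-counted Objective-C object. It starts with a count
// of one, as if freshly allocated, and destroys itself at zero.
class FakeNSObject {
 public:
  static int live_objects;
  FakeNSObject() { ++live_objects; }
  void Retain() { ++retain_count_; }
  void Release() {
    if (--retain_count_ == 0)
      delete this;
  }

 private:
  ~FakeNSObject() { --live_objects; }
  int retain_count_ = 1;
};
int FakeNSObject::live_objects = 0;

// Stand-in for scoped_nsobject<>: takes over one reference and gives it up
// when the scoper is destroyed.
template <typename T>
class ScopedReleaser {
 public:
  explicit ScopedReleaser(T* object = nullptr) : object_(object) {}
  ~ScopedReleaser() {
    if (object_)
      object_->Release();
  }
  ScopedReleaser(const ScopedReleaser&) = delete;
  ScopedReleaser& operator=(const ScopedReleaser&) = delete;
  T* get() const { return object_; }

 private:
  T* object_;
};

// The owning object has no hand-written cleanup. Destroying it destroys the
// scoper member, which releases the child.
class Owner {
 public:
  Owner() : child_(new FakeNSObject) {}

 private:
  ScopedReleaser<FakeNSObject> child_;
};
```

The Owner class is the part to look at: it has no destructor of its own, which is the C++ equivalent of not having to write -dealloc.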

The Objective-C Property Releaser

scoped_nsobject<> was great, but as Objective-C matured into Objective-C 2.0 and we began adopting some of its newer features, we hit a snag. Objective-C 2.0 introduced @property. Properties can refer to instance variables whose accessors are generated automatically by the compiler, using @synthesize. Properties, especially when coupled with synthesized accessors, can reduce the amount of code that needs to be written. Unfortunately, there’s no provision to release retained properties automatically when an object is deallocated. This seems like a major omission to me, but unfortunately, Apple never consulted me when developing Objective-C 2.0. Chrome code had been using scoped_nsobject<> to handle this, but synthesized properties require raw pointers to Objective-C objects, and don’t work with C++ objects such as scoped_nsobject<>.

A coworker spotted the impending doom. Given the tools at our disposal, the apparent options were:

  1. Avoid using scoped_nsobject<> for properties marked retain, and @synthesize the accessors. This would put us back to having to call -release from -dealloc methods again, but we wouldn’t have to write the accessors ourselves. Remembering to write all of those -release calls is error-prone and would lead to memory leaks.
  2. Let scoped_nsobject<> handle properties marked retain, but don’t let the compiler synthesize any accessors. This would mean that for each retained property, we’d need to write our own accessors that understood how to interact with the scoped_nsobject<>, but we wouldn’t have to call -release from -dealloc. Having to write all of those accessors as boilerplate code isn’t my idea of fun.

Dissatisfied with these options, I cooked up a solution more in line with my idea of fun. I came up with base::mac::ObjCPropertyReleaser. It’s another example of using C++ features to make Objective-C better. It brings the language closer to where I think it should be, and almost makes up for Apple forgetting to ask for my feedback when they were designing Objective-C 2.0.

An ObjCPropertyReleaser is a C++ object that can go directly into any Objective-C object as an instance variable. In the -init method (or other appropriate designated initializer), the property releaser is told which Objective-C object owns it, and which class that object belongs to. When the Objective-C object is deallocated, the -fobjc-call-cxx-cdtors compiler option causes the ObjCPropertyReleaser’s destructor to run. Taking advantage of the Objective-C runtime’s support for object introspection, the property releaser then determines which of the object’s declared properties are marked retain or copy, finds the instance variables backing those properties which are synthesized, and sends them a -release message.

To sum it up more succinctly, ObjCPropertyReleaser releases everything backing a synthesized property marked retain or copy, and it does it automatically when the object is deallocated.
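The introspection step boils down to reading each property’s attribute string, the kind of thing the runtime’s property_getAttributes returns, such as T@"NSString",&,N,V_name. In that encoding, & means retain, C means copy, and V introduces the name of the backing instance variable. The parser below is a simplified sketch of that decision with invented names. It isn’t the real ObjCPropertyReleaser code, and it assumes that no field in the string contains a comma of its own.

```cpp
#include <cstddef>
#include <string>

// Result of examining one property's attribute string.
struct ReleaseDecision {
  bool should_release;    // retain or copy, and backed by an ivar
  std::string ivar_name;  // empty if no backing ivar was listed
};

// Hypothetical parser: splits the comma-separated attribute string and looks
// for "&" (retain), "C" (copy), and "V<name>" (backing instance variable).
ReleaseDecision ParsePropertyAttributes(const std::string& attributes) {
  bool owned = false;
  std::string ivar;
  std::size_t start = 0;
  while (start <= attributes.size()) {
    std::size_t end = attributes.find(',', start);
    if (end == std::string::npos)
      end = attributes.size();
    const std::string field = attributes.substr(start, end - start);
    if (field == "&" || field == "C")
      owned = true;
    else if (field.size() > 1 && field[0] == 'V')
      ivar = field.substr(1);
    start = end + 1;
  }
  return ReleaseDecision{owned && !ivar.empty(), ivar};
}
```

A retained or copied property with a backing instance variable gets released. An assign property or a scalar is left alone, and so is a property with no instance variable listed.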

The property releaser saves us from having to call -release manually as needed from -dealloc methods, and saves us from having to write boilerplate accessors because of the incompatibility between @synthesize and scoped_nsobject<>. It takes the best aspects of scoped_nsobject<> and @synthesize without any of the drawbacks they have relative to each other. It also means that we get to spend less time writing code, freeing us up to spend more time doing other things, like writing columns.

Experienced Objective-C developers might wonder why the property releaser needs to be initialized with both the object that owns it and that object’s class type. After all, you can always determine an object’s class by calling its -class method. As it happens, this would be an incredibly dangerous thing to do when subclassing comes into play. -class always returns an object’s most-specific type, which might not be the type that a given instance of the property releaser is supposed to be responsible for. I considered other ways to design the property releaser, such as making it a base class that could reach into all of its subclasses’ properties, but this seemed like a bad idea. I don’t believe that a base class should ever screw around with its subclasses’ data. The base class idea would have also made it difficult to use the property releaser in a class that needed to extend another base class but didn’t use the property releaser itself.

In case you’re not using -fobjc-call-cxx-cdtors, you can still use ObjCPropertyReleaser. Just call its ReleaseProperties method from your -dealloc method. Be sure not to do both, though: don’t call ReleaseProperties directly in conjunction with -fobjc-call-cxx-cdtors.

A Gift for You

Both scoped_nsobject<> and ObjCPropertyReleaser are almost entirely self-contained and you might find that they’re completely at home in your project, even if your project isn’t Chrome. The property releaser even comes with a Sesame Street-inspired unit test, which is my current favorite bit of Chrome code. Here are the raw files from Chrome’s Subversion repository:

Four! Four files! Ah, ah, ah.

A Gift for Me

Today’s my birthday. You probably forgot to get me something. That’s fine: if you like the scoper or the property releaser and choose to share them with your favorite Mac developer, that’s enough of a gift for me. If not, well, there’s always next year.

May 3, 2011

A Touch of Yellow

Yesterday, Google announced the availability of Google Chrome Canary for Mac. The Canary is a version of Chrome that’s updated very frequently—in most cases, daily. It offers the absolute latest version of Chrome, closest to what most developers are working on. It’s the first place to see new features, bug fixes, and other changes to Chrome. It’s also entirely untested, so it might not even launch, and if it does launch, it might work so badly that you’d wish it hadn’t.

This version of Chrome has been available for Windows since last August, and it joins the dev, beta, and stable Chrome channels on the Mac. These channels allow users to choose whether to run a very well-tested and well-supported but older version of Chrome, labeled the stable channel, a more recent (and fun) version like the Canary, or something in between. Dev (short for developer) channel releases usually come weekly and are minimally tested so that users are spared major brokenness. The beta channel is tested more thoroughly and is used to stabilize releases before they’re promoted to the stable channel.

Early Warning

The Canary is actually an important part of Chrome’s strategy in that it enables the six-week “release early, release often” schedule to work as well as it does. By getting a Chrome update out to Canary users daily, the amount of time we wait for feedback is dramatically reduced. If a new feature causes Chrome to crash frequently, we’ll know about it within a day of turning that feature on. If an engineer fixes that crash, we’ll have validation that the fix works within a day of making the repair.

The rapid feedback provided by Canary users was the inspiration for its name. The Chrome Canary is our equivalent of a canary in a coal mine, which would show signs of oxygen deprivation or gas poisoning as an early warning to workers that a hazardous condition existed. When a hazardous condition exists in Chrome, the Canary will warn us about it long before we’ll find out from dev, beta, or stable channel users.

Like the other Chrome channels, Canary feedback comes in two ways: from bug reports entered by users, and from usage statistics and crash reports that Chrome provides automatically. One difference between the Canary and the other channels is that in the Canary, the checkbox to enable these automatic reporting features is on by default. (The option is always presented at installation time.)

Silvery Metallic

In a sense, the Canary builds are similar to the existing Chromium snapshots that some users are running. Chromium snapshots are produced automatically, approximately hourly, and are also entirely untested. Chromium snapshots don’t include any of the automatic reporting that the Canary does, so we were missing an important feedback channel when users chose to run Chromium instead of a version of Chrome. The Chromium snapshots also don’t include any automatic updater, which has prompted some to devise their own mechanisms to keep up-to-date. There are other differences between Chromium and Google Chrome, but the Canary intends to remain true to Google Chrome as closely as possible.

Monochromacy

There are a few areas where the Canary intentionally differs from Chrome’s existing dev, beta, and stable builds. As mentioned above, automatic usage statistics and crash reporting are enabled by default in the Canary, emphasizing its role as part of a feedback-gathering system. The Canary can also be installed alongside any other version of Chrome, providing a sort of escape hatch: if the Canary ever winds up unusable, just use a non-Canary version of Chrome for a few days. In fact, the Canary uses a completely different set of settings than any other Chrome installation, so you can run both of them simultaneously, and they won’t interfere with one another.

The Canary can’t be set as your default browser. That’s the official line, anyway, and the features in Chrome that allow it to offer to set itself as the default browser have been disabled in the Canary. Practically, all this means is that when you click on a web link in another application, it’ll open in some other browser, not the Canary. I’ll let you in on a little secret, though: if you’ve installed the Canary on your Mac and really want to use it as your default browser, you can set it as the default web browser in Safari’s preferences. The irony of invoking another web browser to make this work isn’t lost on me, but sometimes, you’ve just got to swallow your pride for a moment to get what you want.

How It’s Done

At the nittiest and grittiest level, producing the Canary is just a minor twist on Chrome’s established build process. When we build Google Chrome, it doesn’t have any idea what sort of channel it’s going to be released to, or whether it’s going to be a Canary. It just knows it’s Google Chrome. The specialization happens at the very end of the process. A script takes these “undifferentiated” but complete builds of Google Chrome, makes a few copies of them, and then makes the necessary changes in the copies to turn them into dev channel builds or Canaries or whatever else is called for.

Many of these changes occur in the browser application’s Info.plist. In the case of the Canary, specialization means that the automatic updater will be configured to treat the Canary as a distinct product from Google Chrome so that the two can coexist side-by-side. This is done by setting KSProductID to com.google.Chrome.canary instead of com.google.Chrome. A similar change is made to CFBundleIdentifier so that Mac OS X doesn’t get confused between Google Chrome and the Canary. The Canary has a setting named CrProductDirName, which is set to Google/Chrome Canary, and the auto-updater is set to use the canary channel by setting KSChannelID to canary. The colorful application and document icons are replaced with yellow ones, the managed preferences manifest is tweaked and renamed corresponding to the other changes that were made, and that’s it. Ding! It’s done, and ready to be released to users without any further delay.

Notably, there’s absolutely no difference in code anywhere between the Canary and other channels of Google Chrome. The side-by-side feature is enabled by basing the location to store settings on CrProductDirName. In every other way that the Canary needs to vary from the other channels, it’s handled by looking at KSChannelID.
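The real work is done by a script operating on Info.plist, but the idea fits in a few lines. The sketch below models the plist as a string map. The function names are invented, the fallback directory name is an assumption made for illustration, and the CFBundleIdentifier change is left out because its exact value isn’t given above.

```cpp
#include <map>
#include <string>

using InfoPlist = std::map<std::string, std::string>;

// Hypothetical specialization step: takes an undifferentiated build's keys
// and returns a copy with the Canary values described in the article.
InfoPlist SpecializeForCanary(InfoPlist info) {
  info["KSProductID"] = "com.google.Chrome.canary";
  info["KSChannelID"] = "canary";
  info["CrProductDirName"] = "Google/Chrome Canary";
  return info;
}

// Hypothetical lookup: the settings location is based on CrProductDirName.
// The fallback used when the key is absent is an assumption.
std::string SettingsDirName(const InfoPlist& info) {
  auto it = info.find("CrProductDirName");
  return it != info.end() ? it->second : std::string("Google/Chrome");
}
```

Because the specialized copy points at a different settings directory, the two builds can run side by side without the code itself knowing anything about Canaries.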

Since there’s no difference between Chrome and its Canary, if you ever wound up with a dev channel build of Chrome whose version number is identical to a Canary’s, you could compare their versioned directories and you’d find that they’re 100% identical. If you’re interested in trying this experiment, the stars should align properly for you once a week.

Running Simultaneously

If you want to run Chrome and the Canary side-by-side, you might benefit from these tips. Having the Canary operate out of an entirely different group of settings than other Chrome builds is part of what enables the two to run simultaneously, but it can also be frustrating if you’ve amassed a large collection of bookmarks, extensions, or other settings. If you enable Sync in each, they’ll share data, while maintaining their distinct presences on your hard drive.

If you find that you’re running both Chrome and the Canary simultaneously and need a way to distinguish them, consider giving each a different theme to provide some visual distinction. Personally, I’d go with something heavy on the yellows for the Canary.

How Do I Choose?

Chrome Stable, Chrome Beta, Chrome Dev, Canary, and Chromium. Still not sure which to choose, even in spite of my warnings of gas poisoning?

Handle this like you’d handle the purchase of a new car. Just pick by color.

April 5, 2011

Keeping Up

How Chrome stays fresh without getting in your face, and why its on-disk structure is a little unconventional

Google’s philosophy on software updates is simple: nobody should ever be using an outdated version of an application. On the web, this is easy to accomplish. When a new version of the software that powers a web site is ready, it replaces the old. The original Google Search page prototype is recognizable as an ancestor of today’s equivalent, but it looks different and acts differently. There isn’t a single person in the world today using that 1998 version with its measly 25 million pages, many of which are probably long-gone by now. At any point in time, everyone searching Google is as up-to-date as possible. As the web changes, as features are added, and as the design is tweaked, the world will always be using the current version of Google Search.

What’s Chrome got to do with it?

The same principles that apply to keeping Google Search users current can apply to users of client software. In this context, “client software” simply refers to applications running on your own computer. As with Google Search, the thought is that as client software like Chrome changes, features are added, and its design is tweaked, the world should always be using the current version. Everyone using Chrome should be as up-to-date as possible.

It’s easy to keep Google Search users updated, because all of the search software Google writes runs on Google’s own computers, which are under Google’s control. Keeping Chrome updated is a different story, because the software Google writes runs on your computer, which is under your control. Whenever a piece of software’s author is different from its user, a conflict arises. In this case, the author wants the users to be on the latest approved version, but the user might not want to dedicate time to installing an update, or might not even be aware that a new version is available.

The desire to keep users current is more significant than just keeping everyone on the “latest and greatest” version. As Chrome evolves, it receives fixes for bugs, its speed improves, and it sees new features added. Some bugs might affect stability by causing Chrome to crash or behave erratically. Others might even impact your computer’s security or your own privacy. It’s an absolute priority to deploy fixes for these bug classes. It’s far less onerous to provide support for recent versions than for an arsenal of obsolete versions.

What’s bad for the goose is bad for the gander

The traditional approach to keeping client software updated is to offer new versions to users as they become available. Thanks to the Internet’s ubiquity, this is now handled almost exclusively online. At a specified interval, the software might check to see whether it’s up to date. If it’s not, then it’ll offer its user the option to perform an update. If the user agrees, then the update process typically involves downloading the new version of the software, after which it will be installed, replacing the now-outdated version. Generally, the installation step can’t proceed while the outdated version is running, so the user is usually prompted to quit the program while the update is applied. Whether they actually need to or not, some updates even want the entire computer to restart when they’re done installing.

From the standpoint of keeping all users up to date all the time, the big problem with this traditional approach is the word “if.” What happens if the user doesn’t agree to perform the update? Well, nothing. No update is downloaded, no update is installed, and the user continues using the outdated version. I can’t even blame people for not wanting to update. Most users probably just want to get on with their lives, and having to take time out of their lives to quit a program they were in the middle of using and maybe even restart their computer isn’t a very attractive proposition. As a result, they keep on going with their old versions, and they’re periodically irritated when some box pops up to ask them the same question about updating that they’ve already said “no” to countless times.

This traditional approach is bad for me as the author and it’s bad for you as the user. There really shouldn’t be so much tension between us. We figured out a way to improve upon the status quo.

Silence!

The traditional approach can be simplified by removing the “if” from the equation. By never asking the user any questions like “wanna update now?” and instead always assuming that it’s a good idea to update, the software can stay out of the user’s face, and ensure that the latest version is always present. In a sense, this approach is even easier than asking the questions, because nobody even needs to write the code responsible for asking the user.

This is the first thing that I did when implementing Mac Chrome’s updater. It turned out to be a huge mistake.

Recall that in order to apply an update, the new version needs to be downloaded and then installed, replacing the old version. What happens if the old version is already running? The new version sort of collides with it in unexpected and interesting ways. Interesting forensically, that is. It’s never interesting in a way that I’d want a user to experience.

In Mac Chrome’s case, quietly installing an update in the background while the program was running interacted very poorly with Chrome’s multi-process architecture. Chrome’s IPC ping-pong game relies on the browser, renderer, and other processes being able to communicate with one another. Since they’re all part of the same application, they expect to be able to communicate in a common language, and this language is specific to each version of the application. But if you’re running browser version 4, and behind your back, Chrome is updated to version 5, the next time the browser tries to start a renderer, it’ll start a version 5 renderer. Renderer 5 can’t figure out what Browser 4 wants from it, and Browser 4 has no way to start up Renderer 4, so you’re stuck running a version of Chrome that really can’t do anything at all. The only workaround is to quit Chrome and restart it, by which I mean quit Browser 4 and launch Browser 5, which does know how to talk to Renderer 5.
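The failure mode is easy to model. What follows is a minimal sketch, not Chrome's actual IPC code: the class names, the integer versions, and the rule that a renderer rejects any message from a different version are all assumptions made for illustration.

```python
from dataclasses import dataclass


class VersionMismatch(Exception):
    """Raised when two processes do not speak the same IPC dialect."""


@dataclass
class Renderer:
    version: int

    def handle(self, message_version: int, payload: str) -> str:
        # Assumption: a renderer only understands messages from its own version.
        if message_version != self.version:
            raise VersionMismatch(
                f"renderer {self.version} cannot parse browser {message_version} messages"
            )
        return f"rendered:{payload}"


@dataclass
class Browser:
    version: int

    def load(self, renderer: Renderer, url: str) -> str:
        # Every message is implicitly stamped with the browser's own version.
        return renderer.handle(self.version, url)


def launch_renderer(installed_version: int) -> Renderer:
    # With an in-place update, the only renderer left on disk is the newest one.
    return Renderer(version=installed_version)


if __name__ == "__main__":
    browser = Browser(version=4)
    print(browser.load(launch_renderer(4), "example.com"))  # before the update
    try:
        browser.load(launch_renderer(5), "example.com")  # after a silent update
    except VersionMismatch as error:
        print("stuck:", error)
```

The running browser never changes, but the renderer it gets depends on whatever happens to be on disk at launch time, and that is the whole problem.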

Since the update was performed silently, in the background, and without any way to prevent it, this system just interrupted you without any warning, and didn’t even provide a clear indication that you could recover by restarting the application. This is certainly worse than the problem that we were trying to solve.

Traditional update systems get around this problem by asking the user to quit the program in order to apply the update, but I’m imposing a design constraint: I don’t ever want to ask the user anything. Some update systems download the update and then wait to install it until the user quits the program or next tries to start it, but this approach means that the user winds up having to wait for the update to be installed, and I don’t ever want anyone to have to wait for me.

Peace and Quiet

Ultimately, the solution to the update problem was simple, if unconventional. Instead of replacing the old version of the application with the new one, the new version is installed alongside the old. If you happened to be running the old version when the new one was installed, it would be able to continue running without experiencing any interruptions. If it was Browser 4, then when it needed a renderer, it would still get Renderer 4, and the two would be able to conduct intelligent discourse. A subsequent launch of Chrome would get you version 5.

I was able to make this work by leveraging the fact that Mac applications are really bundles. An application bundle is a directory that can contain everything it needs to function, and it’s represented to users as a single self-contained icon. All I had to do was take the framework, which is where all of the interesting parts of the program live anyway, as I described in a previous article, and put it into what I call the “versioned directory.” If you poke around inside the innards of the Chrome application bundle on the Mac, you’ll find this “versioned directory” inside Contents/Versions. Each new version of Chrome gets its own versioned directory.

When you start Chrome, dyld, the Mac OS X loader, sees that it also needs to load the framework. The main executable program specifies the framework to the loader by its precise location within the versioned directory. In this example, dyld will look for the framework in Contents/Versions/12.0.712.0:

mark@rj bash$ otool -L 'Google Chrome.app/Contents/MacOS/Google Chrome'
Google Chrome.app/Contents/MacOS/Google Chrome:
        @executable_path/../Versions/12.0.712.0/Google Chrome Framework.framework/Google Chrome Framework (compatibility version 712.0.0, current version 712.0.0)
        /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.4.0)
        /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.4)

Evidently, I’ve got version 12.0.712.0 installed, and a quick check of the About box confirms that this is the version I’m actually running. Now take a look at the versioned directories:

mark@rj bash$ ls -l 'Google Chrome.app/Contents/Versions'
total 0
drwxr-xr-x  4 root  wheel  136 Mar 18 23:02 11.0.696.16
drwxr-xr-x  4 root  wheel  136 Mar 23 06:05 12.0.712.0

I’ve got versions 12.0.712.0 and 11.0.696.16 installed. I’m running 12.0.712.0 and am not using the older version at all any more, but when the update to 12.0.712.0 occurred, I had been actively using 11.0.696.16, so the older versioned directory couldn’t have been removed at that time.
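The side-by-side layout can be sketched in a few lines. This is only an illustration under stated assumptions, not the real updater: the file names framework.txt and current_version.txt are hypothetical stand-ins for the framework and for the main executable's reference to it, and Example.app is a made-up bundle.

```python
import tempfile
from pathlib import Path


def install_version(bundle: Path, version: str) -> Path:
    """Add a new versioned directory without touching any existing one."""
    target = bundle / "Contents" / "Versions" / version
    target.mkdir(parents=True, exist_ok=False)
    # Hypothetical stand-in for the framework that lives in this directory.
    (target / "framework.txt").write_text(f"framework {version}\n")
    return target


def set_current(bundle: Path, version: str) -> None:
    """Stand-in for the main executable naming the framework it will load."""
    (bundle / "Contents" / "current_version.txt").write_text(version)


def framework_for_new_launch(bundle: Path) -> Path:
    """Resolve the framework that a fresh launch would load."""
    version = (bundle / "Contents" / "current_version.txt").read_text()
    return bundle / "Contents" / "Versions" / version / "framework.txt"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bundle = Path(tmp) / "Example.app"
        install_version(bundle, "11.0.696.16")
        set_current(bundle, "11.0.696.16")
        running = framework_for_new_launch(bundle)  # resolved by the old launch

        install_version(bundle, "12.0.712.0")  # the update arrives
        set_current(bundle, "12.0.712.0")

        print(running.read_text().strip())                           # old copy intact
        print(framework_for_new_launch(bundle).read_text().strip())  # new launches
```

A process that resolved its framework before the update keeps a path that still exists afterward, and only new launches pick up the new directory.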

Cleaning Up

The update procedure itself is responsible for removing old versions, but it’s very careful not to remove any versioned directories that look like they might be in use when an update is applied. As an extra measure of safety, the updater will always save the versioned directory containing the version being updated from. This means that once Mac Chrome has updated itself, it’ll always contain at least two copies of the entire program. This keeps the old version usable for as long as you continue to run it, but as soon as you quit the old version, there’s no longer any way to start it, because the browser application will then load the updated framework. The old version will stick around until it can be cleaned up during a subsequent update.

This system works beautifully. Users always have the latest approved version of Chrome available to them, but aren’t ever bothered by questions that interfere with their work, or play, or both. The only drawback is that the old versioned directory sticks around after it ceases to be useful, but you’d probably only notice this if you compared the size of a freshly-installed copy of Chrome to one that had been updated. Given the massive amount of storage available on contemporary computers, carrying a little extra heft inside of the Chrome application is a small price to pay in exchange for everything working so smoothly.

It might seem smart to let Chrome clean up after itself and remove old versioned directories when it launches, or at some other point, rather than having to wait for another update to run. In reality, this wouldn’t be a robust or reliable solution. Chrome, or the user running it, might not actually have permission to make changes within the application bundle. In contrast, during a successful update, the updater has already proven that it has the requisite level of access, and is in the best position to attempt removal of the old versioned directory.
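The retention rule from this section boils down to a small set computation. Here's a rough sketch, assuming the updater already knows which versions are installed and which look like they're in use. The function name and its parameters are hypothetical, and how the real updater detects an in-use directory isn't covered here.

```python
def versions_to_remove(installed, updating_from, new_version, in_use):
    """Return the versioned directories that are safe to delete during an update.

    Always kept: the version being installed, the version being updated from
    (the extra measure of safety), and anything that looks like it is in use.
    """
    keep = {updating_from, new_version} | set(in_use)
    return sorted(version for version in installed if version not in keep)


if __name__ == "__main__":
    # "10.0.0.0" is a made-up leftover from an earlier update.
    installed = ["10.0.0.0", "11.0.696.16", "12.0.712.0"]
    print(versions_to_remove(
        installed,
        updating_from="11.0.696.16",
        new_version="12.0.712.0",
        in_use=["11.0.696.16"],
    ))
```

Because the version being updated from is always kept, the result can never leave fewer than two copies behind after an update, which matches the behavior described above.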

Stragglers

In the end, I was able to get nearly everything stashed away neatly within the versioned directory, but there are a few files that Mac OS X insists live at specific locations within an application’s bundle. Fortunately, none of these files have a significant impact on Chrome once it’s running.

Poking around in Chrome’s Contents/Resources directory will reveal these stragglers. As of the current version of Chrome, it includes a couple of icons (.icns files), files supporting scripting and managed preferences, and copyright messages translated into 52 languages (all of those .lprojs). One level up, in the Contents directory, there’s an Info.plist containing general information about the program, and the application’s code signature.

Sign me up

Astute readers might wonder how Chrome’s update scheme works with code signing. Every copy of Chrome that ever leaves Google is “signed” before making its way to a user’s computer. The signature assures that every file contained within the application is present and in the same condition that it was in when the program was built, and that nothing’s been added, removed, or changed. It’s a way to assure that nobody’s tampered with the application, and that you’re running the real deal.

Whenever we release a new version of Chrome, we sign it in such a way that only the contents of that version’s versioned directory are considered for the signature. That way, even if one or more older versioned directories are present on your computer, the signature can still check out as valid, as long as nothing else has been touched.

While I was designing this scheme, I had initially assumed that I’d be able to use symbolic links to handle those files that, for one reason or another, couldn’t live within the versioned directory. For example, if I were to replace fr.lproj with a symbolic link referring to something within the versioned directory, I’d be able to keep the French copyright message, «Tous droits réservés.», inside the versioned directory, too. As it happens, a bug in Apple’s code signing system causes it to completely ignore symbolic links. In this scenario, someone would be able to point the fr.lproj symbolic link elsewhere, perhaps changing the message to «Voulez-vous coucher avec moi ce soir?», and the modification would be undetectable to the code signing system. An awkward invitation like this one is a tame example, but things could certainly get much worse from there.

As a result, I didn’t pursue this design any further, and I created a new policy: no more symbolic links in the application.

Ironically, the code signing procedure itself creates some symbolic links, although they don’t negatively impact its own operation. It’s amusing that the code signing process creates these symbolic links when it has such a hard time coping with those established by others. Go figure.
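The symbolic-link blind spot is easier to appreciate with a toy model. To be clear, this is not how Apple's code signing works internally. It's a sketch that assumes a verifier which hashes regular files and never records symbolic links, and every path in it is made up.

```python
import hashlib
import os
import tempfile


def seal(root):
    """Hash every regular file under root; symbolic links are never recorded."""
    digests = {}
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # the flaw being modeled
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def write(path, text):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "Example.app")
        evil = os.path.join(tmp, "elsewhere")
        write(os.path.join(root, "Versions", "1", "fr.lproj", "msg.txt"),
              "Tous droits réservés.")
        write(os.path.join(evil, "msg.txt"), "something else entirely")

        link = os.path.join(root, "fr.lproj")
        os.symlink(os.path.join("Versions", "1", "fr.lproj"), link)
        before = seal(root)

        os.remove(link)
        os.symlink(evil, link)  # retarget the link outside the bundle
        print("seal unchanged:", seal(root) == before)
```

Every sealed file is byte-for-byte identical before and after, yet anything that reads fr.lproj/msg.txt through the link now gets different content.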

Reminder!

Chrome’s silent auto-update provides a way to get new versions onto your hard drive, but it doesn’t do anything to make sure that you stop using an obsolete version and switch to a newer one. Without quitting Chrome, a user might obliviously go on using an obsolete copy even after one or more updates are installed. Chrome doesn’t crash nearly frequently enough to force users to restart it (or to find a more stable web browser). As a form of gentle encouragement, Chrome’s got an upgrade detector that puts a little badge on the wrench menu’s icon a while after an update is installed. If you notice the badge and open the wrench menu, you’ll be encouraged to “update Google Chrome.” Because the update has already been installed by the time the badge shows up, this menu item just quits and restarts Chrome, so what’s perceived as an “update” is actually incredibly rapid. Thanks to the session-restore feature, the new version of Chrome will start up and open all of the same tabs you had open in the old version.
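The badge logic amounts to a version comparison plus a delay. The sketch below is a guess at the shape of such a check rather than Chrome's upgrade detector itself: the function names are hypothetical, and the delay is a parameter because “a while” isn't specified here.

```python
def parse_version(text):
    """Turn '12.0.712.0' into (12, 0, 712, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def should_show_badge(running, installed, seconds_since_install, delay_seconds):
    """True when a newer version is on disk and has been waiting long enough."""
    newer_on_disk = parse_version(installed) > parse_version(running)
    return newer_on_disk and seconds_since_install >= delay_seconds


if __name__ == "__main__":
    print(should_show_badge("11.0.696.16", "12.0.712.0",
                            seconds_since_install=7200, delay_seconds=3600))
```

Comparing tuples of integers instead of raw strings matters: as strings, "9.0.0.0" would sort after "12.0.712.0".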

March 28, 2011

Of Hacks and Helpers

What’s a Google Chrome Helper, and what has it ever done for me? Also: Google Chrome Framework?

When my team brought Google Chrome to Mac OS X, we were faced with some interesting engineering challenges. One of these was integrating Chrome’s multi-process architecture with the Mac’s application model.

Ping-Pong

Chrome’s multi-process design is such that there’s a single main process, called the “browser” process. (For the technically declined, a “process” is simply a program that’s running.) The browser process communicates with the user by displaying things on the screen, and by taking the appropriate action when the user moves the mouse or presses keys. It’s also a sort of hub connecting Chrome’s other processes.

Whenever Chrome is open, you’ll have exactly one browser process running. If any web sites are open, you’ll also have one or more “renderer” processes. Renderers are responsible for turning web sites into something that you can see and interact with. When you type a web address into the omnibox, the browser starts up a renderer if necessary, and then asks it to load the site. In turn, the renderer asks the browser to go out and get the data it needs from the network, so the browser makes these requests, and ferries the responses back to the renderer. As the data comes in, the renderer builds up an image of what you should see. It passes this back to the browser for display, which dutifully complies, and the web site shows up on your screen. When you click a link, the browser receives the click, and passes it off to the renderer, which might take action by giving the browser something new to display, asking the browser for more data from the network, or both. In the business, we call this “interprocess communication” (or IPC for short), but “ping-pong” is just as good a description.

Chrome’s design includes other process types, too. If you’re looking at a site that uses a plug-in like Flash, there’ll be a “plug-in process” in the mix, whose job is to load and run the plug-in. It communicates (IPC again) with both the browser and a renderer.

Capture the Flag

Chrome’s multi-process guts deviate from traditional application norms. Simple applications exist in a single process. On the Mac, an application’s process is associated with an icon in the Dock. One application, one process, one Dock icon. Easy. In Chrome’s case, our goal is to arrive at one application, many processes, one Dock icon. The special sauce is tying everything together so that this works seamlessly.

When a Chrome browser process needs to start up a renderer, plug-in, or any other child process, it does so by starting a new process that will launch Chrome from the very beginning again, but with a flag telling the newborn process how it should specialize itself. One of the first things that any Chrome process does is check this flag. If there’s no flag, it becomes a browser, which is what happens when you start Chrome up yourself. If the flag says “renderer,” it will become a renderer, find the browser that started it, and set up a new game of ping-pong.
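In code, the “capture the flag” check looks something like the sketch below. It's a simplified model, not Chrome's startup code: the --type=ROLE spelling of the flag is an assumption here, and the function names are invented for illustration.

```python
import sys


def process_type(argv):
    """Return the role named by a --type=ROLE flag, or 'browser' if absent."""
    prefix = "--type="  # assumed spelling of the flag
    for arg in argv[1:]:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return "browser"


def child_command_line(executable, role):
    """Command line a browser would use to relaunch the program as a child."""
    return [executable, f"--type={role}"]


def main(argv):
    role = process_type(argv)
    if role == "browser":
        return "running as browser"
    return f"running as {role}, connecting to the browser that launched us"


if __name__ == "__main__":
    print(main(sys.argv))
    print(main(child_command_line(sys.argv[0], "renderer")))
```

The important property is that the same program serves every role, and the decision is made before anything else happens.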

Chrome performs this “capture the flag” protocol as a precursor to doing anything else, but nothing else on the Mac is aware of this convention, because we made it up. One aspect of software engineering is that when the only things you need to interact with are entirely within your own control, you can very easily invent whatever scheme you need to get the job done. Another aspect is that when you do something like this, you might miss another detail, like the fact that your application needs to run within a larger system, and you might need to interact with that system somehow, too.

Will the Real Chrome Icon Please Stand Up?

The problem in this case was that the rest of the Mac system—including the Dock—couldn’t distinguish between the Chrome browser and all of its child processes, like renderers. As far as the Dock was concerned, each Chrome process was just another instance of Chrome. Each would get its own icon. In fact, every time Chrome would start another process, another icon would show up, and it would persist as long as the new process was still running. Imagine seeing a half dozen Chrome icons down there, dancing around while you worked. If you clicked on any of them except for the one associated with the browser process, nothing interesting would happen, because they weren’t instructed to specialize as browsers. If the Dock couldn’t tell the difference between a genuine Chrome browser process and those impostors, how would you ever be able to?

You wouldn’t. And I like and respect you too much to subject you to that kind of harrowing experience.

The problem isn’t restricted to the Dock. If you use the Command-Tab keyboard shortcut to switch between applications, all of those extra Chromes would show up there, too. Without additional care, the Mac might even mistakenly assume that the extra processes are “stuck” or “hung” or “not responding” because they don’t behave as it expects proper UI applications to behave.

Let’s do better.

The Lying King

The first thing we tried, which was an embarrassing, messy, temporary stop-gap solution (known in the biz as a “hack”), was to just flat-out lie. Chrome identified itself to the system as something less than a fully-fledged UI application, because only fully-fledged UI applications qualify for Dock icons. (In Mac parlance, Chrome declared itself as an LSUIElement.) Of course, there was still no differentiation from the Dock’s perspective: now, instead of each Chrome process getting an icon, none would. Not even the browser. That was obviously bad.

The second step of the hack was to admit having lied about the browser process not being a full UI application. Since the capture-the-flag strategy was able to distinguish between process types with ease, when Chrome detected that it was being launched as a browser, it turned itself into a proper UI application by calling TransformProcessType, essentially undoing what LSUIElement did and doing what LSUIElement didn’t.

This approach still had its share of problems:

  • Since the browser process started its life as an LSUIElement, certain things that should have happened really early on while it was starting up didn’t happen. For example, Chrome wouldn’t start up in the foreground. It’d take an extra click just to start working with the first Chrome window. We had to add another hack to account for that deficiency. (Hacks are nasty creatures. They have a tendency to multiply in this way.)
  • The hack that brought the Chrome window to the foreground caused Chrome to take a measurably longer time to launch. It’s slower to start out in the background and switch to the foreground than it is to just start out in the foreground.
  • The Dock icon wouldn’t bounce when Chrome was starting up. By the time the browser process was able to properly identify itself as a UI application, any bouncing would have concluded. Interestingly, our icon not bouncing caused people to perceive that Chrome was launching faster than it actually was, when in reality, the effects of the compound hacks made it slower.
  • Perhaps worst of all, it made me feel terrible.

Help!

The real solution to all of these problems was to split Chrome up so that the rest of the Mac system would see it as two applications: one for the browser, and one for everything else. We named the “everything else” application the “helper.” The browser application can only specialize as a browser process, and the helper application can specialize as anything else. The browser application declares itself as a UI application, the helper’s declared as LSBackgroundOnly (which is like LSUIElement with even more restrictions). Since applications on the Mac are actually just directories, the helper application is nested inside the browser, so that wherever the browser goes, it takes its helper with it.
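The declarations themselves live in each application's Info.plist. As a rough illustration, here's how a pair of minimal property lists could be generated. LSUIElement and LSBackgroundOnly are real Info.plist keys, but the bundle identifiers are made up, and a real Info.plist carries many more entries than this.

```python
import plistlib


def info_plist(bundle_id, ui_key=None):
    """Build a minimal Info.plist, optionally opting out of the Dock.

    ui_key is None for a regular UI application, or one of the
    'LSUIElement' / 'LSBackgroundOnly' keys for a helper.
    """
    info = {"CFBundleIdentifier": bundle_id}
    if ui_key is not None:
        info[ui_key] = True
    return plistlib.dumps(info)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    browser = info_plist("com.example.browser")
    helper = info_plist("com.example.browser.helper", ui_key="LSBackgroundOnly")
    print(browser.decode("utf-8"))
    print(helper.decode("utf-8"))
```

Since the role is fixed by which application gets launched, the system can tell the two apart before any of the program's own code runs.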

Framers at Work

Having two copies of what is essentially the same application sounds like it might use twice as much space as a single-application approach. It doesn’t, because neither application actually contains any Chrome code. All of the true Chrome code exists in only one place, a shared library that we call the “framework.” The framework is another component nested inside the browser, alongside the helper. The only thing that the browser and helper applications do is find the framework, load it, and then jump to it.

This architecture means that the actual code that lives in the Chrome browser application is about as small as anything you’ll ever encounter.

Serving Process

You can see all of Chrome’s different processes at work on your computer by choosing Tools:Task Manager from the wrench menu. Everything listed in the Task Manager is actually a process. You can also use whatever tools your system provides for examining processes. On a Mac, the Activity Monitor will show you all of the Chrome processes.

You can also examine the innards of the Chrome browser application on the Mac. If you control-click the Chrome icon in the Finder and choose Show Package Contents, and then you poke around enough, you’ll find both the Google Chrome Helper and its companion, the Google Chrome Framework. You might also notice other unconventional aspects in its structure. I plan to explain some of those in a future article.

Postscript

While writing this article, I discovered that I had declared the helper app as an LSUIElement instead of LSBackgroundOnly. While both result in no Dock icon for the application, the subtle difference between these two is that LSUIElement applications are permitted to create user interface elements (putting things on the screen), whereas LSBackgroundOnly applications are not. Chrome’s helpers never need to create any UI, so LSBackgroundOnly is the more appropriate choice. I checked in a change to Chrome to correct this oversight.

Why had we used LSUIElement as opposed to LSBackgroundOnly in the first place? At this point, I don’t even remember. The hack was initially conceived over two years ago and was removed after six months; LSUIElement, eradicated earlier this month, may be its last remnant. That’s another problem with hacks: they tend to outstay their welcome, and can leave detritus in their wake long after they’re forgotten.

Post-Postscript

March 29, 2011

It seems that LSBackgroundOnly was a bit too restrictive. Some plug-ins have a legitimate reason to display windows. The Flash plug-in, for example, can show a file-open dialog. Gmail uses this feature to attach files to messages. With LSBackgroundOnly, the plug-in process was allowed to show the window, but the window couldn’t be brought to the foreground. This presented an obvious problem: the attachment window would most likely be hidden behind the main browser window, and even if a user did manage to expose it, it would have been difficult to interact with. As of earlier today, Chrome is back to using LSUIElement. Why LSBackgroundOnly processes are allowed to show any windows at all remains a mystery.

Columnize This

Interesting things happen to me. I thought you might find them interesting too, so I’m going to share them. Hi, I’m Mark Mentovai. I’m a software engineer at Google, and I work on Google Chrome, serving as the Mac version’s tech lead. Here’s a picture of me with my team. You can recognize me by my tech lead’s uniform, an all-black get-up complete with a helmet and a flag. In this picture, I’m standing on what appears to be a tank, technically leading my team to obliterate a garbage can.

As the Jackass of All Trades, I’ve got other interests. I’m not just going to be writing about Chrome, but the first few articles I’ve planned are about Chrome, and why certain things were designed the way they were. The most interesting (and enduring) aspects of these stories are the problems that my team encountered along the way and how we solved them. I’ve also got a theory that it’s possible to write for a technical and non-technical audience at the same time, and produce something that each group can take something away from without talking down to anyone or making anyone else feel like they’re in over their head. I’m going to put that idea to work in my articles as best as possible.

Finally, I don’t like the sound of “blog” or “blogger,” so I’m going to call myself a “columnist.” Affectation? Maybe, but thanks for indulging me.