In software development, success is all about failure.
That may sound as if we are carrying the “fail fast, fail hard” philosophy a little too far, but which would you rather have — a quick, definite failure that you can identify, repeat, and analyze, or a series of semi-failures and partial recoveries that leave your application and your data in a corrupt and unstable state? Anybody who’s spent much time testing or debugging software should know the answer: errors that are masked by near-recovery are almost as hard to trace as intermittent errors — and all too often, there’s no real difference between the two. Slow failures masked by partial recovery show up as on-again, off-again bugs.
And yes, we are talking about failing forward — but before you say that you’ve heard it all before, let’s take a look at what that means. In software development (or just about anything else, for that matter), failing forward is failing in a way that allows you to identify and overcome the underlying problem, so that you can move forward, past the point where it has been giving you trouble.
The all-too-common alternatives to failing forward are generally pretty unappealing, once you recognize them for what they are: failing (or even refusing) to recognize the problem, patching it over without understanding it, trying immediate fixes and workarounds that make things worse in the long run, wrestling endlessly with the several-steps-down-the-line symptoms, cosmetic error-trapping designed to make the problem look like it’s being handled, etc.
What they all have in common is that they don’t aggressively engage the problem or attack it systematically; in fact, they treat it as a potentially endless struggle — an approach which can be lethal for a development project. Too often, the real alternative to failing forward is to fail all over the place, and without end.
So failing forward means, among other things: errors that make it clear to the user that an error has occurred and that identify where it occurred; logs that include every item of potential interest; and only very limited use of things such as graceful failure or automatic error recovery. It’s one thing to recover gracefully and go on if external resources or files aren’t available, or if the program encounters corrupted data, but it is quite another thing (and not a good one) to go into graceful recovery mode when the program has malfunctioned internally. When that happens, you’re just papering over an error that is likely to show up again, and again, and again.
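That distinction can be made concrete in code. Here is a minimal sketch (the function names and the config-file scenario are illustrative, not from any particular codebase): an external failure gets a logged, graceful fallback, while a violation of the program's own invariants raises a loud, locatable error instead of being absorbed.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")

def load_config(path):
    """External resource: recovering gracefully here is reasonable."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # An external failure: log everything of interest, then fall back.
        log.warning("config file %r missing; using defaults", path)
        return ""

def apply_discount(price, fraction):
    """Internal logic: fail fast and loudly on bad inputs."""
    # An out-of-range fraction means our own code is broken; do not
    # paper over it with a "graceful" fallback that masks the bug.
    if not (0.0 <= fraction <= 1.0):
        raise ValueError(
            f"apply_discount called with fraction={fraction!r}; "
            "expected a value in [0, 1]"
        )
    return price * (1.0 - fraction)
```

The point of the `ValueError` is exactly the visibility argued for above: it names the function, the bad value, and the expected range, so the failure can be identified, repeated, and analyzed rather than resurfacing later as an on-again, off-again bug.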
But what about the downside of failing so visibly? Isn’t it likely to be damaging in terms of customer goodwill? The short answer is yes, no, and maybe. Yes, it could do damage if you get in the habit of using your entire customer base as beta testers (although we could all probably name some very large and well-known companies that seem to do just that). No, it can’t be any worse than the goodwill lost by shipping software with slowly degrading performance. And maybe there’s a better approach than making all of your customers deal with your most spectacular failures.
And as it turns out, there are better approaches to implementing fail-forward development; one of the best (and most widely-used) of these approaches is the canary release strategy (named after the coal miners’ canary, whose job it was to drop dead if the air started to go bad).
With a canary release, the current version of your application remains on your main server system, and the vast majority of your customers continue to use it. You install the new version on an isolated set of servers, and test it for basic stability, functionality, and obvious bugs. At that point, you route a small group of users to the new version. This allows you to test the deployment under real-world conditions, knowing that if it does fail in a properly obvious and ungraceful way, the failure will only be visible to those users. When you’re satisfied with the new version’s performance with the initial test group, you can roll it out to a larger subset of users, and work out any remaining bugs that become apparent under a heavier load. After that, you can deploy it to your general user base, replacing the old version.
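The staged rollout just described can be sketched as a simple decision function. The stage percentages and the two-outcome health signal here are assumptions for illustration; real pipelines would gate on richer metrics.

```python
# Hypothetical rollout plan: each stage widens the share of users
# routed to the canary version; a failed stage triggers rollback.
STAGES = [0.01, 0.10, 1.00]  # 1% test group -> 10% -> everyone

def next_action(stage_index, healthy):
    """Decide what to do after observing the current rollout stage."""
    if not healthy:
        # Rollback is just routing everyone back to the stable version.
        return ("rollback", 0.0)
    if stage_index + 1 < len(STAGES):
        return ("promote", STAGES[stage_index + 1])
    # Final stage passed: the canary becomes the general release.
    return ("complete", 1.0)
```

Note how cheap the failure path is: because the old version never left the main servers, "rollback" carries no redeployment cost at all.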
This means that even if the new version is going to have a truly catastrophic, spectacular failure after operating under heavy load for some time, only the small initial test group (or the larger secondary test group) is likely to encounter it, and rollback is going to consist largely of routing those users back to the main, stable version.
In practice, of course, it isn’t quite that simple. Isolating a set of servers and switching a small group of users over to them requires you to manipulate your load-balancing system. Rather than routing users to servers based strictly on standard load-balancing criteria, it will need to filter out a specific set of test users and route them to the designated test servers, while continuing to perform its usual load-balancing functions with the remaining users and servers — doing all of this without any degradation in performance. Setting this system up initially may require a significant design-implementation-test cycle of its own, apart from the actual software test deployment.
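One common way to implement that filtering layer is deterministic hash-based routing: each user ID maps to a stable value in [0, 1), and users below the canary fraction go to the test pool. The server names and the 1% fraction below are hypothetical, and the in-pool balancing is a stand-in for whatever your real balancer does.

```python
import hashlib

CANARY_SERVERS = ["canary-1", "canary-2"]       # hypothetical test pool
STABLE_SERVERS = ["app-1", "app-2", "app-3"]    # hypothetical main pool
CANARY_FRACTION = 0.01  # route roughly 1% of users to the canary pool

def bucket(user_id):
    """Map a user id to a stable pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def route(user_id):
    """Pick a pool for this user, then balance within that pool."""
    if bucket(user_id) < CANARY_FRACTION:
        pool = CANARY_SERVERS
    else:
        pool = STABLE_SERVERS
    # Within the chosen pool, fall back to ordinary balancing (a stable
    # hash here; a real load balancer would also use live load data).
    return pool[int(bucket(user_id) * 1000) % len(pool)]
```

Because the assignment is a pure function of the user ID, a given user always lands on the same pool across requests, and rollback is a one-line change: set the canary fraction to zero.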
You also need to give some thought to the process of selecting a test group. Your first, small group should provide you with realistic load and use conditions, so while it could be in-house under some circumstances, it is often better to select people from your user base by some method (generally random or semi-random) that is unlikely to bias the results. In practice, this requires a sufficiently large and active user base, and even then, canary deployment may not be appropriate for every application or audience (for example, mission-critical applications where a failure could mean loss of property or life).
Despite these limitations, the combination of fail-forward development and canary release can serve as a very powerful strategy for implementing continuous deployment and maintaining high software quality standards without exposing your users to significant risk or inconvenience.