There’s an adage, sometimes called the fail-fast principle, that software failures should be immediate and visible. The rationale is that a borked program wastes time and other resources producing output that can’t be trusted anyway. The best you can hope for is to learn as early as possible that something went wrong, so you (or an automated system) might fix whatever caused the problem, then rerun the program. For most programmers, most of the time, this is good advice; but it has real limitations that we ought to recognize.
First, failing loudly isn’t good enough. The sound of failure must be useful. For example, if you’ve spent much time on pager duty (aka on-call rotations), you probably despise alerts that aren’t actionable. It’s not enough to know that something broke; what are you supposed to do with that information? Lousy alerts waste your time and, worse, train you to ignore alerts in general. A similar issue applies to compiler diagnostics, especially deprecation warnings: they’re not useful unless you have some idea how to fix them. Sites like Stack Overflow are popular largely because they map cryptic error messages to actionable guidance.
Second, and more importantly, fail-fast passes the buck to some larger system. It presumes someone or something will be there to pick up the pieces. This isn’t always the case. Sometimes, a broken system is better than no system at all, especially when controlling hardware that would otherwise stop working entirely. If control software on an airplane reports an error, maybe it’s OK for a single process to shut down; but at some point, some part of the stack has to support continued operation, because it’s not OK to let the plane drop out of the sky. Control software should not fail fast and loudly when absolute surrender is not an option. Embedded software, be it on a cell phone or a pacemaker, must continue operation at least well enough to download patches, or else the host device might as well be a brick.
Here are some approaches that we in the software industry, especially engineers and product managers, should take seriously.
Demo failure modes, not just happy paths. Insist that vendors do likewise. What happens when you feed the program bad input? Run out of disk space? Have a lousy Internet connection? Software should help solve your problems, not exacerbate them. Take pride in your product’s resilience, and teach your users to expect not only quality, but craftsmanship.
Degrade gracefully. Even a command-line tool passed bad arguments can usually do better than printing a generic usage message or, worse, a stack trace. Ask yourself what users should do in the face of each potential error, and try to make that action as clear and explicit as possible. Don’t assume they’re domain experts.
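To make that concrete, here’s a minimal sketch (in Python, around a hypothetical `wordcount` tool) of the kind of error handling I mean: each failure names the offending argument, says what was expected, and suggests a next step.

```python
import sys
from pathlib import Path

def main() -> int:
    # A real tool would use argparse; this is stripped down to focus on the error paths.
    if len(sys.argv) != 2:
        print("usage: wordcount FILE", file=sys.stderr)
        print("example: wordcount notes.txt", file=sys.stderr)
        return 2

    path = Path(sys.argv[1])
    if not path.exists():
        # Name the argument, say what went wrong, and suggest what to do next.
        print(f"error: '{path}' does not exist.", file=sys.stderr)
        print("Check the spelling, or pass a path relative to the current directory.",
              file=sys.stderr)
        return 2

    try:
        text = path.read_text()
    except UnicodeDecodeError:
        print(f"error: '{path}' does not look like a plain-text file.", file=sys.stderr)
        print("This tool only counts words in text files.", file=sys.stderr)
        return 2

    print(len(text.split()))
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

None of this is sophisticated; the point is that every exit path tells a non-expert user something they can act on.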
Formally test error cases. Check that functions return (or throw) meaningful errors when passed invalid arguments. Think hard about what’s going to make life easier for the person calling your code. Remember that in practice, bad input to your function is probably the result of some undetected upstream error, and take pity on the person trying to debug the resulting breakage.
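As a rough sketch of what that looks like in practice (using pytest and a hypothetical `parse_port` helper), the tests below assert not just that bad input is rejected, but that the error message tells the caller what to fix:

```python
import pytest

def parse_port(value: str) -> int:
    """Parse a TCP port number, failing with an error the caller can act on."""
    try:
        port = int(value)
    except ValueError:
        raise ValueError(f"port must be an integer, got {value!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be between 1 and 65535, got {port}")
    return port

def test_rejects_non_numeric_input_with_actionable_message():
    with pytest.raises(ValueError, match="must be an integer"):
        parse_port("http")

def test_rejects_out_of_range_port_and_names_the_limit():
    with pytest.raises(ValueError, match="between 1 and 65535"):
        parse_port("70000")

def test_accepts_valid_port():
    assert parse_port("8080") == 8080
```

Pinning down the wording (or at least the key phrases) of error messages in tests feels fussy, but it keeps those messages from quietly degrading into the unhelpful kind this post complains about.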
If you have further ideas about how we can do better than blanket application of the fail-fast principle, please share them in the comments.