Redundancy, Part 1

Sep 30, 2022

Computer programs come in two basic kinds: Those that run as needed, and those that run continuously, responding to events as they happen. The former are usually called batch jobs or applications, and the latter daemons or services (or microservices if they’re particularly svelte).

Suppose we’re responsible for a mission-critical service; that is, a computer program that’s supposed to keep running all the time so it can respond to important events like client requests. Nothing in this world is perfect, so there’s always a chance the service will crash, causing downtime for our system and our business. Hardware failures and software bugs (especially race conditions) are the usual culprits, but let’s assume that in any given second of time, there’s very little chance—say, one in a million—that our service will crash. In other words, if our service is humming along happily, there’s a 99.9999% chance it will continue to do so for at least the next second. Let’s call that chance P, and the one-in-a-million chance of failure Q.

$ node
Welcome to Node.js v16.17.0.
Type ".help" for more information.
> Q = 1e-6
0.000001
> P = 1 - Q
0.999999

The odds of failure compound geometrically. For example, the chance that our service will stay up for at least the next three seconds is the product of the chance that it won’t crash in the next second, nor in the second after that, nor the one after that: P * P * P, or P**3.

> P ** 3
0.999997000003

In general, the chance that a happy service will remain so for >=N seconds is P**N. The chance that our service will stay up for at least a minute is P**60, for an hour P**3600, etc. So, how long will it be before our service is likely to crash? What’s the time horizon beyond which the chance of failure exceeds 50%? Setting P**N = 0.5 and solving for N gives N = log(0.5) / log(P) seconds, which we can then convert to days. The answer is not long, as it turns out, because geometric decay is downright nasty.

> Math.log(0.50) / Math.log(P) / 3600 / 24
8.022532800536636

Eight days. That’s it. If you run a single-instance monolithic service in production, you’re liable to have a complete outage every couple of weeks. The one-in-a-million Q value we’ve been using might charitably be called “imprecise,” but this jibes with real-world experience: Too many small companies deal with random outages almost every week. These outages take a harrowing toll not only on productivity, but on the quality of life of the engineering and product support teams, including client-facing staff like Account Managers. People shrug off this absurd burden as though it’s inevitable (“Startup life, am I right?”) and euphemize it in job postings as a “fast-paced environment,” but it’s truly an awful way to live. What’s more, it’s easily avoided.
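Because the one-in-a-million figure is only a guess, it can help to see how the 50% horizon moves as Q changes. What follows is a minimal sketch, not a definitive model: `daysUntilLikelyCrash` is a name made up for this illustration, and it uses `Math.log1p(-q)` in place of `Math.log(1 - q)` only to limit floating-point rounding when Q is tiny.

```javascript
// Hypothetical helper (not from the original post): days until the chance
// of at least one crash exceeds `threshold`, given per-second crash chance q.
function daysUntilLikelyCrash(q, threshold = 0.5) {
  // Solve (1 - q) ** n = 1 - threshold for n, measured in seconds.
  const seconds = Math.log(1 - threshold) / Math.log1p(-q);
  return seconds / 3600 / 24;
}

// Try per-second crash chances ten times worse and ten times better than Q.
for (const q of [1e-5, 1e-6, 1e-7]) {
  console.log(q, daysUntilLikelyCrash(q).toFixed(1), 'days');
}
```

The horizon scales inversely with Q: roughly 0.8 days, 8 days, and 80 days for these three values. Even a service ten times more reliable than our assumption would be likely to crash within a few months.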

The first and most obvious approach—and the only one we’ll cover in this post—is to run two instances of every service. Any more than two is gravy.

🪤 If you’re an engineer who has worked at large Software as a Service (SaaS) companies, then advice like “have redundancy in prod” may sound utterly obvious. But if ever you start bopping around startups as an advisor or contractor, you will be stunned at how common these fault-intolerant monoliths are, mostly because the tech teams don’t understand the trade-offs involved.

Because we’re discussing random crashes, one instance crashing is independent of the other crashing, so the chance that both crash in the same second is Q*Q. We can thus replace Q with Q*Q in our formulae above, making our new probability of surviving the next second 1 - Q*Q. Let’s call that P2:

> P2 = 1 - Q*Q
0.999999999999

Now how long can we expect at least one instance of our service to stay up?

> Math.log(0.50) / Math.log(P2) / 60 / 60 / 24
8022714.288272493
> _ / 365.25
21964.9946290828

Almost 22,000 years. Individual service instances will still crash as often as before, but those crashes won’t cause major outages, because both instances are extremely unlikely to crash at the same time. (One could crash while the other is already down, but that chance becomes minimal if we address crashes promptly—even if all we do is restart the crashed instance.) We still have to deal with the crashes, but they’re no longer oh-my-god all-hands-on-deck emergencies, and they don’t impact users at all.
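The same arithmetic extends to any number of instances, under the same simplifying assumption that an outage requires every instance to crash in the same second. A rough sketch follows. `yearsUntilLikelyOutage` is a hypothetical helper, and it calls `Math.log1p` because `1 - Q**3` rounds to exactly 1 in double-precision floating point. For two instances, its result can differ slightly in the last digits from the REPL figure above, since the REPL's `1 - Q*Q` is itself rounded.

```javascript
// Hypothetical helper: years until the chance of a full outage exceeds
// `threshold`, assuming an outage needs all `instances` to crash in the
// same second and that crashes are independent of one another.
function yearsUntilLikelyOutage(q, instances, threshold = 0.5) {
  const outagePerSecond = q ** instances;
  const seconds = Math.log(1 - threshold) / Math.log1p(-outagePerSecond);
  return seconds / 3600 / 24 / 365.25;
}

for (const n of [1, 2, 3]) {
  console.log(n, yearsUntilLikelyOutage(1e-6, n).toPrecision(3), 'years');
}
```

With Q at one in a million, this gives about 0.022 years for one instance, about 22,000 years for two, and roughly 22 billion years for three. That is the sense in which anything beyond two instances is gravy, at least as far as random crashes go.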

Productivity and quality of life improve tremendously if we run more than one instance, especially if our service is a little flaky to begin with. Going multi-instance does impose a couple of architectural requirements though:

1. We need a load balancer (LB) like HAProxy to automatically route traffic to a healthy instance. Our service also needs a health check endpoint, so the LB can tell when an instance isn’t feeling well. (But if we don’t already have a health check, how do we even know when we’re having an outage? Wait for customers to complain? <shiver>)
2. Our service instances must be stateless. The system as a whole needn’t be stateless, and can keep whatever database (DB) it already uses. But now, it’s especially important that we not keep any state where only one instance can access it, such as in-memory variables.
   People sometimes try to skirt this requirement using “sticky sessions,” meaning they ensure that consecutive requests from a single user are always routed to the same instance. Don’t do that. It complicates the LB config, boots users out of their sessions whenever an instance crashes, and normalizes instance-local state that may not even be session-specific. Put your data in a DB, or in shared “data structure servers” like Redis.
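A health check endpoint can be very small. Here is one possible sketch using Node's built-in `http` module. The `/healthz` path, the port mentioned afterwards, and the `dependenciesOk` probe are all assumptions made for illustration; a real probe would check whatever the service actually depends on.

```javascript
const http = require('http');

// Hypothetical probe: replace with a real check, such as a cheap DB query.
function dependenciesOk() {
  return true;
}

// Map the probe result to an HTTP status the load balancer can act on.
function healthResponse(probe = dependenciesOk) {
  let ok = false;
  try {
    ok = probe() === true;
  } catch (err) {
    ok = false; // a throwing probe counts as unhealthy
  }
  return ok ? { status: 200, body: 'ok' } : { status: 503, body: 'unhealthy' };
}

const server = http.createServer((req, res) => {
  if (req.method === 'GET' && req.url === '/healthz') {
    const { status, body } = healthResponse();
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(body);
    return;
  }
  // Real request handling would go here.
  res.writeHead(404);
  res.end();
});
```

Each instance would then call something like `server.listen(8080)`, and the LB would poll the path periodically, taking an instance out of rotation when it stops answering with a 200. In HAProxy, the `option httpchk` directive is one way to set up that kind of polling.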

[Diagram: a user, a service instance, and a database on the left; a user, a load balancer, two service instances, and a database on the right. Boxes labeled "Session state" appear in the service instance on the left and the database on the right.]

On the left, users send requests directly to a single service instance, which may be stateful. On the right, requests are sent through a load balancer that routes them to either of two instances, both of which are stateless. The user on the left probably shouldn’t be smiling, because the single-instance system is an outage waiting to happen.
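To make the stateless arrangement concrete, here is a small sketch in which two instances share one store. `SharedStore` is an in-memory stand-in for a DB or Redis, and its `get`/`set` interface is hypothetical. Real clients are usually asynchronous, and a real store would want an atomic increment rather than the read-then-write used here.

```javascript
// In-memory stand-in for a shared store such as a DB or Redis. The get/set
// interface is hypothetical, chosen only for illustration.
class SharedStore {
  constructor() {
    this.sessions = new Map();
  }
  get(id) {
    return this.sessions.get(id);
  }
  set(id, session) {
    this.sessions.set(id, session);
  }
}

// Each instance keeps no session data of its own; it reads and writes the
// shared store on every request.
function makeInstance(name, store) {
  return function handle(sessionId) {
    const session = store.get(sessionId) || { requests: 0 };
    const updated = { requests: session.requests + 1 };
    store.set(sessionId, updated);
    return { servedBy: name, requests: updated.requests };
  };
}

const store = new SharedStore();
const instanceA = makeInstance('A', store);
const instanceB = makeInstance('B', store);

console.log(instanceA('session-1')); // first request lands on instance A
console.log(instanceB('session-1')); // the next lands on B and still sees the session
```

The second call reports two requests even though a different instance served it, which is exactly what lets the LB send any request to any healthy instance.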

In upcoming Deeply Nested posts, we’ll discuss further use of redundancy to improve uptime, and to gain other benefits like economies of scale. In the meantime, please do share relevant thoughts or experience (or horror stories!) in the comments. And here’s a question for you: Going forward, should system architecture posts continue discussing high-level issues like redundancy; dive deeper into topics like orchestration and containerization; or get into the nitty gritty of how to configure particular tools like Kubernetes and Docker?
