One of the nastier traps that snags project leads is looking at averages when we ought to focus on medians. A real-time bidding platform can have a great average latency, yet lose almost every auction. A call center might dramatically reduce its average hold time even as the median, and thus the hang-up rate, actually increases.
Both metrics are simple: The average of a bunch of numbers (or more formally, the mean) is their sum divided by their count, whereas the median is the middle element when you sort them. (When the count is even, there is no single middle element, so the median is conventionally the mean of the two middle elements.)
>>> jenny = (8, 6, 7, 5, 3, 0, 9)
>>> average = sum(jenny) / len(jenny)
>>> median = sorted(jenny)[len(jenny) // 2]
>>> average, median
(5.428571428571429, 6)
Here’s a sports example: Suppose your local baseball team trounces its opponent in the first game of a three-game series 12 runs to 1, but loses the other two games in 4-run shutouts. The home team has won the “average game” 4 runs to 3, yet lost the median game (sorting by run differential) 4 runs to zilch. Something similar happened in 1975, when the Cincinnati Reds beat the Boston Red Sox in the greatest 7-game World Series ever played: Boston scored more runs per game, but Cincinnati won the median game 4 to 3, and with it the championship.
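The arithmetic for the hypothetical three-game series is easy to check with the standard-library statistics module (the scores are the made-up ones from above):

```python
import statistics

# Runs scored by (home, visitor) in the hypothetical three-game series.
games = [(12, 1), (0, 4), (0, 4)]

home_runs = [h for h, v in games]
visitor_runs = [v for h, v in games]

# The home team wins the "average game" 4 runs to 3...
print(statistics.mean(home_runs), statistics.mean(visitor_runs))  # 4 3

# ...but sorted by run differential, the median game is a 4-run loss.
diffs = sorted(h - v for h, v in games)
median_diff = diffs[len(diffs) // 2]
print(median_diff)  # -4

# What actually decides the series: how often you win.
wins = sum(1 for h, v in games if h > v)
print(wins, "of", len(games))  # 1 of 3
```

One lopsided blowout inflates the average without changing the win count, which is the whole point.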
We keep making this mistake because a fundamental principle is not intuitive: For any aggregation of win/loss situations, it doesn't matter how much you win by. You need to win more often, not win by more points. Aggregation obscures the binary nature of win/loss outcomes: We can't see the trees for the forest.
It doesn't help that we conflate the terms in ordinary conversation. When we say someone is an average Joe, we really mean they’re a median Joe. In a society where the wealthiest 1% of Joes have a billion dollars apiece, while the other 99% have only a thousand dollars each, the average Joe has over ten million dollars. Meanwhile, the median Joe—who is a person, not a mathematical fiction like the average—can barely pay rent.
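The wealth figures above are easy to verify. A sketch with a stylized population of 100 Joes, using the numbers from the paragraph:

```python
import statistics

# Stylized society: the wealthiest 1% have $1B apiece,
# the other 99% have $1,000 each.
wealth = [1_000_000_000] * 1 + [1_000] * 99

average_joe = statistics.mean(wealth)
median_joe = statistics.median(wealth)

print(f"average Joe: ${average_joe:,.0f}")  # average Joe: $10,000,990
print(f"median Joe:  ${median_joe:,.0f}")   # median Joe:  $1,000
```

A single outlier drags the mean four orders of magnitude away from the value that describes nearly everyone.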
In some sense, the median value is no “worse” than at least half of the values in its cohort. We can also think in percentiles: the value called P90 is no worse than 90% of the values in the dataset; P95 is no worse than 95%; and so on. The median is P50. Monitoring percentiles and plotting them on a dashboard in real time can be incredibly useful to SaaS companies, DevOps teams, and anyone else who would be immediately impacted by service failures. (I don’t know of a standard name for this kind of chart. If you do, please enlighten me in the comments.)
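Percentiles are cheap to compute. Here's a minimal sketch using the standard-library `statistics.quantiles` function on some simulated latencies (the lognormal distribution is just a stand-in for a realistic right-skewed latency shape):

```python
import random
import statistics

random.seed(0)
# Simulated request latencies in milliseconds (right-skewed, like real traffic).
latencies = [random.lognormvariate(5, 0.5) for _ in range(10_000)]

# With n=100, quantiles() returns the 99 cut points P1 through P99.
cuts = statistics.quantiles(latencies, n=100)
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]

print(f"P50={p50:.0f} ms  P90={p90:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")
```

These four numbers are exactly the series you'd plot as lines on that dashboard.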
The horizontal lines represent Service Level Objectives (SLOs). In this chart, the P50 latency SLO is 300 ms, and the P99 SLO is 800 ms. That means at least 50% of service requests should get a response within 300 ms, and 99% should get a response within 800 ms, or we have a problem. In this example, SLO violations began around 11:20 a.m., but the system was back on track by 2:15 p.m. A heads-up observer watching the chart in real time could have seen the writing on the wall by 10:30 a.m., and maybe taken steps to mitigate the problem, which is a heck of a lot better than getting paged after the latency hits the fan. Another common pattern is a sudden spike, meaning something has just gone wrong: That’s obviously less useful for prevention, but fantastic for debugging, because it shows you exactly which timestamp to look for in your service logs.
Let’s wrap up this post with a visualization that remains surprisingly unfamiliar among software engineers: the box plot. Here’s a box plot of latencies (unrelated to the SLOs we just looked at) grouped by calendar quarter:
Each vertical line represents a dataset divided into four equal-sized buckets called quartiles. The top and bottom of each line represent the max and min values, the horizontal divider represents the median, and the top and bottom of the box mark the 75th and 25th percentiles respectively. Sometimes you’ll see horizontal “whiskers” across the top and bottom of each line.
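The five numbers a box plot draws per dataset are the min, the three quartile boundaries, and the max. A sketch of computing them from raw samples (the latency values are made up for illustration):

```python
import statistics

def five_number_summary(samples):
    """Return (min, P25, median, P75, max) -- the values a box plot draws."""
    # With n=4, quantiles() returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(samples, n=4)
    return min(samples), q1, q2, q3, max(samples)

latencies = [120, 135, 150, 160, 180, 210, 260, 340, 500]
print(five_number_summary(latencies))
```

Feed one such summary per quarter to your plotting library of choice and you have the chart above.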
Box plots aren’t ideal for real-time monitoring, but they’re awesome for seeing the big picture, say at the end of an OKR cycle. This example shows that even though average latency got worse over the course of a year (as the minimum and P25 increased), the median and worst-case latencies improved. Focusing on the median helps us achieve our objectives—in other words, win—more often.
Focus on the Median, Not the Average