In a perfect world, things that shouldn’t matter wouldn’t matter. We do not, as you may have noticed, live in such a world.
One thing that shouldn’t matter is the traversal order of data structures representing mathematical sets. Semantically unordered collections (like hash sets and database tables) come up frequently in software development, but they’re not really amenable to the usual “Divide and Conquer” approach.
Divide and Conquer and Subtle Interactions
We once discussed Divide and Conquer as a Project Management technique. This time, we’re discussing it in the context of data processing.
To “Divide and Conquer” is to break (Divide) a big problem into smaller ones that we can solve (Conquer) independently. For example, to add up a bunch of numbers, we can take them two at a time (Divide), and apply our handy binary + operator (Conquer) to get partial results, repeating until we’ve summed the whole collection:
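That pairwise summing can be sketched in a few lines of Python (the function name and recursive halving strategy here are illustrative, not the only way to Divide):

```python
def divide_and_conquer(values, combine):
    """Split the collection in half (Divide), solve each half
    recursively, then merge the two partial results with a
    binary operator (Conquer)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    left = divide_and_conquer(values[:mid], combine)
    right = divide_and_conquer(values[mid:], combine)
    return combine(left, right)

print(divide_and_conquer([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```

Nothing in this sketch cares *where* we split, or which half lands on the left of the `+`. That freedom is exactly what the next two properties buy us.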
Addition is a seductively flexible operation, because it is order agnostic in two important ways.
It’s associative, meaning it doesn’t care about the order of application: (1 + 2) + 3 = 1 + (2 + 3). It doesn’t matter how we Divide our collection into pairs.
It’s commutative, meaning it doesn’t care about the order of its operands: 1 + 2 = 2 + 1. When we Conquer each pair of numbers, it doesn’t matter which number goes on the left or the right.
As it turns out, though, most Software Engineering is a lot harder than adding up integers. In fact, much of computer programming entails being surprised to discover, again and again, that seemingly simple things are not so simple after all. For example, sticking with that handy dandy binary + operator:
Oops! The sum of a collection of floats depends on traversal order, because floating point addition is not associative. Specifically: epsilon is the gap between 1.0 and the next representable floating point value, which makes it roughly the smallest number that can be added to 1.0 without being discarded as a rounding error. The sum epsilon/3 + epsilon/2 is greater than epsilon/2, the rounding threshold at 1.0; so, if we add the numbers up in their original, sorted order, we’ve picked up enough “momentum” by the time we hit 1 to tip it over to the next representable value. If we instead sum these numbers in any other order, none of the individual fractions of epsilon is ever big enough to affect the accumulated sum. United they stand, divided they fall. Non-associative operations interact badly with the Divide step of Divide and Conquer dataset processing.
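You can reproduce the effect directly; here’s one way in Python, using the machine epsilon from `sys.float_info`:

```python
import sys

eps = sys.float_info.epsilon  # gap between 1.0 and the next float

# Smallest-first: the fractions of epsilon accumulate to 5/6 of
# epsilon, enough momentum to push 1.0 up to the next float.
ascending = (eps / 3 + eps / 2) + 1.0

# Largest-first: each fraction alone is below the rounding
# threshold at 1.0, so both are discarded.
descending = (1.0 + eps / 2) + eps / 3

print(ascending > descending)  # True: same numbers, different sum
```

The same three numbers, the same `+`, two different answers.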
💡 Fun fact: Rust doesn’t allow sets of raw floats, because their comparison semantics are so… shall we say, irregular?
That dirty double-crossing binary + operator, that we thought was our friend, is also commonly used for concatenation. Concatenation has the reverse problem of floating point addition: It’s associative, but not commutative. In other words, “Hello” + “World” is not the same as “World” + “Hello”. The concatenation of a set thus depends on traversal order. Concatenation cares which element comes first, but there is no meaningful “first” element of an unordered collection. Non-commutative operations interact badly with the Conquer step of Divide and Conquer dataset processing.
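A quick Python demonstration of both halves of that claim:

```python
# Associative: it doesn't matter how we group the pieces...
assert ("Hello" + " ") + "World" == "Hello" + (" " + "World")

# ...but not commutative: swapping operands changes the result.
assert "Hello" + "World" != "World" + "Hello"

# So concatenating a set depends on whichever order iteration
# happens to visit its elements:
words = {"Hello", "World"}
print("".join(words))  # "HelloWorld" or "WorldHello" -- who knows?
```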
There’s no such thing as an original sin.
—Elvis Costello, I’m Not Angry
In most languages, hash set traversal order may vary from run to run:
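Python is one such language: CPython salts string hashes per process (see `PYTHONHASHSEED`), so, assuming hash randomization is enabled (the default), the same program can traverse the same set in different orders on different runs:

```python
# str hashes are salted per process in CPython, so this set's
# iteration order can change from one run of the program to the next.
fruits = {"apple", "banana", "cherry"}
print(list(fruits))  # element order varies by run
```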
A common consequence is test flakiness. If you have tests that pass or fail seemingly at random, one likely culprit is a set being reduced by a non-associative or non-commutative function:
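Here’s a hypothetical sketch of the pattern (the `fingerprint` function and its tags are invented for illustration):

```python
def fingerprint(tags):
    # Buggy: concatenation is not commutative, so this result
    # depends on the set's iteration order.
    out = ""
    for tag in tags:
        out += tag
    return out

# A test like this may pass today and fail tomorrow, depending on
# which order the set happens to be traversed:
#     assert fingerprint({"alpha", "beta"}) == "alphabeta"
```

The assertion is commented out for a reason: sometimes it would hold, and sometimes it wouldn’t.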
Rust takes the opposite approach, making HashSet traversal order deliberately hard to predict. The goal is to prevent programmers (and attackers) from relying on hash implementation details. But the fact that Rust can save us almost entirely from the other main flaky test culprit (data races), yet really can’t help with traversal order issues at all, speaks to how profound this problem is.
It’s tempting to think the solution to set traversal order dependence might be to do all the work in parallel. Unfortunately, we can’t usually do everything all at once, because operations depend on each other’s results. We can’t start smooshing our partial results together into a final, cohesive solution until we’ve done some Conquering, and we can’t start Conquering until we’ve done some Dividing.
While the parallel collections abstraction feels very much the same as normal sequential collections, it’s important to note that its semantics differs, especially in regard to side-effects and non-associative operations.
—Scala Parallel Collections documentation
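A minimal parallel Divide and Conquer sketch (the chunking scheme and names are illustrative) makes the dependency visible — and shows where the algebraic requirements sneak back in:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_reduce(values, combine, chunks=4):
    """Divide the input into chunks, Conquer each chunk in a worker,
    then combine the partial results. The chunk boundaries are
    arbitrary, so `combine` must be associative; a scheduler that
    merged results as they finished would need commutativity too."""
    size = max(1, len(values) // chunks)
    parts = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda part: reduce(combine, part), parts))
    return reduce(combine, partials)  # the final, sequential smoosh

print(parallel_reduce(list(range(1, 101)), lambda a, b: a + b))  # 5050
```

Parallelism rearranges the work; it doesn’t excuse the operator from behaving itself.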
In that perfect world where we don’t yet live, programming languages would automatically recognize operations that might not be commutative or associative, and would warn us if we applied order-dependent operations to unordered collections. Until we live in that world, there are two ways to minimize the fallout from the bizarre reality that even unordered sets must be traversed in some order, however meaningless:
Be aware of the associativity and commutativity of the operations you rely on, especially when using them to Divide and Conquer (respectively). When designing new functions, try using them with unordered containers in your test suite, and do what you can to make them order-independent.
If your data can be meaningfully sorted, logarithmic access time is acceptable, and your standard library offers ordered collections like BTreeSet/Map, std::set/map, and TreeSet/Map, consider choosing them over hash-based implementations.
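Python’s standard library has no ordered set, so sorting at the point of use can stand in for one. Here’s a sketch of that second remedy, reusing the epsilon example from earlier:

```python
import sys
from functools import reduce

eps = sys.float_info.epsilon
data = {1.0, eps / 2, eps / 3}

# Pin down the traversal order before reducing; now every run
# (and every hash implementation) produces the same sum.
total = reduce(lambda a, b: a + b, sorted(data))
print(total)  # 1.0000000000000002, every single time
```

Sorting doesn’t make `+` associative, of course — it just makes the arbitrary order a *fixed* arbitrary order, which is often all a deterministic test needs.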
Please share relevant thoughts, research, or libraries in the comments.