If you want fast access to lots of data, they must be carefully organized. You can look up any dictionary or phone book entry in seconds, but only because the data are sorted. What’s more, maintaining almost any large data set requires small but frequent modifications. Telephone directories, astronomical catalogues, and lists of catalogues all have to be updated regularly lest they rot.
The ubiquitous pattern of small, frequent changes to large data sets is key to the efficiency of modern Version Control Systems (VCS). A repository can store thousands of versions of a code base without copying the whole thing for each version (as someone without a VCS might do), so long as each version is mostly the same as some earlier version. In fact, versions need not be chronologically ordered: Repositories can be forked and branched, then merged or diverged to represent concurrent states.
In software, arrangements of information are called data structures. Structures that support multiple, concurrent versions of data are described as persistent, in contrast to ephemeral structures that support only one version at a time. Persistent structures typically rely on structural sharing for efficiency. For example, when you make a Git commit, Git stores a tree representing each directory, with links to blobs representing files. If you commit a change to a single file, the new tree links to the same blobs as the old one, except for the file you changed. Two successive commits thus share the state of all unchanged files. Here’s a four-minute screencast you can watch if you’d like a more in-depth explanation:
Hash-based linking is the basis for distributed persistent systems like Git, but the same technique works using in-memory C-style pointers or object references. For example, a reference called ‘list_a’ might point to a linked list of numbers:
We can efficiently add an element simply by allocating a new node, and aiming our list reference at it:
But what if we still need the original, unmodified list for some reason, as well as the updated version? Instead of updating the original reference, we can add a new reference, list_b, and let the two logically distinct lists share most of their physical structure:
In fact, we can construct as many concurrent versions as we want, possibly constructed by parallel threads, all safely sharing a common tail:
All this structural sharing makes for extremely efficient storage and parallel processing. It’s not limited to linked lists, either: Since persistence was recognized in the 1980s, our toolbox of available structures has grown by leaps and bounds.
We develop simple, systematic, and efficient techniques for making linked data structures persistent. We use our techniques to devise persistent forms of binary search trees with logarithmic access, insertion, and deletion times and O(1) space bounds for insertion and deletion.
—Jim Driscoll, Making Data Structures Persistent, 1989
There’s one important caveat: Shared data must be immutable. Mutating elements of one list in a shared-structure family would implicitly change all the others, so we could no longer treat them as logically distinct. The same rule holds for hash-based structures like Git and blockchains, because an object’s hash code—effectively, its location in the immense address space that hashes-as-pointers emulate—depends on its value.
The easiest work-around for immutability in a persistent data structure is to copy a minimal subset of the original data. For example, if we need to insert a value in the middle of the original list, we can copy elements rather than move them:
More sophisticated strategies are also available, such as keeping a set of “delta” objects that represent semantic modifications atop the shared structure. There’s a lot more to discuss here, in terms of both implementation and application: Software Transactional Memory (STM), Chains of Trust, and hybrid node-based/flat structures like blockchains. Leave a comment if you’d like to dig deeper in any particular direction.
Since you're on this thread, I think it would be really useful to show an example of what a blockchain _is_. 99% of everybody know it as a buzzword, and don't actually have any idea how one is constructed.