Quicksort has a reputation for speed. In school, you learn that it is **O(n log n)** … the optimum for all single-threaded, comparison-based sorts. In fact, when combined with median-of-three pivot selection, tiny-partition insertion-sort, and special equal-keys handling (qv Top 3 Quicksort optimizations) it is faster than its closest comparison-based competitors: Merge sort and Heapsort. (That doesn’t apply to near-sorted input, but that’s another story). Timsort makes the best of Merge sort, with similar extensions, to cover more edge cases than the above quicksort add-ons.

But it puzzled me that Quicksort was often much faster than, say, a heap-with-hole sort — an optimization of simple Heapsort. Then, data-oriented-design ideas rang a bell: real-world keys are often complex data structures: pointers to key-parts scattered across the heap. Getting all the parts from RAM to cache is a dominating cost. And here are some pertinent features of Quicksort:

- all comparisons are against one (pivot) key. No matter how scattered the pivot key parts are, in RAM, it is guaranteed they are in L1 cache.
- as Quicksort recurses into smaller and smaller partitions, it eventually reaches one small enough that all scattered key parts are in cache. From there on down, every child recursion will be using keys completely in cache. This is repeatedly true as Quicksort works its way down through the sizes of L3/L2/L1 cache; and applies to the bottom-level insertion sorts.

Next up: how radix sort can beat Quicksort on its home (cache) turf, for real-world (complex) keys.