Statisticians cringe when people toss around the term “average“. “Average” has a narrow application: predicting the sum-total, in the long run, with many repeated instances on the way. But people use the average to predict what the next event will be — and average just doesn’t work that way. Local and recent trends in data aren’t reflected in averages — e.g. predicting tomorrow’s weather based on the annual average; and tiny-seeming changes in averages can mean huge changes in the real world (think: global warming).
What people really need when trying to detect change in the short term is the median. Unfortunately, it is easier for software to compute “average” than “median”: add up the numbers, divide by the count. So common software tools have always provides handy AVERAGE functions. AVERAGE gives you a pretty number. MEDIAN gives you something you can make a decision on.
I love SQL for the same reason I like bit twiddling — inherent parallelism. But”median” is just not something basic SQL was built to do. Suppose I have a table List(num INTEGER). The SQL for MEDIAN looks like:
SELECT ( (SELECT MIN(num) FROM List X WHERE (SELECT COUNT(*) FROM List Y WHERE Y.num <= X.num) >= (SELECT COUNT(*) FROM List) / 2) + (SELECT MAX(num) FROM List X WHERE (SELECT COUNT(*) FROM List Y WHERE Y.num >= X.num) >= (SELECT COUNT(*) FROM List) / 2) ) / 2
Ew — not even considering execution time. The good news is, there is a common “special” case in the real world: small numbers are more common than large numbers. That’s typical for any population that divvies up a limited resource. For positive numbers, this leads to the situation that AVG(num) and 1/AVG(1/num) neatly bracket the median; it is very closely approximated by the geometric mean of the two. The SQL for this is somewhat smaller:
SELECT SQRT(SUM(num) / SUM(1.0/num))