You don’t mean average

Statisticians cringe when people toss around the term “average“. “Average” has a narrow application: predicting the sum-total, in the long run, with many repeated instances on the way. But people use the average to predict what the next event will be — and average just doesn’t work that way. Local and recent trends in data aren’t reflected in averages — e.g. predicting tomorrow’s weather based on the annual average; and tiny-seeming changes in averages can mean huge changes in the real world (think: global warming).

What people really need when trying to detect change in the short term is the median.  Unfortunately, it is easier for software to compute “average” than “median”: add up the numbers, divide by the count. So common software tools have always provides handy AVERAGE functions. AVERAGE gives you a pretty number. MEDIAN gives you something you can make a decision on.

I love SQL for the same reason I like bit twiddling — inherent parallelism. But”median” is just not something basic SQL was built to do. Suppose I have a table List(num INTEGER). The SQL for MEDIAN looks like:

SELECT (  (SELECT MIN(num) FROM List X
           WHERE (SELECT COUNT(*) FROM List Y WHERE Y.num <= X.num)
                  >= (SELECT COUNT(*) FROM List) / 2)
        + (SELECT MAX(num) FROM List X
           WHERE (SELECT COUNT(*) FROM List Y WHERE Y.num >= X.num)
                  >= (SELECT COUNT(*) FROM List) / 2)
       ) / 2

Ew — not even considering execution time. The good news is, there is a common “special” case in the real world: small numbers are more common than large numbers. That’s typical for any population that divvies up a limited resource. For positive numbers, this leads to the situation that AVG(num) and 1/AVG(1/num) neatly bracket the median; it is very closely approximated by the geometric mean of the two. The SQL for this is somewhat smaller:

SELECT SQRT(SUM(num) / SUM(1.0/num))
Advertisements

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s