Solved – Famous statistical wins and horror stories for teaching purposes

data visualizationexperiment-designmathematical-statisticsteaching

I am designing a one year program in data analysis with a local community college. The program aims to prepare students to handle basic tasks in data analysis, visualization and summarization, advanced Excel skills and R programming.

I would like to prepare a set of short, real world examples that illustrate where ordinary intuition fails and statistical analysis is necessary. I'm also interested in "famous statistical fails", but more interested in the wins. The data involved should be freely available.

A perfect example of what I'm looking for is the Berkeley discrimination case, which illustrates Simpson's paradox. The data for that is memorialized in R's datasets.

Historical cases are also interesting. John Snow's analysis of the Broad Street pump data is a good example of the power of visualization.

There are a lot of fails in the collection of data (selection bias), etc. and the literature in medical statistics is full of them.

A lot of "statistical wins" occur in the area of variable selection and sampling design. I'm interested in paradoxes that occur in other areas — like the analysis as such.

Best Answer

Benford's law:

Described here. Digits do not appear with uniform frequency in front of numbers, but rather follow a specific pattern: digit 1 is the most likely to be the first digit, with 30% chance, followed by 2 (17.6% chance), and so on. The following picture (from Wikipedia) shows the frequency of each digit at the beginning of each number, in some naturally-occurring datasets:

Frequency of each digit at the beginning of each number, in some naturally-occurring datasets

There are certain conditions under which the law holds (e.g., the data should span across several scales, so stuff like peoples' heights are not eligible), but it is quite generic.

Perhaps the most surprising application is in fraud detection. This is based on the assumption that people who try to fabricate figures tend to distribute the digits uniformly, thus violating Benford's law.

I recall once I was explaining this to a class, and in the break one of the students came up with an accounting spreadsheet from his company, in which he had tried to validate my claims. It worked :)

Zipf's law

Described here: The frequency of a word in a corpus is inversely proportional to its rank. What is surprising is that this relationship holds for any corpus, even for ancient languages that have not yet been translated. An interesting video explaining more about why this pattern may hold is here. The following picture shows rank (horizontal) vs frequency (vertical) in a log-log scale for the first 10 million words in 30 Wikipedias (source). Note that the law would predict a straight line: Rank vs Frequency

These two laws are powerful and counter-intuitive, and in the sense that they enhance one's understanding of the world via statistics, they could be called "statistical wins".