Solved – What are some good examples of exploratory data analysis today

data mining, data transformation, data visualization, descriptive statistics, exploratory-data-analysis

Are there published papers that illustrate EDA being used to tackle substantial data problems? I am particularly looking for actual (current) data examples, where plots have been made and statistics computed that reveal things in the data that we would not have been able to detect otherwise, or with models. Here are a couple of examples of what I am interested in finding. Both show things that were discovered in data by making plots. I'd also be interested in discoveries made by rough calculations of the kind Tukey used to do, e.g. median polish, rather than by fitting models, where lots of assumptions are required.
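For readers who have not met it, median polish can be sketched in a few lines. This is a minimal illustration, not Tukey's own presentation: the function name `median_polish`, the stopping rule, and the small table are all made up for the example. It repeatedly sweeps row and column medians out of a two-way table, so that a single unusual cell is left standing in the residuals.

```python
from statistics import median

def median_polish(table, max_iter=10):
    """Split a two-way table into overall + row + column effects + residuals."""
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the median out of every row and into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the median out of every column and into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [resid[i][j] - col_med[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full pass changes nothing.
        if all(m == 0 for m in row_med) and all(m == 0 for m in col_med):
            break
    return overall, row_eff, col_eff, resid

# Invented table: additive (row + column) except for the bottom-right cell.
table = [[1, 2, 3],
         [2, 3, 4],
         [3, 4, 15]]
overall, row_eff, col_eff, resid = median_polish(table)
for row in resid:
    print(row)
```

Because medians resist outliers, the odd cell does not distort the fitted row and column effects, which is what makes the residual table worth looking at.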

This is an old example, from a data set on tipping in restaurants; see the introduction of the ggobi book for the full example,

[Figure: histogram of tip amounts drawn with a small bin width]

with the observation that "many diners round tips to the nearest $1 and 50c value". The peaks in the histogram with the small bandwidth occur at intervals too regular to be due to chance. Hand et al. found similar behavior when mining a large credit card data set, in customers purchasing petrol in the UK. They followed up the discovery by setting up a model with multiple components, one capturing the rounding behavior and another following a more regular distribution.
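The same kind of rounding can be checked with a rough calculation as well as a plot. The sketch below is only an illustration: the helper `cents_profile` is hypothetical, and the amounts are synthetic (a smooth spread of values plus copies rounded to the nearest 50c), not the tipping data from the ggobi book. It counts amounts by their cents part in narrow bins, which is roughly what a small-bandwidth histogram shows.

```python
from collections import Counter

def cents_profile(amounts, bin_cents=10):
    """Count amounts by the cents part of the value, in bins of `bin_cents`."""
    counts = Counter()
    for amount in amounts:
        cents = round(amount * 100) % 100
        counts[cents // bin_cents * bin_cents] += 1
    return [counts.get(start, 0) for start in range(0, 100, bin_cents)]

# Synthetic amounts (an assumption for illustration, not real tips):
# an even spread of values, plus a copy of each rounded to the nearest 50c.
smooth_cents = list(range(103, 900, 7))
rounded_cents = [50 * round(c / 50) for c in smooth_cents]
tips = [c / 100 for c in smooth_cents + rounded_cents]

profile = cents_profile(tips)
for start, count in zip(range(0, 100, 10), profile):
    print(f".{start:02d}-.{start + 9:02d}  {'#' * count}")
```

With wide bins (say, whole dollars) the two spikes at .00 and .50 would be averaged away, which is why the bin width matters for this kind of discovery.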

See the Hyndsight blog for an example using recently released unemployment statistics. This is the critical picture:

[Figure: plot of the unemployment statistics from the Hyndsight blog]

with the observation "that there is something different about Aug this year." The most plausible explanation is a change in the way the unemployment data are being collected.

Best Answer

One example I enjoy (and one that is a simple illustration) is the work by Michael Maltz on analyzing the Uniform Crime Reports that police agencies supply to the FBI. See:

Maltz, M. D. (2010). Look before you analyze: Visualizing data in criminal justice. In Piquero, A. and Weisburd, D., editors, Handbook of Quantitative Criminology, chapter 3, pages 25-52. Springer New York, New York, NY.

For some background, the FBI does not have a standardized way to flag missing or incomplete reports (data are collected monthly, so an agency could report for some months but not the entire year). An uncritical analyst would therefore observe zeroes or very low numbers for a particular jurisdiction and not suspect missing data, e.g. see the numbers for Florida in Parker & Pruitt (2000). So there is quite a bit of precedent in the criminology literature for modelling these data without discovering such errors.
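A rough screen for this problem might look like the sketch below. To be clear, this is not Maltz's method (he argues for plotting the series); the function `suspect_zero_months`, the threshold `min_typical`, and the monthly counts are all invented for illustration. The idea is simply that a zero is suspicious when the same agency usually reports a substantial count.

```python
from statistics import median

def suspect_zero_months(monthly_counts, min_typical=10):
    """Return indices of zero months that look like missing reports.

    A zero is flagged only when the agency's typical (median) non-zero
    month is at least `min_typical`; for small agencies a true zero is
    plausible, so nothing is flagged.
    """
    reported = [c for c in monthly_counts if c > 0]
    if not reported or median(reported) < min_typical:
        return []
    return [i for i, c in enumerate(monthly_counts) if c == 0]

# Invented monthly counts for two hypothetical agencies.
large_agency = [41, 38, 44, 0, 0, 0, 39, 47, 42, 0, 40, 43]
small_agency = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0]

print(suspect_zero_months(large_agency))
print(suspect_zero_months(small_agency))
```

A screen like this only points at months worth plotting and checking by hand; it cannot tell a true zero from an unreported month.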

Here is a good example from blogs discussing published papers:

Uri Simonsohn on the Data Colada blog and Felix Schönbrodt on a failed replication in psychology and how ceiling effects of the instrument are not an issue. Here are the images of the original and replication ECDFs from the Data Colada blog:
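For anyone unfamiliar with ECDFs, here is a minimal sketch of how one is computed and how it relates to a ceiling check. The ratings are invented on an assumed 1 to 7 scale and are not the data from either study; the helper names `ecdf` and `share_at_ceiling` are hypothetical.

```python
def ecdf(values):
    """Return (value, cumulative proportion) pairs for the empirical CDF."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def share_at_ceiling(values, scale_max):
    """Proportion of responses at the top of the scale."""
    return sum(v >= scale_max for v in values) / len(values)

# Invented ratings on an assumed 1-7 scale (not the studies' data).
sample_a = [4, 5, 5, 6, 6, 7, 7, 7]
sample_b = [6, 7, 7, 7, 7, 7, 7, 7]

for name, sample in [("A", sample_a), ("B", sample_b)]:
    print(name, ecdf(sample))
    print(name, "share at ceiling:", share_at_ceiling(sample, scale_max=7))
```

Plotting the two ECDFs on the same axes shows the whole distribution at once, so a pile-up at the top of the scale appears as a large jump at the final value rather than being hidden inside a mean.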

There are also some good examples on this site. I thought I had a good example here but a few others that I really enjoyed are:

Improving data analysis through a better visualization of data?
Which permutation test implementation in R to use instead of t-tests (paired and non-paired)? A terrific quote by G. Jay Kerns there: "In my opinion, these data are a perfect (?) example that a well chosen picture is worth 1000 hypothesis tests. We don't need statistics to tell the difference between a pencil and a barn."
This is a bit more of a contentious one which I might rename to If a statistician were in a cave her whole life and then one day was shown a scatterplot what would she see?

I realize these aren't published, but I think they are illustrative nonetheless. I'm sure you could dig up more on this site as well.
