Solved – Big Data vs multiple hypothesis testing

causality, correlation, hypothesis-testing, large-data, multiple-comparisons

Nate Silver, in his excellent "The Signal and the Noise", warned that we are too much in awe of Big Data, and that Big Data predictions in many fields have been disastrous (financial markets and economics, to name just two). With more data, you get more spurious correlations, more false positives, and more erroneous answers. He also leans on the excellent work of Ioannidis, who indicated that over two-thirds of scientific findings are wrong because they cannot be replicated (based on extensive reviews of working papers). In other words, watch out for the many traps of multiple hypothesis testing, especially when you have not even phrased a hypothesis to begin with. "Correlation does not entail causation" still prevails.
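To make the multiple-comparisons trap concrete, here is a minimal simulation sketch (Python with numpy and scipy; the sample size and number of candidate predictors are arbitrary choices for illustration). It correlates a purely random outcome with many purely random predictors and counts how many pass an uncorrected 5% significance test by chance alone.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_predictors = 100, 1000             # arbitrary sample size and number of candidate variables

    y = rng.normal(size=n)                  # outcome: pure noise
    X = rng.normal(size=(n, n_predictors))  # predictors: also pure noise

    # Test every predictor against the outcome at an uncorrected 5% level.
    pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)])
    print("spurious 'discoveries':", (pvals < 0.05).sum())  # expect roughly 50 out of 1000

With 1,000 tests at the 5% level, you expect around 50 "significant" findings from noise alone, which is the mechanism behind both the spurious-correlations warning and the replication problem Ioannidis describes.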

Now, in a new book called Big Data, by Viktor Mayer-Schonberger and Kenneth Cukier, Big Data looks far more promising. Because the sample often amounts to the entire population, you can detect granular relationships between subsets of the data that you never could before. And in this Big Data era, correlation seems to matter more than causation: figuring out which variables are predictive gets you better and richer results than figuring out which ones are truly causal (often an elusive chase). The authors mention several new tools aimed at extracting and analyzing Big Data sets, including neural networks, artificial intelligence, machine learning, and sensitivity analysis, among others. Being unfamiliar with those (and very familiar with traditional statistics, hypothesis testing in particular), I can't judge whether the authors' claim is accurate (they are not quants). Do those techniques truly avoid the traps of spurious correlations, multiple hypothesis testing, model overfitting, and false positive results?

Can you reconcile the two views: Nate Silver's vs. Viktor Mayer-Schonberger's?

Best Answer

This isn't the whole answer, but an important consideration is which part of your data is big.

Consider the following example. I'm doing some analysis on physical measurements of human beings. For each volunteer I measure the distance between the eyes, the length of each digit, the length of the shins, etc., and I record everything in a big table for some exploratory analysis.

If I decide to make my data bigger, I can do one of two things. First, I can make more measurements for each person (i.e. more features). This is dangerous, as it increases the probability of spurious correlations.

Second, if I instead increase the number of instances, this should actually reduce the probability of spurious correlations, and although the correlations found may still not imply causation, they will be more significant.
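A rough way to see the asymmetry between these two ways of making the data bigger is to simulate both (a sketch in Python/numpy; the sizes are arbitrary): hold the number of volunteers fixed while adding random features, then hold the features fixed while adding volunteers, and look at the largest purely spurious correlation in each case.

    import numpy as np

    rng = np.random.default_rng(1)

    def max_spurious_corr(n_instances, n_features):
        """Largest absolute correlation between a random outcome and purely random features."""
        y = rng.normal(size=n_instances)
        X = rng.normal(size=(n_instances, n_features))
        corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
        return max(corrs)

    # More features, same number of people: the best spurious correlation keeps growing.
    for p in (10, 100, 1000):
        print(f"n=50,    features={p:4d} -> max |r| = {max_spurious_corr(50, p):.2f}")

    # More people, same number of features: the spurious correlations shrink toward zero.
    for n in (50, 500, 5000):
        print(f"n={n:5d}, features=  10 -> max |r| = {max_spurious_corr(n, 10):.2f}")

Adding columns inflates the strongest chance correlation, while adding rows shrinks it, which is the sense in which "more data" can either hurt or help depending on which dimension grows.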

This is strongly related to the curse of dimensionality, which tells you that adding features (i.e. dimensions) can cause an exponential increase in the number of instances required to reliably infer things from your data (unless your data has a lower intrinsic dimension, i.e. highly correlated features).
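As a back-of-the-envelope illustration of that exponential blow-up (assuming, purely for illustration, that you want at least one observation in every cell of a grid with 10 bins per feature):

    # Cells needed to cover a grid with 10 bins per feature -- a crude lower bound
    # on the number of instances required to populate the whole feature space.
    bins_per_feature = 10
    for d in (1, 2, 5, 10, 20):
        print(f"{d:2d} features -> {bins_per_feature ** d:,} cells")

Each extra feature multiplies the number of cells by 10, so the instances needed to fill the space grow exponentially unless the features are highly correlated.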

Personally, I see big data as an increase in the number of instances rather than the number of features, but the two are easily conflated, and that is a common source of confusion.
