Big Data Analysis – How to Draw Valid Conclusions from Big Data

data mining, dataset, large data, validation

"Big data" is everywhere in the media. Everybody says that "big data" is the big thing for 2012, e.g. KDNuggets poll on hot topics for 2012. However, I have deep concerns here. With big data, everybody seems to be happy just to get anything out. But aren't we violating all classic statistical principles such as hypothesis testing and representative sampling?

As long as we only make predictions about the same data set, this should be fine. So if I use Twitter data to predict Twitter user behavior, that is probably okay. However, using Twitter data to predict e.g. elections completely neglects the fact that Twitter users are not a representative sample of the whole population. Plus, most methods will not actually be able to differentiate between a true "grassroots" mood and a campaign, and Twitter is full of campaigns. So when analyzing Twitter, you quickly end up just measuring campaigning and bots. (See for example "Yahoo Predicts America's Political Winners", which is full of poll bashing and claims that "sentiment analysis is much better". They predicted that "Romney has over a 90 percent likelihood of winning the nomination, and of winning the South Carolina primary"; he actually got 28% in that primary, while Gingrich got 40%.)

Do you know of other such big data fails? I roughly remember one scientist predicting that you could not maintain more than 150 friendships. He had actually only discovered a hard cap in Friendster …

As for Twitter data, or indeed any "big data" collected from the web, I believe people often introduce additional bias through the way they collect it. Few will have all of Twitter. Most will have a certain subset they spidered, and that is just yet another bias in their data set.

Splitting the data into a training and test set, or doing cross-validation, likely doesn't help much: the held-out set will have the same bias. And with big data, I need to "compress" my information so heavily that I'm rather unlikely to overfit anyway.
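To illustrate why held-out evaluation doesn't catch this, here is a minimal sketch (assuming numpy and scikit-learn, with a made-up toy population in which the collection mechanism only ever observes part of the feature range): cross-validation inside the biased sample looks excellent, while the same model fails badly on the population it never saw.

```python
# Sketch: cross-validation on a biased sample cannot reveal the sampling bias,
# because every fold carries the same bias. Toy data, illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical "true" population: the relationship changes across the feature range.
def population(n):
    x = rng.uniform(0, 10, size=n)
    y = np.where(x < 5, 2 * x, 10 - x) + rng.normal(0, 0.5, size=n)
    return x.reshape(-1, 1), y

X_pop, y_pop = population(100_000)

# Biased "big data" collection: we only ever observe x < 5
# (think: only the users a crawler happens to reach).
mask = X_pop[:, 0] < 5
X_biased, y_biased = X_pop[mask], y_pop[mask]

model = LinearRegression()

# Cross-validation entirely inside the biased sample.
cv_r2 = cross_val_score(model, X_biased, y_biased, cv=5).mean()

# Performance on the full population, which the sample never represented.
model.fit(X_biased, y_biased)
pop_r2 = model.score(X_pop, y_pop)

print(f"CV R^2 on biased sample: {cv_r2:.2f}")   # looks great
print(f"R^2 on full population : {pop_r2:.2f}")  # far worse, even negative
```

Both folds inherit the collection bias, so no amount of held-out evaluation within the sample can expose it.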

I recently heard the joke about the big data scientist who discovered there are approximately six sexes in the world… and I can just imagine this happening… "Male, Female, Orc, Furry, Yes and No".

So what methods do we have to get some statistical validity back into the analysis, in particular when trying to predict something outside of the "big data" dataset?

Best Answer

Your fears are well founded and perceptive. Yahoo and probably several other companies are doing randomized experiments on users, and doing it well. But observational data are fraught with difficulties. It is a common misconception that the problems diminish as the sample size increases. This is true for variance, but bias stays constant as n increases. When the bias is large, a very small truly random sample or randomized study can be more valuable than 100,000,000 biased observations.
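To make that last point concrete, here is a minimal simulation (assuming only numpy; the population, the cutoff at 45, and the sample sizes are invented for illustration): the small random sample is noisy only through variance, which shrinks with n, while the enormous biased sample stays off by a constant amount no matter how many observations it contains.

```python
# Sketch: variance shrinks with n, bias does not. Toy numbers, illustration only.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in population (scaled down from "100,000,000 observations" to keep memory modest).
population = rng.normal(loc=50, scale=10, size=10_000_000)
true_mean = population.mean()

# Tiny but truly random sample: unbiased, only affected by sampling variance.
random_sample = rng.choice(population, size=1_000, replace=False)

# Huge but biased sample: the collection mechanism over-represents large values
# (e.g., the loudest users a crawler happens to reach).
biased_sample = population[population > 45][:5_000_000]

print(f"true mean            : {true_mean:.2f}")
print(f"random, n=1,000      : {random_sample.mean():.2f}")   # close to the truth
print(f"biased, n=5,000,000  : {biased_sample.mean():.2f}")   # off by a constant bias
```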
