Is Sampling Relevant in the Era of Big Data?

data-mining, large-data, sampling

Or, more to the point, will it remain relevant? Big Data makes statistics and relevant knowledge all the more important, but it seems to underplay Sampling Theory.

I've seen the hype around 'Big Data' and can't help but wonder why I would want to analyze everything. Wasn't there a reason for "Sampling Theory" to be designed/implemented/invented/discovered? I don't get the point of analyzing the entire 'population' of the dataset. Just because you can do it doesn't mean you should (stupidity is a privilege, but you shouldn't abuse it 🙂).

So my question is this: is it statistically relevant to analyze the entire data set? At best, analyzing everything only eliminates the error you would have incurred by sampling. But is the cost of eliminating that error really worth it? Is the "value of information" really worth the effort, time, and cost that go into analyzing big data on massively parallel computers?

Even if one analyzes the entire population, the outcome is still at best a guess with a higher probability of being right, probably only a bit higher than with sampling (or would it be a lot higher?). Would the insight gained from analyzing the population differ widely from the insight gained from analyzing a sample?

Or should we simply accept that "times have changed"? Sampling as an activity could become less important given enough computational power 🙂

Note: I'm not trying to start a debate; I'm looking for an answer that helps me understand why big data does what it does (i.e. analyze everything) and disregards the theory of sampling (or does it?).

Best Answer

In a word, yes. I believe there are still clear situations where sampling is appropriate, within and without the "big data" world, but the nature of big data will certainly change our approach to sampling, and we will use more datasets that are nearly complete representations of the underlying population.

On sampling: Depending on the circumstances, it will almost always be clear whether sampling is appropriate. Sampling is not an inherently beneficial activity; it is just what we do because we need to make tradeoffs around the cost of collecting data. We are trying to characterize populations and need to select the appropriate method for gathering and analyzing data about the population. Sampling makes sense when the marginal cost of a method of data collection or data processing is high. Trying to reach 100% of the population is not a good use of resources in that case, because you are often better off addressing things like non-response bias than making tiny improvements in the random sampling error.
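To make that last point concrete, here is a minimal Python sketch; the population standard deviation and the non-response bias are invented purely for illustration. Random sampling error shrinks as 1/√n, so once the sample is moderately large, a fixed systematic bias dominates any further gains from collecting more data.

```python
import math

# Illustrative numbers only (assumptions, not from any real survey):
sigma = 10.0             # population standard deviation of the quantity of interest
nonresponse_bias = 0.5   # systematic error that extra data volume cannot fix

for n in [100, 10_000, 1_000_000, 100_000_000]:
    sampling_se = sigma / math.sqrt(n)                       # random sampling error of the mean
    total_rmse = math.sqrt(sampling_se**2 + nonresponse_bias**2)  # bias and variance combined
    print(f"n={n:>11,}  sampling SE={sampling_se:.4f}  total RMSE={total_rmse:.4f}")
```

Past a few thousand observations, the total error is essentially pinned at the bias floor; that is the sense in which chasing the last few percent of coverage is a poor use of resources.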

How is big data different? "Big data" addresses many of the same questions we've had for ages, but what's "new" is that the data collection happens off an existing, computer-mediated process, so the marginal cost of collecting data is essentially zero. This dramatically reduces our need for sampling.

When will we still use sampling? If your "big data" population is the right population for the problem, then you will only employ sampling in a few cases: when you need to run separate experimental groups, or when the sheer volume of data is too large to capture and process (many of us can handle millions of rows of data with ease nowadays, so the boundary here keeps moving further out). If it seems like I'm dismissing your question, it's probably because I've rarely encountered situations where the volume of the data was a concern in either the collection or processing stages, although I know many have.
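When volume genuinely is the constraint, sampling can happen on the fly. The sketch below uses reservoir sampling (Algorithm R), a standard technique rather than anything specific to this answer, to draw a uniform random sample from a stream too large to hold in memory.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown
    length, keeping only k items in memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1); every item seen so far
            # ends up in the reservoir with equal probability.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 "rows" from a stream of 10 million without materializing it.
print(reservoir_sample(range(10_000_000), k=5, seed=42))
```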

The situation that seems hard to me is when your "big data" population doesn't perfectly represent your target population, so the tradeoffs are more apples-to-oranges. Say you are a regional transportation planner, and Google has offered to give you access to its Android GPS navigation logs to help you. While the dataset would no doubt be interesting to use, the population would probably be systematically biased against low-income residents, public-transportation users, and the elderly. In such a situation, traditional travel diaries sent to a random sample of households, although costlier and smaller in number, could still be the superior method of data collection. But this is not simply a question of "sampling vs. big data"; it's a question of which population, combined with the relevant data collection and analysis methods you can apply to that population, will best meet your needs.
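A toy simulation makes the point; every trip number below is invented for illustration. The huge, convenient dataset converges very precisely to the wrong answer, while the small random sample is noisier but centered on the truth.

```python
import random

random.seed(1)

# Hypothetical region: 70% of trips are by car, 30% by transit or walking.
# We want the true mean trip distance across everyone.
N_CAR, N_OTHER = 700_000, 300_000
car_trips   = [random.gauss(15.0, 5.0) for _ in range(N_CAR)]    # km, assumed
other_trips = [random.gauss(4.0, 2.0)  for _ in range(N_OTHER)]  # km, assumed
population  = car_trips + other_trips
true_mean   = sum(population) / len(population)

# "Big data": a huge pile of GPS logs drawn almost entirely from drivers
# with smartphones, systematically missing transit users and the elderly.
big_biased = random.sample(car_trips, 500_000)
big_est = sum(big_biased) / len(big_biased)

# Traditional approach: a small simple random sample of all households.
small_random = random.sample(population, 1_000)
small_est = sum(small_random) / len(small_random)

print(f"true mean:            {true_mean:6.2f} km")
print(f"biased big-data mean: {big_est:6.2f} km  (n=500,000)")
print(f"small random sample:  {small_est:6.2f} km  (n=1,000)")
```

With these made-up numbers, the biased estimate sits near 15 km no matter how many records it contains, while the random sample of 1,000 will typically land within a few tenths of a kilometer of the true average of roughly 11.7 km: more data from the wrong population does not buy you accuracy.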