Big Data Explained – What Exactly Is It?


On several occasions I have been asked the question:

What is Big-Data?

It has come both from students and from relatives who are picking up on the buzz around statistics and ML.

I found this CV-post, and I agree with the only answer there.

The Wikipedia page also has some comments on it, but I am not sure if I really agree with everything there.

EDIT: (I feel that the Wikipedia page falls short in explaining the methods for tackling this and the paradigm I mention below).

I recently attended a lecture by Emmanuel Candès, where he introduced the Big-Data paradigm as

Collect data first $\Rightarrow$ Ask questions later

This is the main difference from hypothesis-driven research, where you first formulate a hypothesis and then collect data to say something about it.

He spent a lot of time on the problem of quantifying the reliability of hypotheses generated by data snooping. My main takeaway from the lecture was that we really need to start controlling the false discovery rate (FDR), and he presented the knockoff method as a way to do so.
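To make "controlling the FDR" a bit more concrete, here is a minimal sketch of the classical Benjamini-Hochberg step-up procedure. Note that this is not the knockoff method from the lecture; it is the simpler textbook procedure, which is known to control the FDR when the p-values are independent or positively dependent. The function name, the level `q`, and the example p-values are my own illustrative choices.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if it is rejected at FDR level q.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure,
    not the knockoff method.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank

    # Step-up rule: reject every hypothesis ranked at or below k_max,
    # even if some of them missed their own threshold.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values from ten hypothetical tests on the same data set.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, q=0.05))
```

With these made-up numbers only the two smallest p-values are rejected, whereas testing each hypothesis separately at 0.05 would have flagged five. That gap is the multiple comparisons issue that "ask questions later" runs into.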

I think that CV should have a question on what Big-Data is and how you would define it. There are so many different "definitions" that it is hard to really grasp what it is, or to explain it to others, without a general consensus on what it consists of.

Of what I have seen so far, the "definition/paradigm/description" provided by Candès is the one I agree with most. What are your thoughts?

EDIT2: I feel that the answer should provide something more than just an explanation of the data itself. It should be a combination of data/methods/paradigm.

EDIT3: I feel that this interview with Michael Jordan could bring something to the table as well.

EDIT4: I decided to accept the highest-voted answer as the correct one, although I think that all the answers add something to the discussion, and I personally feel that this is more a question of the paradigm by which we generate hypotheses and work with data. I hope this question will serve as a pool of references for those who go looking for what Big-Data is. I also hope that the Wikipedia page will be changed to further emphasize the multiple comparisons problem and the control of the FDR.

Best Answer

I had the pleasure of attending a lecture given by Dr. Hadley Wickham, of RStudio fame. He defined it as follows:

  • Big Data: Can't fit in memory on one computer: > 1 TB
  • Medium Data: Fits in memory on a server: 10 GB - 1 TB
  • Small Data: Fits in memory on a laptop: < 10 GB
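As a rough sketch, the three tiers above can be written as a small lookup. The function name is hypothetical, and two details are my own assumptions rather than anything from the lecture: decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and treating exactly 10 GB and exactly 1 TB as "medium".

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def size_category(n_bytes):
    """Classify a data set size using the thresholds quoted above."""
    if n_bytes < 0:
        raise ValueError("size must be non-negative")
    if n_bytes < 10 * GB:
        return "small"   # fits in memory on a laptop
    if n_bytes <= 1 * TB:
        return "medium"  # fits in memory on a server
    return "big"         # does not fit in memory on one computer


print(size_category(4 * GB))    # small
print(size_category(250 * GB))  # medium
print(size_category(3 * TB))    # big
```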

Hadley also believes that most data can at least be reduced to manageable problems, and that only a very small fraction is truly big data. He calls this the "Big Data Mirage".

  • 90% Can be reduced to a small/medium data problem with subsetting/sampling/summarising
  • 9% Can be reduced to a very large number of small data problems
  • 1% Is irreducibly big
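To illustrate the sampling route in the 90% case, here is a minimal sketch of reservoir sampling (Algorithm R), which draws a fixed-size uniform random sample in a single pass over a stream that never has to fit in memory. This is my own illustration, not something taken from the slides; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Only the k sampled items are held in memory, so the stream itself
    can be arbitrarily large (for example, lines read lazily from a file).
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot uniformly from 0..i (inclusive); the new item
            # replaces a reservoir entry only if the slot falls inside it.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a large data set: one million integers, generated lazily.
subset = reservoir_sample(range(1_000_000), k=100, seed=0)
print(len(subset))
```

The resulting sample is small data in the sense above, so it can be explored with ordinary in-memory tools. Whether a sample is good enough depends on the question being asked, for example rare events may need a different reduction.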

Slides can be found here.