On several occasions I have been asked the question:
What is Big-Data?
It has come both from students and from relatives who are picking up on the buzz around statistics and ML.
I found this CV-post, and I agree with the only answer there.
The Wikipedia page also has some comments on it, but I am not sure if I really agree with everything there.
EDIT: (I feel that the Wikipedia page falls short in explaining the methods for tackling Big-Data and the paradigm I mention below).
I recently attended a lecture by Emmanuel Candès, where he introduced the Big-Data paradigm as
Collect data first $\Rightarrow$ Ask questions later
This is the main difference from hypothesis-driven research, where you first formulate a hypothesis and then collect data to say something about it.
He spent a lot of time on the problem of quantifying the reliability of hypotheses generated by data snooping. My main takeaway from his lecture was that we really need to start controlling the false discovery rate (FDR), and he presented the knockoff method as a way to do so.
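To make the multiple-comparison concern concrete, here is a minimal sketch of what goes wrong when many hypotheses are screened after the data are collected, and how a false discovery rate (FDR) procedure reins it in. Note that this uses the Benjamini-Hochberg step-up procedure as a simple stand-in; it is not the knockoff method from the lecture. The simulation settings (1000 hypotheses, 50 true effects, effect size 4, target level q = 0.1) and all function names are assumptions chosen purely for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvals, q=0.1):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    # Find the largest rank whose p-value falls under the line q * rank / m.
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])


def simulate_pvalues(m=1000, n_signal=50, effect=4.0, seed=0):
    """Hypothetical setup: the first n_signal tests carry a real effect."""
    rng = random.Random(seed)
    z = [rng.gauss(effect if i < n_signal else 0.0, 1.0) for i in range(m)]
    return [two_sided_p(v) for v in z]


def false_discovery_share(rejected, n_signal=50):
    """Fraction of rejected hypotheses that are actually null."""
    if not rejected:
        return 0.0
    return sum(1 for i in rejected if i >= n_signal) / len(rejected)


pvals = simulate_pvalues()
naive = [i for i, p in enumerate(pvals) if p <= 0.05]  # no correction
bh = benjamini_hochberg(pvals, q=0.1)                  # FDR-controlled

print("naive:", len(naive), "rejections, false share",
      round(false_discovery_share(naive), 3))
print("BH:   ", len(bh), "rejections, false share",
      round(false_discovery_share(bh), 3))
```

Running this compares the share of false discoveries among the uncorrected rejections with the share among the Benjamini-Hochberg rejections. The procedure bounds that share in expectation at the chosen level q for independent tests, so any single simulated run can land above or below it.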
I think that CV should have a question on what Big-Data is and how you define it. There are so many different "definitions" that it is hard to really grasp what it is, or to explain it to others, without a general consensus on what it consists of.
The "definition/paradigm/description" provided by Candès is the one I come closest to agreeing with. What are your thoughts?
EDIT2: I feel that the answer should provide something more than just an explanation of the data itself. It should be a combination of data/methods/paradigm.
EDIT3: I feel that this interview with Michael Jordan could bring something to the table as well.
EDIT4: I decided to choose the highest voted answer as the correct one, although I think that all the answers add something to the discussion. Personally, I feel that this is more a question of a paradigm of how we generate hypotheses and work with data. I hope this question will serve as a pool of references for those who go looking for what Big-Data is. I also hope that the Wikipedia page will be changed to further emphasize the multiple comparison problem and control of the FDR.
Best Answer
I had the pleasure of attending a lecture given by Dr. Hadley Wickham, of RStudio fame. He defined it such that
Hadley also believes that most data can at least be reduced to manageable problems, and that only a very small amount is actually true big data. He calls this the "Big Data Mirage".
Slides can be found here.