Solved – What to take into consideration when using Bayesian methods on big data problems

bayesian, large data, markov-chain-montecarlo, pymc

I was reading the book Bayesian Methods for Hackers by Cameron Davidson-Pilon. He uses PyMC for the examples.

As an experiment, I created a PySpark app from the code example "Inferring Behavior from Text-Message Data" and ran it on some network traffic on my computer, set up as a standalone Spark cluster. It didn't take long (382 ms for 27 MB of raw data). Now I'm considering scaling this method to the full cluster and the whole data stream (24 MB per second).

So my main concern is: what do I have to consider for the implementation?

NOTE: The book gives the following warning under "A Note on Big Data":

Paradoxically, big data's predictive analytic problems are actually
solved by relatively simple algorithms [2][4]. Thus we can argue that
big data's prediction difficulty does not lie in the algorithm used,
but instead on the computational difficulties of storage and execution
on big data. (One should also consider Gelman's quote from above and
ask "Do I really have big data?" ).

The much more difficult analytic problems involve medium data and,
especially troublesome, really small data. Using a similar argument as
Gelman's above, if big data problems are big enough to be readily
solved, then we should be more interested in the not-quite-big enough
datasets.

I thought Davidson-Pilon's warning about using Bayesian methods with big data would show up as a performance problem, but it didn't. So, what do I have to take into consideration when I apply Bayesian methods to a big data problem?

(I reviewed the references and actually I think my situation qualifies as a medium data problem.)

Best Answer

Author here. There are a few points I can elaborate on:

  1. Your data is going to be distributed across many computers in a real cluster, so each computer has only a fraction of the data. Locally, then, the computations should run fine. But the next question is: how do you combine all these inferences? Sure, you could just merge the traces from each computer, but then you haven't gained anything from having all this extra data.

  2. As the amount of data increases, the posterior (typically) becomes more and more peaked and narrow, and the rest of the space becomes much flatter. This means the MCMC algorithm will have a hard time finding the location of the posterior (assuming it starts far away), which really affects convergence.

  3. My note also applies to the number of unknown variables: with more variables, the space the posterior lives in has higher and higher dimension, so convergence of the MCMC is less assured. Typically, big data comes with more unknown variables.
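Points 1 and 2 can be made concrete with a small sketch. It is not taken from the book or from PyMC; it assumes a toy Bernoulli-rate model with a uniform Beta(1, 1) prior, where the posterior is available in closed form, and the names `beta_posterior_sd`, `sd_for`, and `TRUE_RATE` are hypothetical. It shows how the posterior narrows as the number of observations grows, and how much wider the posterior is on a machine that sees only a tenth of the data.

```python
import math


def beta_posterior_sd(successes, trials, prior_a=1.0, prior_b=1.0):
    """Standard deviation of the Beta posterior for a Bernoulli rate.

    Assumes a Beta(prior_a, prior_b) prior, so the posterior is
    Beta(prior_a + successes, prior_b + failures).
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


TRUE_RATE = 0.3  # hypothetical rate, used only to build idealized counts


def sd_for(n):
    # Idealized data: exactly TRUE_RATE * n successes, so the result is deterministic.
    return beta_posterior_sd(round(TRUE_RATE * n), n)


# Point 2: the posterior narrows roughly like 1 / sqrt(n).
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  posterior sd={sd_for(n):.6f}")

# Point 1: a machine holding one tenth of the data has a wider posterior.
full_sd = sd_for(1_000_000)
shard_sd = sd_for(100_000)
print(f"shard sd / full sd = {shard_sd / full_sd:.2f}")
```

In this toy model the shard posterior is wider than the full-data posterior by a factor of roughly the square root of 10. A trace pooled from the shards is a mixture of the shard posteriors, so it is at least about as wide as a single shard's posterior, which is the sense in which simply merging traces gains nothing from the extra data.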
