Solved – Sampling from Bayesian regression predictive posterior

bayesian, logistic, posterior, regression

I have the following problem: I want to obtain a predictive posterior distribution for the target variable $y$ of a logistic regression. That is to say, given a combination of explanatory variables $X$, I want to obtain the conditional distribution $P(y \mid X)$ from the model.

Am I correctly supposing that I need to sample a Bayesian model via MCMC to correctly approximate the predictive posterior? Or is there any easier way? What would the best approach to such a problem be?

Also, given that I effectively have thousands of dummy explanatory variables, is it even feasible to handle such a huge number of parameters with sampling methods?

I would really appreciate any help here.

Best Answer

Am I correctly supposing that I need to sample a Bayesian model via MCMC to correctly approximate the predictive posterior? Or is there any easier way? What would the best approach to such a problem be?

Your tag says 'logistic regression', so I'm going to assume that you are trying to do Bayesian logistic or probit regression through some data augmentation scheme. However, what I'm about to say should apply to any model where conjugacy is not an option and you are forced to turn to MCMC.

The only way to correctly (as you put it) study the posterior-predictive distribution is to take into account uncertainty in the regression coefficients & other parameters. The only clean way to do this after using MCMC to get posterior distributions on the parameters is to use your MCMC samples. Any other approach would be a (potentially gross) mischaracterization of uncertainty. For example, plugging in the posterior means of the regression coefficients & other parameters and then using your $X$ values would give you a distribution on potential $y \mid X$ values, but that distribution would contain no information about our uncertainty in the regression coefficients & other parameters.

It is important that the posterior-predictive variance be inflated by our uncertainty in the parameters. This is seen in the integral $$ \pi(y\mid X, \mathcal{D}) = \int \! \pi(y\mid \theta, X)\, \pi(\theta \mid \mathcal{D})\,d\theta, $$ where $\mathcal{D}$ denotes the observed data. Uncertainty in $\theta$ places non-negligible mass in the tails of $\pi(\theta \mid \mathcal{D})$, and that mass propagates into the tails of $\pi(y \mid X, \mathcal{D})$. Plugging your MCMC samples of the parameters into the likelihood takes this uncertainty into account and approximates the posterior-predictive with comparable Monte Carlo error.
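As a concrete illustration (not from the original answer), here is a minimal sketch of turning MCMC output into posterior-predictive draws for a logistic regression. The names `beta_draws` and `X_new` are placeholders: in practice `beta_draws` would come from your sampler (Stan, PyMC, a hand-rolled data augmentation scheme, etc.), and `X_new` is the design matrix of points you want predictions at.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed inputs (placeholders for real MCMC output and real covariates):
#   beta_draws : (n_draws, p) array of posterior samples of the coefficients
#   X_new      : (n_new, p) design matrix for the points to predict at
n_draws, p, n_new = 4000, 10, 5
beta_draws = rng.normal(size=(n_draws, p))
X_new = rng.normal(size=(n_new, p))

# For each posterior draw, compute P(y = 1 | X, beta) ...
logits = X_new @ beta_draws.T                 # shape (n_new, n_draws)
probs = 1.0 / (1.0 + np.exp(-logits))

# ... and, if you want draws of y itself, sample a Bernoulli outcome per draw.
y_draws = rng.binomial(n=1, p=probs)          # shape (n_new, n_draws)

# Posterior-predictive summaries: averaging over draws is what carries the
# parameter uncertainty into the predictive distribution.
p_mean = probs.mean(axis=1)                   # E[P(y = 1 | X)] per row of X_new
print(p_mean)
```

Averaging `probs` over the draws, rather than plugging posterior-mean coefficients into the sigmoid once, is exactly the step that keeps the parameter uncertainty in the predictive distribution.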

Also, given that I effectively have thousands of dummy explanatory variables, is it even possible to handle such a huge parameter number through sampling methods?

I assume by thousands you mean under billions, in which case I would expect any modern computer to handle a sparse vector inner product efficiently & quickly. But perhaps this is not the case for you, or you just see this as unnecessarily inefficient.
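To give a sense of scale (my own illustration, with made-up sizes): dummy variables make the design matrix mostly zeros, so storing it sparsely keeps the linear predictor cheap even with thousands of columns.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)

n_obs, n_vars = 10_000, 5_000                 # thousands of dummy columns (made-up sizes)

# A sparse 0/1 design matrix: each row has only a handful of active dummies.
density = 10 / n_vars
X = sparse.random(n_obs, n_vars, density=density, format="csr",
                  data_rvs=lambda k: np.ones(k))

beta = rng.normal(size=n_vars)

# X @ beta only touches the nonzero entries, so this is fast despite the
# nominal 10,000 x 5,000 size of the matrix.
eta = X @ beta
print(eta.shape)
```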

One approach to dealing with this in a statistical sense is to perform variable selection on your model; a sketch of one common construction follows below. Selecting the most important variables would (potentially dramatically) reduce the cost of calculating the posterior predictive at each step. Model selection in regression is conceptually straightforward in the Bayesian paradigm, but it is not necessarily quick to describe or implement. I refer you to section 9.3 of this book, but probably any book or outline of Bayesian model selection for regression would suffice.
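One common construction is a spike-and-slab prior, where a binary inclusion indicator multiplies each coefficient. The sketch below is my own illustration (not the book's recipe) of such a model in PyMC for a logistic regression; the data, names, and prior settings are placeholder choices, and with thousands of dummies you would likely want a more specialized sampler than the default compound step PyMC assigns here.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)

# Toy data (stand-ins for the real design matrix and outcomes).
n_obs, n_vars = 500, 20
X = rng.binomial(1, 0.3, size=(n_obs, n_vars)).astype(float)
true_beta = np.zeros(n_vars)
true_beta[:3] = [2.0, -1.5, 1.0]              # only a few columns actually matter
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

with pm.Model() as spike_slab_logistic:
    # Inclusion indicators: gamma_j = 1 means variable j is "in" the model.
    gamma = pm.Bernoulli("gamma", p=0.1, shape=n_vars)
    # Slab: the coefficient value used when the variable is included.
    beta_slab = pm.Normal("beta_slab", mu=0.0, sigma=2.0, shape=n_vars)
    beta = pm.Deterministic("beta", gamma * beta_slab)

    logits = pm.math.dot(X, beta)
    pm.Bernoulli("y_obs", logit_p=logits, observed=y)

    # PyMC assigns NUTS to the continuous parameters and a binary
    # Gibbs/Metropolis step to the discrete indicators.
    idata = pm.sample(draws=1000, tune=1000, chains=2, random_seed=2)

# Posterior inclusion probabilities: average of gamma over the posterior draws.
pip = idata.posterior["gamma"].mean(dim=("chain", "draw")).values
print(pip.round(2))
```

Variables with high posterior inclusion probability are the ones worth keeping, which then shrinks the design matrix you carry into the posterior-predictive calculation.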
