Machine Learning – How to Set Prior for Covariate Coefficients in Bayesian Regression

bayesian, classification, feature-selection, feature-engineering, machine-learning

I have a data set of around 1 million rows and around 30 possible features. My main objective is to build a classification model that predicts probabilities for an output variable of interest. It is tempting to just train an ensemble model (random forest, gradient boosting, etc.) and drop or re-engineer features until the validation loss is as small as possible before applying the model to a test set. I would like to bring the number of features down further; I have already used mutual information to trim the candidate features down to around 30.
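
For reference, the mutual-information screen was done along these lines. This is only a sketch: it assumes scikit-learn's `mutual_info_classif` (the question doesn't name a library), and the data below are synthetic stand-ins for my actual feature matrix and labels.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# synthetic stand-ins for the real ~1M-row feature matrix and binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

selector = SelectKBest(mutual_info_classif, k=30)   # keep the 30 highest-MI features
X_top = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)           # indices of the retained columns
```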

However, in the application area I'm working on, Bayesian statistics has a long tradition. I would like to pursue a Bayesian approach if possible, using a probit model. To be explicit, let $S$ be my output, where $S$ is either 0 or 1. I will model $S$ as Bernoulli with probability $p = \Phi\left(\beta_0 + \sum_{i=1}^n \beta_i x_i\right)$, where $\Phi$ is the standard normal CDF, $n$ is the number of features, $\beta_i$ are the coefficients, and $x_i$ are the feature values. Here are the challenges I face in pursuing a Bayesian regression approach:

  1. The number of features is too large. Doing MCMC in 30 dimensions might not be practical.
  2. I have no idea what to use for the prior distribution of $\beta_0,\dots,\beta_n$.

Here are my questions:

  1. I have read that it is common to use an independent Gaussian prior for each of the coefficients. Do I just set the mean to 0 and the standard deviation to 1? Given that my prior has infinite support, and that I have a lot of data, the data should then overwhelm the influence of the prior, correct?
  2. How can I deal with reducing the number of features in this Bayesian setting? Suppose that, from training an ensemble classifier, I was able to trim the number of features down to, say, 10. Can I then use those features for my Bayesian approach? This does not violate the Bayesian philosophy, because my choice of prior is still uninfluenced by my data, correct?
  3. If the dimensionality of the features is still an issue, are there well-known closed-form expressions for the MAP of the coefficients in probit regression?

Hoping for some suggestions/insights.
Thanks!

Best Answer

  1. I imagine that with 1 million data points, the prior information is more or less negligible, especially if the model has no complex structure (e.g. hierarchical structure). The likelihood should -- assuming the model is simple -- overwhelm the prior.

  2. Why remove features at all? Do you have some other constraint to satisfy? Given the number of observations versus the number of features, I don't see the need to drop any. Additionally, MCMC in 30 dimensions is not really a problem (I routinely fit models with 10x that number of parameters) -- see the first sketch after this list. The bottleneck would be the likelihood evaluation over your 1 million rows.

  3. If you're going to use MAP, you might as well just fit a logistic regression with an l2 or l1 penalty: an l1 penalty corresponds to MAP estimation under a Laplace prior, and an l2 penalty to MAP under a Gaussian prior, so you'd be saving yourself time and trouble (see the second sketch below). I say this because inference is not your main objective, so uncertainty estimation is moot.
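
For concreteness, here is a minimal sketch of point 2: a Bayesian probit model with independent N(0, 1) priors on the coefficients. It assumes PyMC (tooling not specified in the question), presupposes standardized features, and uses synthetic stand-in data in place of the real ~1M x 30 design matrix.

```python
import numpy as np
import pymc as pm

# synthetic stand-ins; the real data would have ~1M rows and ~30 features
rng = np.random.default_rng(0)
n_obs, n_feat = 20_000, 30
X = rng.normal(size=(n_obs, n_feat))            # assumes features are already standardized
true_beta = rng.normal(size=n_feat)
y = (X @ true_beta + rng.normal(size=n_obs) > 0).astype(int)

with pm.Model() as probit_model:
    beta0 = pm.Normal("beta0", mu=0.0, sigma=1.0)               # intercept prior
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=n_feat)   # independent N(0, 1) priors
    eta = beta0 + pm.math.dot(X, beta)                          # linear predictor
    p = pm.math.invprobit(eta)                                  # probit link: p = Phi(eta)
    pm.Bernoulli("S", p=p, observed=y)

    # NUTS copes fine with ~31 parameters; with 1M rows the per-iteration
    # likelihood/gradient evaluation is what dominates the run time.
    idata = pm.sample(draws=1000, tune=1000, chains=4)
```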
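
And a sketch of the shortcut in point 3, assuming scikit-learn: an l2-penalized logistic regression, i.e. the MAP of a logit model under a Gaussian prior (switch to `penalty="l1"` with `solver="saga"` for the Laplace-prior analogue). Again, `X` and `y` are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-ins for the real feature matrix and binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 30))
y = (X[:, :5].sum(axis=1) + rng.normal(size=20_000) > 0).astype(int)

# C is the inverse regularization strength; smaller C = tighter prior around zero
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]   # predicted probabilities for the positive class
```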
