I did something similar recently: not so much trying to convince others as doing a small project that gave me a taste of Bayesian model averaging (BMA). I generated a dataset with a binary response, three independent variables that had an effect on the response, and seven variables that had no effect on it. I then compared the BMA results to the frequentist estimates from logistic regression. At least in this case, the BMA approach appears to work quite well. If you want to make it more accessible, you can always give the variables meaningful names instead of the generic $X$ and $y$.
The R code I used for this is presented below. Hope it can inspire you!
# The sample size
n <- 100
# The 'true' coefficient vector
Beta <- cbind(c(-1.5, 0.45, -3))
# Generate the explanatory variables which have an effect on the outcome
set.seed(1)
X <- cbind(rnorm(n, 0, 1), rnorm(n, 4, 2), rnorm(n, 0.5, 1))
# Convert this into probabilities
prob <- 1/(1+exp(-X %*% Beta))
# Generate some uniform numbers. If the elements are smaller than the corresponding elements in the prob vector, then return 1.
set.seed(2)
runis <- runif(n, 0, 1)
y <- ifelse(runis < prob, 1, 0)
# Add the nonsense variables
X <- cbind(X, rpois(n, 3)) # Redundant variable 1 (x4)
X <- cbind(X, rexp(n, 10)) # Redundant variable 2 (x5)
X <- cbind(X, rbeta(n, 3, 10)) # Redundant variable 3 (x6)
X <- cbind(X, rbinom(n, 10, 0.5)) # Redundant variable 4 (x7)
X <- cbind(X, rpois(n, 40)) # Redundant variable 5 (x8)
X <- cbind(X, rgamma(n, 10, 20)) # Redundant variable 6 (x9)
X <- cbind(X, runif(n, 0, 1)) # Redundant variable 7 (x10)
# The BMA
library(BMA)
model <- bic.glm(X, y, glm.family="binomial", factor.type=FALSE, thresProbne0 = 5, strict = FALSE)
# The frequentist model
model2 <- glm(y~X, family = "binomial")
# Save the current plotting settings so they can be restored afterwards
old.par <- par(no.readonly = TRUE)
par(mar = c(3, 2, 3, 1.5))
# Plot the posterior distributions of the coefficients under BMA
plot(model, mfrow = c(2, 5))
par(old.par)
# Posterior means, SDs and posterior inclusion probabilities
summary(model)
# The frequentist estimates for comparison
summary(model2)
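To make the comparison concrete, you can line up the BMA posterior means and posterior inclusion probabilities against the frequentist estimates side by side. A minimal sketch, assuming the usual components of a bic.glm fit (postmean includes the intercept as its first element, and probne0 is reported in percent):
# Line up BMA posterior means and inclusion probabilities (probne0, in
# percent) against the frequentist point estimates for x1 through x10
comparison <- cbind(BMA.postmean = model$postmean[-1],
                    BMA.probne0  = model$probne0,
                    glm.estimate = coef(model2)[-1])
rownames(comparison) <- paste0("x", 1:10)
round(comparison, 2)
The three truly relevant variables should show inclusion probabilities near 100, while the seven nonsense variables should be shrunk toward zero.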
I think it might help to think of this as a two-level "meta-model". You have some collection of individual models (indexed by $m$), and then you have a meta-model, which is a distribution over the individual models (or equivalently, a distribution over values of $m$).
You can think about the model averaging as working in two steps:
- First, you get the posterior predictive distribution for each model $m$ by integrating out its model-specific parameters $\theta$:
$$ P(y|x, D, m) = \int P(y|x, D, \theta, m)P(\theta| D, m)d\theta $$
- Then you get the posterior predictive distribution for the meta-model, now integrating out the distribution over the models:
$$ P(y|x,D) = \int P(y|x, D, m)P(m|x, D)dm $$
Then in the machine learning context you would make predictions about $y$ based on its posterior predictive distribution given the observed covariates $x$.
To answer your question, the second step is where the model averaging happens. When you "integrate out" or "sum out" a parameter (you can think of these as the same operation for continuous and discrete distributions, respectively), you are taking the expected value of some quantity, i.e. averaging it, over that parameter. In this case, you're taking the expected value of the posterior density of $y$, which is the definition of a posterior predictive distribution.
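To see that the second step really is just an average, here is a toy numerical sketch (not from any paper; the numbers are made up) with two candidate models and a discrete posterior over them:
# Posterior predictive P(y = 1 | D, m) for two candidate models, with the
# model-specific parameters theta already integrated out within each model
pred.m <- c(0.7, 0.4)
# Posterior probabilities of the models themselves, P(m | D)
post.m <- c(0.8, 0.2)
# The model-averaged posterior predictive is the expectation over m
sum(pred.m * post.m)  # 0.8*0.7 + 0.2*0.4 = 0.64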
As for priors, you're going to have two sets of them in this model: a prior for each model $m$, and a prior for the meta-model over different $m$. They will factor into determining the posterior distributions over parameters that we've integrated out (i.e. $P(\theta|D,m)$ and $P(m|x,D)$).
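Concretely, for a discrete model index both sets of priors enter through Bayes' theorem: the prior $P(\theta|m)$ is integrated against the likelihood to give each model's marginal likelihood, and the prior $P(m)$ weights those marginal likelihoods (I drop the dependence on the test point $x$ here for simplicity):
$$ P(m|D) = \frac{P(D|m)\,P(m)}{\sum_{m'} P(D|m')\,P(m')}, \qquad P(D|m) = \int P(D|\theta, m)\,P(\theta|m)\,d\theta $$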
I will point out that in this model the authors have apparently specified that the posterior over $m$ might depend on the test predictors $x$, but the posterior over $\theta$ does not. That is, $x$ might influence how you weight the different models, but not how you weight the parameters of each individual model. I don't think that's a crazy choice, but it's not the only way to do this.
Okay. An example. I can't think of a machine learning example that's simple, but here's an easier textbook statistics example. In this model the individual models are normal distributions with a fixed variance $\sigma^2$ and a random mean $\mu$. The collection of distributions (the meta-model) is over different values of $\sigma^2$. So here $\theta = \mu$ and $m = \sigma^2$. The standard prior for $\mu|\sigma^2$ is a normal distribution, and the prior over $\sigma^2$ is an inverse-gamma distribution. You can show that the posterior predictive distribution of $y$, after integrating out $\mu$ for a fixed value of $\sigma^2$, is another normal distribution whose mean is pulled in the direction of the sample mean. Then you integrate out (model average over) $\sigma^2$, and the posterior predictive distribution of $y$ becomes a Student-t distribution. Essentially, you get something that looks like a normal distribution, but with fat tails, because you've averaged over different possibilities for the variance.
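If you want to see those fat tails numerically, here is a minimal sketch under the conjugate normal-inverse-gamma assumptions above; the data and hyperparameter values are purely illustrative:
# Illustrative data and hyperparameters: mu | sigma^2 ~ N(mu0, sigma^2/kappa0),
# sigma^2 ~ Inverse-Gamma(a0, b0)
set.seed(3)
y.obs <- rnorm(5, mean = 2, sd = 1.5)
n.obs <- length(y.obs); ybar <- mean(y.obs)
mu0 <- 0; kappa0 <- 1; a0 <- 1; b0 <- 1
# Standard conjugate updates
kappan <- kappa0 + n.obs
mun    <- (kappa0 * mu0 + n.obs * ybar) / kappan
an     <- a0 + n.obs / 2
bn     <- b0 + 0.5 * sum((y.obs - ybar)^2) +
          kappa0 * n.obs * (ybar - mu0)^2 / (2 * kappan)
# Posterior predictive after integrating out both mu and sigma^2:
# a location-scale Student-t with 2*an degrees of freedom
scale.t <- sqrt(bn * (kappan + 1) / (an * kappan))
grid   <- seq(mun - 6, mun + 6, length.out = 200)
dens.t <- dt((grid - mun) / scale.t, df = 2 * an) / scale.t
# A normal with the same location and scale, i.e. what you would get by
# plugging in a single representative value of sigma^2 instead of averaging
dens.n <- dnorm(grid, mun, scale.t)
plot(grid, dens.t, type = "l", ylab = "density")
lines(grid, dens.n, lty = 2)  # the t has noticeably heavier tails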
Best Answer
There are two studies that seem to do exactly what you want to do, both published in top field journals:
The problem you face is that the model space is too large to compute all models in reasonable time. This is a well-known problem in the Bayesian Model Averaging literature, and there are computational solutions for it, such as MCMC sampling over the model space. In R there is, for example, the BMS package, which implements such methods (see its homepage). I'm sure that the above references also provide helpful ideas on how you could solve your problem.
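As a rough sketch of what that looks like in practice with BMS (note that its bms() function is built around Gaussian linear regression, and the data and settings below are purely illustrative, not a recommendation):
# MCMC search over the model space instead of full enumeration.
# bms() expects a data.frame with the response in the first column.
library(BMS)
set.seed(4)
dat <- data.frame(y = rnorm(100), matrix(rnorm(100 * 20), ncol = 20))
fit <- bms(dat, burn = 10000, iter = 50000, mcmc = "bd",
           g = "UIP", mprior = "uniform")
coef(fit)  # posterior inclusion probabilities and model-averaged coefficients
With 20 candidate regressors the full space already has $2^{20}$ models, so the birth-death sampler only visits the region of the space with non-negligible posterior mass.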