Solved – Simple example of how “Bayesian Model Averaging” actually works

bayesian, machine-learning, mean, model, probability

I'm trying to follow this tutorial on Bayesian Model Averaging by putting it in the context of machine learning and the notation it generally uses, i.e.:

X_train: Training Array; dims = $(n, m)$;

y_train: Target Vector; dims = $(n,)$; the correct values that you fit against with the Training Array;

x: input vector of attributes for a sample; dims = $(m,)$; and

y: output prediction value; dims = $(1,)$, a scalar (kept scalar for simplicity).

These are all described below in the context of Bayesian…


The source describes this as a class of models indexed by $m$:
$$P(y| x,\theta, m)$$
$\theta$ : Set of model parameters;

$m$ : The model index in a set of models


Bayesian Model Averaging:

$$P(y|x,D) = \int P(y|x,D,m)\,P(m|x,D)\,dm$$

where:

$x$ : Input Data : $(n_{test}, m)$ shaped input array (rows = samples, cols = attributes);

$y$ : Output Prediction : $(n_{test},)$ length output vector of predictions based on $x$;

$D$ : Training Data : a tuple containing (i) an $(n_{train}, m)$ array (rows = samples, cols = attributes); and (ii) an $(n_{train},)$ length vector containing the actual values/categories described by the training array

(please let me know if this is confusing and I will elaborate)

Expanding the first factor by integrating over the model parameters:
$$P(y|x,D,m) = \int P(y|x,\theta,m)\,P(\theta|D,m)\,d\theta$$
($y$ and $x$ are conditionally independent of $D$ given $\theta$ and $m$, which is why $D$ drops out of $P(y|x,\theta,m)$ inside the integral.)

The video says that this averages the probabilities predicted by each of the models. The weights you average with are the posterior probabilities $P(m|x,D)$ of each model $m$ given $D$.

My confusion:

Can someone please describe how this is averaging over models? Do you end up with a posterior that is created with all of the models? Where does the prior go in this context?

How does integrating over all the models average them? From what I remember, integrating gives you the area under a curve, but in statistics I often hear the term "summing/integrating out" parameters/variables. What does that mean exactly?

Please provide a simple example so I can understand how this works 🙂 It will definitely be useful for people trying to understand how Bayesian Model Averaging works exactly. I will put a link to this on that video because I know other people were confused as well.

Best Answer

I think it might help to think of this as a two-level "meta-model". You have some collection of individual models (indexed by $m$), and then you have a meta-model, which is a distribution over the individual models (or equivalently, a distribution over values of $m$).

You can think about the model averaging as working in two steps:

  • First, you get the posterior predictive distribution for each model $m$ by integrating out its model-specific parameters $\theta$:

$$ P(y|x, D, m) = \int P(y|x, D, \theta, m)P(\theta| D, m)d\theta $$

  • Then you get the posterior predictive distribution for the meta-model, now integrating out the distribution over the models:

$$ P(y|x,D) = \int P(y|x, D, m)P(m|x, D)dm $$

Then in the machine learning context you would make predictions about $y$ based on its posterior predictive distribution given the observed covariates $x$.
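To make those two steps concrete, here is a minimal sketch in Python. Everything in it is made up purely for illustration: the coin-flip data, the two candidate models (a "fair" model with no free parameter and a "biased" model with a Beta(1, 1) prior on the heads probability $\theta$), and the equal prior over models; also, the weights here do not depend on $x$. It specializes the integral over $m$ to a sum over the two models.

```python
import numpy as np
from scipy.special import betaln

# Made-up data D: 10 coin flips, 7 heads (purely illustrative).
heads, tails = 7, 3
n = heads + tails

# Two candidate models:
#   "fair":   theta fixed at 0.5, so there is nothing to integrate out
#   "biased": theta ~ Beta(1, 1) prior with a Bernoulli likelihood

# Step 1: per-model posterior predictive P(y = 1 | D, m),
# with theta integrated out where the model has one.
pred = np.array([
    0.5,                       # fair coin
    (heads + 1) / (n + 2),     # Beta-Bernoulli posterior predictive
])

# Step 2: posterior model weights P(m | D) proportional to P(D | m) * P(m),
# where P(D | m) is the marginal likelihood (theta already integrated out).
log_ml = np.array([
    n * np.log(0.5),                               # fair coin
    betaln(heads + 1, tails + 1) - betaln(1, 1),   # biased coin
])
log_prior_m = np.log(np.array([0.5, 0.5]))         # equal prior over the two models
log_w = log_prior_m + log_ml
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()

# Model-averaged prediction: the expectation of P(y = 1 | D, m) under P(m | D).
p_y1 = weights @ pred
print(weights, p_y1)
```

The same two steps apply when the individual models are, say, regressions fit on different feature subsets; only the per-model predictives and marginal likelihoods become harder to compute.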

To answer your question, the second step is where the model averaging happens. When you "integrate out" or "sum out" a parameter (incidentally, you can think of these as the same operation for continuous and discrete distributions respectively), that's equivalent to taking the expected value of some quantity (i.e. averaging it) over that parameter. In this case, you're taking the expected value of the posterior density of $y$, which is the definition of a posterior predictive distribution.

As for priors, you're going to have two sets of them in this model: a prior for each model $m$, and a prior for the meta-model over different $m$. They will factor into determining the posterior distributions over parameters that we've integrated out (i.e. $P(\theta|D,m)$ and $P(m|x,D)$).
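Concretely, in the usual formulation where the model weights do not depend on the test input $x$, both priors enter through Bayes' rule:

$$P(m|D) \propto P(m)\,P(D|m), \qquad P(D|m) = \int P(D|\theta, m)\,P(\theta|m)\,d\theta,$$

so the within-model prior $P(\theta|m)$ appears inside the marginal likelihood $P(D|m)$, and the meta-model prior $P(m)$ multiplies it.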

I will point out that in this model the authors have apparently specified that the posterior over $m$ might depend on the test predictors $x$, but the posterior over $\theta$ does not. That is, $x$ might influence how you weight the different models, but not how you weight the parameters of each individual model. I don't think that's a crazy choice, but it's not the only way to do this.

Okay. An example. I can't think of a machine learning example that's simple, but here's an easier textbook statistics example. In this model the individual models are going to be normal distributions with a fixed variance $\sigma^2$ and a random mean $\mu$. The collection of distributions (the meta-model) is over different values of $\sigma^2$. So here $\theta = \mu$ and $m = \sigma^2$. The standard prior for $\mu|\sigma^2$ is a normal distribution, and the prior over $\sigma^2$ is an inverse-gamma distribution. You can show that the posterior predictive distribution of $y$, with $\mu$ integrated out but $\sigma^2$ held fixed, is another normal distribution with its mean pulled in the direction of the sample mean. Then you integrate out (model average) $\sigma^2$, and the posterior predictive distribution becomes a Student-t distribution over $y$. Essentially, you get something that looks kind of like a normal distribution, but it has fat tails because you've averaged over different possibilities for the variance.
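If it helps to see that last integration numerically, here is a small Monte Carlo sketch. The hyperparameter values are arbitrary, and for simplicity I take the predictive given $\sigma^2$ to be $N(\mu_0, \sigma^2)$, ignoring the extra predictive variance contributed by the remaining uncertainty in $\mu$; the point is just that averaging normal predictives over an inverse-gamma distribution on $\sigma^2$ gives the fat-tailed Student-t.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta, mu0 = 3.0, 6.0, 1.0   # arbitrary "posterior" hyperparameters

# Model averaging by Monte Carlo: draw sigma^2 from its inverse-gamma
# distribution, then draw y from the normal predictive for that sigma^2.
sigma2 = stats.invgamma(a=alpha, scale=beta).rvs(size=200_000, random_state=rng)
y = rng.normal(mu0, np.sqrt(sigma2))

# Analytically, this mixture is a Student-t with 2*alpha degrees of freedom,
# location mu0 and scale sqrt(beta/alpha) -- fatter tails than any one normal.
t_pred = stats.t(df=2 * alpha, loc=mu0, scale=np.sqrt(beta / alpha))
print(np.mean(np.abs(y - mu0) > 3))   # empirical tail mass beyond +/- 3
print(2 * t_pred.sf(mu0 + 3))         # matching Student-t tail mass
```

The two printed numbers should agree closely, and both are noticeably larger than the tail mass of any single normal you averaged over, which is the "fat tails" point in the paragraph above.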