Context:
My response looks like a mixture model with two classes as you can see on the picture.
I have a couple of predictors that perform relatively well in a linear regression (Bayesian or not). In the Bayesian context I am using MCMC sampling with stan like this:
\begin{align}
\beta \sim {\rm Student}(7, 0, 20)& \\
\alpha \sim \mathcal{N}(0, 1)& \\
\sigma \sim \mathcal{N}(0, 1)& \\
y|X \sim \mathcal{N}(X\beta + \alpha, \sigma)&
\end{align}
where $X$ are my predictors.
Here is an excerpt of the code in stan
:
library(rstanarm)
model.glm <- stan_glm(y~poly(x1,4)+I(x2-x3), data=data, subset=train_index,
family=gaussian(link="identity"), prior=student_t(7,0,20),
chains=5)
As you can imagine, my posterior is going to look like a normal distribution, which is confirmed by this chart:
predict <- posterior_predict(model.glm,data[-train_index])
ppc_dens_overlay(data[-train_index]$y,predict[1:300,])
Problem:
I would like my posterior to show the mixture model. However, I am having some issue to model it as I am fairly new to Bayesian stats.
Question:
How do you model a mixture model with predictor in MCMC sampling?
Progress so far:
I thought that I could use a multinomial prior (it could be binomial for my case but if I can make it generic why not!) with two classes, but then I am not sure where to go from there. This is the start that I tried to model but got stuck.
\begin{align}
\mu \sim {\rm Multinomial}(\tau, \gamma)& \\
X_j \sim \mathcal{N}(\mu_i, \sigma\star)& \\
Y|X \sim \mathcal{N}(X\beta, \sigma)&
\end{align}
Best Answer
Likelihood
For a mixture of two Gaussians, the likelihood can be written as: $$ y_i \sim \pi N(y_i|\alpha_0 + x_i\beta, \sigma_0) + (1-\pi) N(y_i|\alpha_1 + x_i\beta, \sigma_1) $$ where $\pi \in [0, 1]$.
This is fine, but having two components in the likelihood makes sampling more difficult. A trick when dealing with mixture models is to augment the model with indicator variables that indicate to which class an observation belongs. So, for example, $\delta_i=0$ if the observation belongs to the first class, and $\delta_i=1$ if the observation belongs to the second class. If $p(\delta_i=0)=\pi$, the likelihood could be written as $$ y_i |\delta_i \sim \left[N(y_i|\alpha_0 + x_i\beta, \sigma_0)\right]^{1-\delta_i} \times \left[N(y_i|\alpha_1 + x_i\beta, \sigma_1)\right]^{\delta_i}, $$ and marginalizing out $\delta_i$ would lead to the recovery of the original likelihood.
Priors
In the model below, $\sigma^2_0$ and $\sigma^2_1$ have reference priors. Normal priors aren't the best choice for $\sigma^2_0$ and $\sigma^2_1$ because the normal distribution has support on the real line, but the scale parameters can only take on positive values.
Priors: \begin{align*} \alpha_0 & \sim N(0, \tau_{\alpha_0}^2) \\ \alpha_1 & \sim N(0, \tau_{\alpha_1}^2) \\ \beta & \propto 1 \\ p(\sigma_0) & \propto \frac{1}{\sigma_0^2} \\ p(\sigma_1) & \propto \frac{1}{\sigma_1^2} \\ \pi & \sim Unif(0, 1) \qquad \text{i.e. } Beta(1, 1). \end{align*}
MCMC Sampling
The joint distribution up to a proportionality constant is given by \begin{align*} p(\alpha_0, \alpha_1, \beta, \sigma_0^2, \sigma_1^2 | \cdot) \propto & \ \exp\left( \frac{-\alpha_0^2}{2\tau_{\alpha_0}^2} \right) \exp\left( \frac{-\alpha_1^2}{2\tau_{\alpha_1}^2} \right) \frac{1}{\sigma_0^2} \frac{1}{\sigma_1^2} \\ & \times \prod_{i=1}^n \left[ \frac{1}{\sqrt{\sigma_0^2}} \exp\left( \frac{-(y_i - (\alpha_0 + x_i\beta))^2}{2 \sigma_0^2} \right)\right]^{1-\delta_i} \left[ \frac{1}{\sqrt{\sigma_1^2}} \exp\left( \frac{-(y_i - (\alpha_1 + x_i\beta))^2}{2 \sigma_1^2} \right)\right]^{\delta_i} \end{align*}
After some algebra it's possible to find the conditional distributions of the parameters. In this case, all the full conditionals have closed forms, so a Gibbs sampler can be used to get draws from the joint posterior.
Full conditionals
\begin{align*} \sigma_0^2 | \cdot &\sim IG \left( \frac{n_0}{2}, \frac{1}{2} \sum_{i|\delta_i=0} \left( y_i - (\alpha_0 + x_i\beta) \right)^2 \right) \\ \sigma_1^2 | \cdot &\sim IG \left( \frac{n_1}{2}, \frac{1}{2} \sum_{i|\delta_i=1} \left( y_i - (\alpha_1 + x_i\beta) \right)^2 \right) \\ \end{align*} where $i|\delta_i=0$ is used to denote the set of $i$ such that $\delta_i=0$, and $n_0$ is the count of the $\delta_i$ where $\delta_i=0$. The same type of notation is used for $i|\delta_i=1$ and $n_1$.
Conditional on the $\delta_i$, the posterior distribution for $\beta$ is \begin{align*} \beta | \cdot & \sim N(m, s^2) \\ \text{with} & \\ m & =\left( \sum_{i|\delta_i=0} x_i^2 \sigma_1^2 + \sum_{i|\delta_i=1} x_i^2 \sigma_0^2\right)^{-1} \left( \sigma_1^2 \sum_{i|\delta_i=0}(y_i x_i - \alpha_0 x_i) + \sigma_0^2 \sum_{i|\delta_i=1}(y_i x_i - \alpha_1 x_i) \right) \\ s^2 & = \frac{\sigma_0^2 \sigma_1^2}{\sum_{i|\delta_i=0} x_i^2 \sigma_1^2 + \sum_{i|\delta_i=1} x_i^2 \sigma_0^2} \end{align*}
The conditional distributions for $\alpha_0$ and $\alpha_1$ are also normal \begin{align*} \alpha_0 & \sim N\left((\sigma_0^2 + n_0 \tau_0^2)^{-1} \tau_0^2 \sum_{i|\delta_i=0}(y_i - x_i \beta), \, \frac{\tau_0^2 \sigma_0^2}{\sigma_0^2 + n_0 \tau_0^2} \right) \\ \alpha_1 & \sim N\left((\sigma_1^2 + n_1 \tau_1^2)^{-1} \tau_1^2 \sum_{i|\delta_i=1}(y_i - x_i \beta), \, \frac{\tau_1^2 \sigma_1^2}{\sigma_1^2 + n_1 \tau_1^2} \right). \end{align*}
The indicator variables for the class membership also need to be updated. These are Bernoulli with probabilities proportional to \begin{align*} p(\delta_i=0|\cdot) & \propto N(y_i|\alpha_0 + x_i \beta, \, \sigma_0^2) \\ p(\delta_i=1|\cdot) & \propto N(y_i|\alpha_1 + x_i \beta, \, \sigma_1^2). \\ \end{align*}
Results
The MCMC predictions are bimodal as intended
Here's the inference on the posterior distributions of the parameters, with the true values shown by the vertical red lines
A couple comments
I suspect you know this, but I wanted to emphasize that the model I've shown here only has a single regression coefficient $\beta$ for both classes. It might not be reasonable to assume that both populations respond to the covariate in the same way.
There are no restrictions on $\alpha_0$ and $\alpha_1$ in the prior specification, so in many cases there will be identifiability issues which lead to label switching. As the MCMC runs, $\alpha_0$ may sometimes be larger than $\alpha_1$, and other times $\alpha_1$ may be larger than $\alpha_0$. The changing values of $\alpha$ will affect the $\delta_i$, causing them to swap labels from 0 to 1 and vice versa. These identifiability issues aren't a problem as long as your interest is only in the posterior predictive or inference on $\beta$. Otherwise changes may need to made in the prior, for example by forcing $\alpha_0 \leq \alpha_1$.
I hope this is helpful. I included the code I used. I believe this can be done in Stan easily as well, but I haven't used Stan in a while so I'm not sure. If I have time later I may look into it.
Edit: Results using Stan
I added some code for a similar model using Stan in case that is useful. Here's the same plot using the Stan model: