Marginalizing out discrete response variables in Stan

bayesian, discrete data, missing data, mixture-distribution, stan

There's been quite a bit of discussion and confusion about how to marginalize out discrete response variables in Stan (e.g. binary or ordinal data). See, for instance:

However, I don't think there is, as yet, a clear answer, so I want to open the discussion more broadly here.

Model

For this example, we'll use a binary logistic regression of response variable $y$ on a metric predictor $x$, for data points $i \in \{1, \dots, N\}$:

$$
\begin{aligned}
y_i &\sim \text{Bernoulli}(\eta_i) \\
\eta_i &= \text{logit}^{-1}(\alpha + \beta x_i) \\
\alpha &\sim \mathcal{N}(0, 5) \\
\beta &\sim \mathcal{N}(0, 1)
\end{aligned}
$$

Now, suppose that some of the $y_i$ are missing completely at random with probability $\rho$; denote them $y_{i}^{miss}$. In this case, the missing values can safely be ignored in the likelihood, because we can learn everything we need about the model parameters from the observed $y$ alone, i.e. $p(y \mid \eta, y^{miss}) = p(y \mid \eta)$.
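One way to see why (a standard missing-completely-at-random argument, not specific to this example): writing $m_i$ for the missingness indicator (the missing variable in the code below), the joint density of the observed data factorizes as

$$
p(y^{obs}, m \mid \alpha, \beta, \rho) = p(m \mid \rho) \prod_{i:\, m_i = 0} \text{Bernoulli}(y_i \mid \eta_i),
$$

so the posterior for $(\alpha, \beta)$ depends on the data only through the observed $y_i$, and $\rho$ is informed only by the missingness indicators.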

However, for convenience, let's say someone wants to keep the missing outcome variables in their model, and marginalize them out of the likelihood.

The likelihood for an observed data point is just $p(y_{i} \mid \eta_i) = \text{Bernoulli}(y_{i} \mid \eta_{i})$ as above. But when a data point is missing, there are two possibilities: either the value is 1 or it is 0.

Option 1:

The accepted answer here, and the discussion here, suggest that, when a data point is missing, we need a statement that is equivalent to $p(y_{i}^{miss} = 1 \mid \eta_{i})\, p(\eta_{i}) + p(y_{i}^{miss} = 0 \mid \eta_{i})\, p(\eta_{i})$, which evaluates to $\eta_i \eta_i + (1 - \eta_{i})(1 - \eta_{i})$.

In Stan, this is what the following line is doing, on the log scale:

// note: eta[i] here is on the probability scale; the full program below
// keeps eta on the logit scale and applies inv_logit explicitly
target += log_mix(
  eta[i],
  bernoulli_lpmf(1 | eta[i]),
  bernoulli_lpmf(0 | eta[i])
);
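For reference, Stan's log_mix(lambda, lp1, lp2) is defined as $\log\!\big(\lambda\, e^{lp_1} + (1-\lambda)\, e^{lp_2}\big)$, so with $\lambda = \eta_i$ and the two Bernoulli log-probabilities this statement adds

$$
\log\!\big(\eta_i\,\eta_i + (1 - \eta_i)(1 - \eta_i)\big) = \log\!\big(\eta_i^2 + (1 - \eta_i)^2\big)
$$

to the target, i.e. the log of the expression above.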

I simulated some data in R from the following function:

set.seed(12345)

# Simulate N Bernoulli responses from the logistic regression model above,
# plus a missing-completely-at-random indicator with missingness rate rho
gen_bernoulli_data <- function(N, alpha, beta, rho){
  x <- rnorm(N)
  eta <- alpha + beta * x
  y <- rbinom(N, 1, prob = plogis(eta))
  missing <- rbinom(N, 1, prob = rho)
  return(data.frame(y = y, missing = missing, x = x))
}

N <- 1000
alpha <- 0
beta <- 1
rho <- 0.5
d <- gen_bernoulli_data(N, alpha, beta, rho)
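As a quick sanity check on the simulated data (not part of the original post), the empirical missingness rate should be close to $\rho$ and a complete-case glm fit should roughly recover $\alpha$ and $\beta$:

# Sanity checks on the simulation
mean(d$missing)                            # should be close to rho = 0.5
coef(glm(y ~ x, family = binomial,
         data = subset(d, missing == 0)))  # should be close to alpha = 0, beta = 1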

I then fit the model using cmdstanr:

bernoulli_model <- cmdstanr::cmdstan_model("discrete-missing-bernoulli.stan")

fit <- bernoulli_model$sample(
  data = list(
    N = N, 
    y = ifelse(d$missing, -1000, d$y),  # -1000 is an arbitrary placeholder; missing y are never read by the model
    x = d$x, 
    missing = d$missing, 
    do_marginalization = 1
  ), 
  parallel_chains = 4
)

with the following Stan program:

data{
  int<lower=1> N;
  int<upper=1> y[N];
  vector[N] x;
  int<lower=0, upper=1> missing[N];
  int<lower=0, upper=1> do_marginalization;
}

transformed data{
  // missing_idx[n] is the position of observation n among the missing
  // observations (and 0 for observed data points)
  int<lower=0> missing_idx[N];
  int<lower=0> N_miss = sum(missing);

  for(n in 1:N) missing_idx[n] = missing[n] * sum(missing[1:n]);
}

parameters{
  real alpha;
  real beta;
  real<lower=0, upper=1> rho;
}

transformed parameters{
  vector[N] eta;    // linear predictor, on the logit scale
  real lp[N_miss];  // marginalized log-likelihood contribution of each missing point
  eta = alpha + beta * x;
  for(n in 1:N){
    if(missing[n]){
      real lp10[2];
      // equivalent to log_mix(inv_logit(eta[n]), ...) from the snippet above
      lp10[1] = log(inv_logit(eta[n])) + bernoulli_logit_lpmf(1 | eta[n]);
      lp10[2] = log1m(inv_logit(eta[n])) + bernoulli_logit_lpmf(0 | eta[n]);
      lp[missing_idx[n]] = log_sum_exp(lp10);
    }
  }
}

model{
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  rho ~ beta(1, 1);
  missing ~ bernoulli(rho);
  
  for(n in 1:N){
    if(!missing[n])
      target += bernoulli_logit_lpmf(y[n] | eta[n]);
    else{
      if(do_marginalization) target += lp[missing_idx[n]];
    }
  }
}
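After sampling, the posterior can be summarized with cmdstanr's summary method, e.g. (a sketch; the exact numbers will depend on the run):

# Posterior summaries for the parameters of interest
fit$summary(variables = c("alpha", "beta", "rho"))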

Fitting with the marginalization switched on (do_marginalization = 1) results in the following output:

[plot: posterior estimates with marginalization enabled]

As you can see, $\beta$ is massively inflated, although it looks like $\alpha$ is estimated accurately, with quite a bit of uncertainty (which we might expect). Switching off the marginalization to only evaluate the observed data points results in the following, more accurate plot:

[plot: posterior estimates with marginalization switched off]
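For completeness, here is a sketch of how that second run can be produced, reusing the same data list and only flipping the do_marginalization flag (fit_no_marg is just a placeholder name):

fit_no_marg <- bernoulli_model$sample(
  data = list(
    N = N,
    y = ifelse(d$missing, -1000, d$y),
    x = d$x,
    missing = d$missing,
    do_marginalization = 0   # drop the marginalization terms for the missing y
  ),
  parallel_chains = 4
)
fit_no_marg$summary(variables = c("alpha", "beta", "rho"))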

Option 2

From this discussion, there is some suggestion that Option 1 is wrong, and that the correct way to marginalize is to introduce a new vector, call it $\boldsymbol{\theta}$, where $\theta_{i}$ is a free parameter for the probability that missing data point $i$ equals $1$ (defined only for the missing data points).

Now, this feels weird to me, because it seems to say that we need to evaluate the statement $p(y) = p(y=1)\, p(y = 1 \mid \eta) + p(y=0)\, p(y = 0 \mid \eta) = \theta \eta + (1 - \theta)(1 - \eta)$, but that implies that $p(y=1) \neq \eta$ and $p(y=0) \neq 1 - \eta$, which is wrong.

Or does this imply that we need a statement like $p(y_i=1, \eta) = p(\eta \mid y_i=1)\, p(y_i=1)$ and $p(y_i=0, \eta) = p(\eta \mid y_i=0)\, p(y_i=0)$?

In either case, this version seems to work. Using the following Stan code:

data{
  int<lower=1> N;
  int<upper=1> y[N];
  vector[N] x;
  int<lower=0, upper=1> missing[N];
  int<lower=0, upper=1> do_marginalization;
}

transformed data{
  int<lower=0> missing_idx[N];
  int<lower=0> N_miss = sum(missing);

  for(n in 1:N) missing_idx[n] = missing[n] * sum(missing[1:n]);
}

parameters{
  real alpha;
  real beta;
  real<lower=0, upper=1> rho;
  real<lower=0, upper=1> theta[N_miss];
}

transformed parameters{
  vector[N] eta;    // linear predictor, on the logit scale
  real lp[N_miss];  // marginalized log-likelihood contribution of each missing point
  eta = alpha + beta * x;
  for(n in 1:N){
    if(missing[n]){
      real lp10[2];
      // the mixing weight is now the free parameter theta rather than inv_logit(eta)
      lp10[1] = log(theta[missing_idx[n]]) + bernoulli_logit_lpmf(1 | eta[n]);
      lp10[2] = log1m(theta[missing_idx[n]]) + bernoulli_logit_lpmf(0 | eta[n]);
      lp[missing_idx[n]] = log_sum_exp(lp10);
    }
  }
}

model{
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  rho ~ beta(1, 1);
  theta ~ beta(1, 1);
  missing ~ bernoulli(rho);
  
  for(n in 1:N){
    if(!missing[n]){
      target += bernoulli_logit_lpmf(y[n] | eta[n]);
    }
    else{
      if(do_marginalization) target += lp[missing_idx[n]];
    }
  }
}

we recover the parameters, with estimates essentially identical to those from the version of the model without the marginalization:

[plot: posterior estimates from the Option 2 model]

However, this feels odd to me, because while I'd expect to recover the parameters, I might expect there to be more uncertainty when marginalizing over the missing data, especially in the intercept. Is this intuition correct?

Summary

Can anyone shed light on why the second option above works, and provide the mathematical intuition?

EDIT: I should add that there is consensus on how to marginalize out missing discrete covariates. In that case, where $y$ is observed and there is a discrete covariate $x \in \{0, 1\}$, we want $p(y) = p(y \mid x = 0)\, p(x = 0) + p(y \mid x=1)\, p(x=1)$, which follows from the law of total probability. For instance, see https://elevanth.org/blog/2018/01/29/algebra-and-missingness/
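On the log scale, as it would be accumulated in a Stan target, that covariate marginalization reads (writing $\pi = p(x_i = 1)$, which is my notation here):

$$
\log p(y_i) = \operatorname{log\_sum\_exp}\big(\log \pi + \log p(y_i \mid x_i = 1),\; \log(1 - \pi) + \log p(y_i \mid x_i = 0)\big).
$$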

Best Answer

There is confusion and also some wrong math.

The gist is that in your example marginalizing unobserved responses doesn't make sense: the missing observations don't contribute to the likelihood and so they don't contribute to the posterior.

Note: This is not necessarily always the case. In the comments you point to a different example from the Stan manual, where observations are censored below a threshold $U$, so the proportion of missing observations contains information about how close the threshold $U$ is to the population location parameter $\mu$.

So let's assume that observations are missing completely at random or missing at random (that is, the missingness can be explained by what we know, e.g. the predictors $x$, but doesn't depend on what we don't know, e.g. the parameters $\theta$).

In your example we can integrate out $p(y^{miss} \mid x^{obs},\theta)$ because the support of the probability density function / probability mass function is not constrained.

$$ \int p(y^{miss} | x^{obs},\theta)\pi(\theta)~dy^{miss} \propto \pi(\theta|x^{obs}) = \pi(\theta) $$
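For the Bernoulli case the integral is just a two-term sum, and it collapses to a constant:

$$
\sum_{y_i^{miss} \in \{0, 1\}} p(y_i^{miss} \mid \eta_i) = \eta_i + (1 - \eta_i) = 1,
$$

so each missing response multiplies the posterior by 1 and leaves it unchanged.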

Contrast this with why it is useful to marginalize out unobserved predictors for observed responses. These extra observations do contain information about the model parameters.

$$ \int p(y^{obs} \mid x^{miss},\theta)\,\pi(\theta)~dx^{miss} \propto \pi(\theta \mid y^{obs}) $$

Note: The missing observations do tell us something about the data collection process. For example, in your simulation you use them to estimate the missingness rate $\rho$.

Now let's look at the two analyses in more detail.

Option 1

There is an error right at the start of your derivation. By the law of total probability:
$$
p(y_i^{miss} = 1 \mid \eta_i)\,p(\eta_i) + p(y_i^{miss} = 0 \mid \eta_i)\,p(\eta_i) = p(\eta_i)
$$
Since we didn't observe $y_i^{miss}$, the posterior for $\eta_i$ is the same as the prior. So don't add the log_mix component to the target. You will get valid posterior estimates for $\alpha$, $\beta$ and $\rho$.
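Concretely: the correct marginal contribution of a missing Bernoulli response is $\eta_i + (1 - \eta_i) = 1$, i.e. zero on the log scale, whereas the log_mix statement adds

$$
\log\!\big(\eta_i^2 + (1 - \eta_i)^2\big),
$$

which is smallest at $\eta_i = 0.5$ and largest at $\eta_i \in \{0, 1\}$. It therefore rewards extreme fitted probabilities for the missing points, which is consistent with the inflated $|\beta|$ in the first plot.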

Option 2

You introduce additional parameters $\theta_i$ for the missing data points. Essentially, the model for a missing observation becomes:

$$ \left[\theta_i\eta_i\right]^{y_i^{miss}}\left[(1-\theta_i)(1-\eta_i)\right]^{1-y_i^{miss}} $$

As you point out, this doesn't make sense. The success probability in this over-parameterized model is $\theta_i\eta_i$ and the failure probability is $(1-\theta_i)(1-\eta_i)$, so the probabilities don't sum to 1 (they are not normalized). And only the $\eta_i$ depend on $x_i$.

Still, it seems to "work" in the sense that you get correct inferences for $\alpha$, $\beta$ and $\rho$. Why? Again, the missing observations are irrelevant as they bring no information. It doesn't really matter how you choose to parameterize their probabilities of success and failure. And it's easy to check that the posterior of the extra parameters $\theta_i$ is the same as their prior.
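One way to carry out that check, assuming the Option 2 fit object is called fit2 (a placeholder name), is to compare the posterior draws of one of the $\theta_i$ with the Beta(1, 1) prior:

# Compare the posterior of theta[1] with its Beta(1, 1) prior;
# per the argument above it should stay close to the prior
theta_draws <- as.vector(fit2$draws(variables = "theta[1]"))
hist(theta_draws, breaks = 30, freq = FALSE,
     main = "theta[1]: posterior draws vs. Beta(1, 1) prior", xlab = "theta[1]")
curve(dbeta(x, 1, 1), from = 0, to = 1, add = TRUE)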
