Solved – Posterior predictive checking for Bernoulli distributed data

bayesian, markov-chain-montecarlo, model-selection, posterior, stan

I am fitting a logistic regression model in rstan with 6 continuous predictor variables (6 coefficients + intercept) and a binary response variable $y$, for $N$ observations. After specifying the model and defining the priors for the vector of coefficients $\beta$, I use a generated quantities block to compute a vector $y^{rep}$ of the same length as my response vector $y$, drawing each element from bernoulli_logit_rng with the success probability defined, for each observation, by the product of my observed data and the model coefficients.

generated quantities {
  // Posterior predictive draws: one replicated outcome per observation,
  // generated with the coefficient values of the current MCMC iteration.
  // N, X and beta are declared in the data and parameters blocks.
  int y_rep[N];

  for (n in 1:N) {
    y_rep[n] = bernoulli_logit_rng(X[n] * beta);
  }
}

My current understanding is that, at each MCMC iteration, a random draw is produced for every observation using that iteration's coefficient values. For example, with 1000 post-warmup draws I will have a matrix of size $1000 \times N$, where each column contains the 1000 Bernoulli draws for that observation across the MCMC iterations.
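This matrix can be extracted from the fitted object in R; a minimal sketch, assuming the stanfit object is called fit and the generated quantity is named y_rep as above:

library(rstan)

# S x N matrix of replicated outcomes: one row per post-warmup draw,
# one column per observation
y_rep <- as.matrix(fit, pars = "y_rep")
dim(y_rep)  # e.g. 1000 x N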

However, I am not really sure how to use this information to perform a posterior predictive check. First, I thought I could look at each column of the matrix and choose a final outcome for each $y^{rep}_i, i \in 1, \ldots, N$ by selecting the most common class among the 1000 draws, giving a single final outcome per observation (either $y^{rep}_i = 1$ or $y^{rep}_i = 0$).
Then, if this vector of final outcomes $y^{'rep}$ is similar to my observed $y$, it would indicate that the model is adequate. I checked this on my data and found 98% agreement between $y$ and $y^{'rep}$.
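For concreteness, a minimal sketch of this majority-vote comparison in R, assuming y_rep is the draws matrix from above and y is the observed response vector:

# Most common class per observation across the draws
# (ties broken towards 1)
y_rep_mode <- as.integer(colMeans(y_rep) >= 0.5)

# Proportion of observations where the majority-vote replicate matches y
mean(y_rep_mode == y)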

Gelman et al. (2014) mention on p.143 that:

If the model fits, then replicated data generated under the model
should look similar to observed data. To put it another way, the
observed data should look plausible under the posterior predictive
distribution. This is really a self-consistency check: an observed
discrepancy can be due to model misfit or chance.

From this, I understand them to mean that each $y^{rep}_i$ should look similar to the corresponding $y_i$. Then they say:

Our basic technique for checking the fit of a model to data is to draw
simulated values from the joint posterior predictive distribution of
replicated data and compare these samples to the observed data. Any
systematic differences between the simulations and the data indicate
potential failings of the model.

I am not sure I get this last idea: does this mean that I should draw randomly from my matrix of $y^{rep}$ values and expect the proportion of 0s and 1s to be similar to that in my observed $y$? I have also done this for my data: I sampled 10000 values from the matrix of $y^{rep}$ values and found that the proportion of 0s and 1s is almost identical to the proportion in my observed data $y$.
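Written out, this pooled comparison is just (a sketch, using the same y_rep and y as above):

# Proportion of 1s among 10000 values sampled from the y_rep matrix
mean(sample(y_rep, 10000))

# Proportion of 1s in the observed data
mean(y)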

My question is: what is the correct way to perform the posterior predictive check?

Best Answer

I wouldn't try to pick a single y_rep_i value for each i; instead, look at full distributions or summaries of the distributions. To that end, check out the bayesplot package (also from the Stan Development Team), which has a lot of functionality for PPCs and several vignettes that cover these concepts and provide examples. You can also get good help on the Stan forum (http://discourse.mc-stan.org).
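As a rough illustration of the kind of check bayesplot supports (a sketch, not a prescription; the object names fit, y and y_rep are carried over from the question as assumptions), one could do something like:

library(rstan)
library(bayesplot)

# S x N matrix of replicated outcomes
y_rep <- as.matrix(fit, pars = "y_rep")

# Compare the observed counts of 0s and 1s to the distribution of
# counts across the replicated datasets
ppc_bars(y, y_rep)

# Compare the observed proportion of 1s to its posterior predictive
# distribution (one proportion per replicated dataset)
ppc_stat(y, y_rep, stat = "mean")

Grouped versions such as ppc_bars_grouped and ppc_stat_grouped do the same comparison within subsets of the data, which is often where misfit shows up.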
