What's the Bayesian equivalent of a general goodness-of-fit test?

Tags: bayesian, goodness-of-fit

I have two data sets, one from a set of physical observations (temperatures), and one from an ensemble of numerical models. I'm doing a perfect-model analysis, assuming the model ensemble represents a true, independent sample, and checking to see if the observations are drawn from that distribution. The statistic I've calculated is normalised, and should theoretically be a standard normal distribution. Of course it's not perfect, so I want to test for goodness of fit.

Using frequentist reasoning, I could calculate a Cramér-von Mises or Kolmogorov-Smirnov statistic (or similar), and look up the value in a table to get a p-value, which would help me decide how unlikely the value I see is under the null hypothesis that the observations come from the same distribution as the model.
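For concreteness, here is a minimal sketch of that frequentist check using SciPy, assuming the normalised statistic values live in a hypothetical array `stat_values` (the random draw below is just a placeholder for the real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
stat_values = rng.normal(size=200)          # placeholder for the real normalised statistic

# Kolmogorov-Smirnov test against the standard normal
ks = stats.kstest(stat_values, "norm")

# Cramér-von Mises test against the standard normal
cvm = stats.cramervonmises(stat_values, "norm")

print(f"KS:  statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")
print(f"CvM: statistic={cvm.statistic:.3f}, p-value={cvm.pvalue:.3f}")
```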

What would the Bayesian equivalent of this process be? That is, how do I quantify the strength of my belief that these two distributions (my calculated statistic and the standard normal) are different?

Best Answer

I would suggest the book Bayesian Data Analysis (in particular Chapter 6) as a great source for this question and for everything I am about to say. One of the usual ways that Bayesians attack this problem is by using posterior predictive p-values (PPPs). Before I jump into how PPPs would solve this problem, let me first define the following notation:

Let $y$ be the observed data and $\theta$ be the vector of parameters. We define $y^{\text{rep}}$ as the replicated data that could have been observed, or, to think predictively, as the data we would see tomorrow if the experiment that produced $y$ today were replicated with the same model and the same value of $\theta$ that produced the observed data.

Note that we define the distribution of $y^{\text{rep}}$, given the current state of knowledge, by the posterior predictive distribution $$p(y^{\text{rep}}|y)=\int_\Theta p(y^{\text{rep}}|\theta)p(\theta|y)\,d\theta.$$
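To make the integral concrete, here is a minimal sketch assuming a simple normal model for $y$ and hypothetical posterior draws `mu_draws` and `sigma_draws` (placeholders standing in for the output of a real sampler). Integrating over $\theta$ corresponds to drawing $\theta$ from the posterior and then drawing $y^{\text{rep}}$ given that $\theta$:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)                     # placeholder for the observed data

# Hypothetical posterior draws of theta = (mu, sigma); in a real analysis these
# would come from your sampler (MCMC, conjugate updates, etc.).
L = 1000
mu_draws = rng.normal(y.mean(), y.std(ddof=1) / np.sqrt(len(y)), size=L)
sigma_draws = np.full(L, y.std(ddof=1))      # sigma treated as known for brevity

# One replicated data set per posterior draw: y_rep[l] ~ p(y_rep | theta_l),
# so the collection of y_rep[l] is a sample from p(y_rep | y).
y_rep = np.stack([rng.normal(mu, sigma, size=len(y))
                  for mu, sigma in zip(mu_draws, sigma_draws)])
```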

Now, we can measure the discrepancy between the model and the data by defining test quantities, the aspects of the data we wish to check. A test quantity, or discrepancy measure, $T(y,\theta)$, is a scalar summary of parameters and data that is used as a standard when comparing data to predictive simulations. Test quantities play the role in Bayesian model checking that test statistics play in classical testing. We define the notation $T(y)$ for a test statistic, which is a test quantity that depends only on data; in the Bayesian context, we can generalize test statistics to allow dependence on the model parameters under their posterior distribution.

Classically, the p-value for the test statistic $T(y)$ is $$p_C=\text{Pr}(T(y^{\text{rep}})\geq T(y)|\theta)$$ where the probability is taken over the distribution of $y^{\text{rep}}$ with $\theta$ fixed.

From a Bayesian perspective, lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability, or p-value, of the test quantity, and computed using posterior simulations of $(\theta,y^{\text{rep}})$. In the Bayesian approach, test quantities can be functions of the unknown parameters as well as data because the test quantity is evaluated over draws from the posterior distribution of the unknown parameters.

Now, we can define the Bayesian p-value (PPP) as the probability that the replicated data could be more extreme than the observed data, as measured by the test quantity: $$p_B=\text{Pr}(T(y^{\text{rep}},\theta)\geq T(y,\theta)\,|\,y),$$ where the probability is taken over the posterior distribution of $\theta$ and the posterior predictive distribution of $y^{\text{rep}}$ (that is, the joint distribution $p(\theta,y^{\text{rep}}|y)$): $$p_B=\iint_\Theta I_{T(y^{\text{rep}},\theta)\geq T(y,\theta)}\,p(y^{\text{rep}}|\theta)p(\theta|y)\,dy^{\text{rep}}d\theta,$$ where $I$ is the indicator function. In practice, though, we usually compute this probability using posterior simulations.

If we already have, say, $L$ simulations from the posterior distribution of $\theta$, then we can just draw one $y^{\text{rep}}$ from the predictive distribution for each simulated $\theta$; we now have $L$ draws from the joint posterior distribution, $p(y^{\text{rep}},\theta|y)$. The posterior predictive check is the comparison between the realized test quantities $T(y,\theta^l)$ and the predictive test quantities $T(y^{\text{rep}l},\theta^l)$. The estimated p-value is just the proportion of these $L$ simulations for which the test quantity equals or exceeds its realized value; that is, for which $$T(y^{\text{rep}l},\theta^l)\geq T(y,\theta^l)$$ for $l=1,...,L$.
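Here is a minimal, self-contained sketch of that procedure, reusing the toy normal model and hypothetical posterior draws from the sketch above; the test quantity (the largest standardised value) is purely an illustrative choice, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)                     # placeholder for the observed data

# Hypothetical posterior draws of theta = (mu, sigma), as in the sketch above.
L = 1000
mu_draws = rng.normal(y.mean(), y.std(ddof=1) / np.sqrt(len(y)), size=L)
sigma_draws = np.full(L, y.std(ddof=1))

def T(data, mu, sigma):
    """Illustrative scalar test quantity: the largest standardised value."""
    return np.max((data - mu) / sigma)

# Compare the realised test quantity T(y, theta_l) with the replicated one
# T(y_rep_l, theta_l) for each posterior draw l = 1, ..., L.
exceed = 0
for mu, sigma in zip(mu_draws, sigma_draws):
    y_rep = rng.normal(mu, sigma, size=len(y))   # one draw from p(y_rep | theta_l)
    if T(y_rep, mu, sigma) >= T(y, mu, sigma):
        exceed += 1

p_B = exceed / L                             # estimated posterior predictive p-value
print(f"Estimated PPP: {p_B:.3f}")
```

A PPP close to 0 or 1 indicates that the observed data would be surprising under the posterior predictive distribution in the direction measured by $T$.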

In contrast to the classical approach, Bayesian model checking does not require special methods to handle "nuisance parameters." By using posterior simulations, we implicitly average over all the parameters in the model.

As an additional source, Andrew Gelman also has a very nice paper on PPPs here: http://www.stat.columbia.edu/~gelman/research/unpublished/ppc_understand2.pdf
