Hypothesis testing with Gaussian process regression

gaussian process, hypothesis testing, p-value

Say I have two sets of observations (data points), $\mathbf U_1$ and $\mathbf U_2$. Each dataset consists of a collection of points (not necessarily the same number in each), e.g. $\mathbf U_1=\{\mathbf u_{1,1},\dots,\mathbf u_{1,N}\}$ where $\mathbf u_{1,i} = (x_{1,i},y_{1,i})$.

I want to model the relationship between $x$ and $y$ using a Gaussian process, and ask whether the inferred (distributions of) functions for $\mathbf U_1$ and $\mathbf U_2$ are more different than one would expect by chance.

Is there an efficient way to do this?

In more detail:

After calculating my GP regression, I'll have a model for $y_1 = f_1(x_1)$ and $y_2 = f_2(x_2)$, where

$$
f_1 \sim \mathcal{GP}[\mu_1(x),\Sigma_1(x,x')]
$$

$$
f_2 \sim \mathcal{GP}[\mu_2(x),\Sigma_2(x,x')]
$$

The prior kernel can be the same for both regressions, or it can be adapted separately for each of them (whichever makes answering this question easier). We can assume that the prior kernel(s) are parameterized by a limited set of hyperparameters and might be somewhat mis-specified.
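For concreteness, here is a minimal sketch of the setup I have in mind, assuming scikit-learn, an RBF kernel, and made-up toy data (all of these are illustrative choices, not part of my actual problem):

```python
# Illustrative only: fit separate GPs to U1 and U2 with the same kernel family.
# The data, kernel, and variable names are assumptions for the sake of a sketch.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X1 = rng.uniform(0, 1, size=(50, 1)); y1 = np.sin(6 * X1[:, 0]) + 0.1 * rng.standard_normal(50)
X2 = rng.uniform(0, 1, size=(40, 1)); y2 = np.sin(6 * X2[:, 0]) + 0.1 * rng.standard_normal(40)

kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01)
gp1 = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X1, y1)  # hyperparameters adapted to U1
gp2 = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X2, y2)  # hyperparameters adapted to U2

Xs = np.linspace(0, 1, 200)[:, None]          # common evaluation grid
mu1, sd1 = gp1.predict(Xs, return_std=True)   # posterior mean/std for f1
mu2, sd2 = gp2.predict(Xs, return_std=True)   # posterior mean/std for f2
```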

Specifically, I'd like to be able to ask: Assume for a null hypothesis that $f_1$ and $f_2$ are the same. If I infer $(\mu_1,\Sigma_1)$ and $(\mu_2,\Sigma_2)$ and find that the two sets of parameters are slightly different, what is the probability that these differences would be observed by chance under the null hypothesis?

Practical constraints

In the problem I'm working with at the moment, GP regression is just barely practical after applying the usual bag of numerical tricks, so any sort of brute-force sampling would be unpleasant. For example, the full-rank posterior covariance might not fit in memory, but I can work with an approximation of it and compute matrix-vector products, etc. Cubic (or higher) complexity in $N$ is right out, and even if we collapse the data to $M$ pseudo-points, cubic complexity in $M$ still won't do.
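To be concrete about the kind of approximation I mean, a low-rank-plus-diagonal representation of the posterior covariance supports cheap matrix-vector products without ever forming the $N\times N$ matrix (the names below are purely illustrative):

```python
# Sketch of an implicit low-rank-plus-diagonal covariance, Sigma ~ U U^T + diag(d),
# which needs O(N M) memory and gives O(N M) matrix-vector products.
# U (N x M) and d (length N) are assumed to come from some approximate inference scheme.
import numpy as np

def sigma_matvec(U, d, v):
    """Compute (U @ U.T + np.diag(d)) @ v without forming the N x N matrix."""
    return U @ (U.T @ v) + d * v
```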

Furthermore, we can assume that the individual observations $\mathbf u_{\cdot,i}$ are not independent, and almost surely have some weird dependencies between them (as well as dependencies on other, un-measured variables). So, I don't think the answers to this question can help.

If it matters: I'm not actually working with a linear Gaussian process, but rather with nonlinear, non-Gaussian observations of a latent Gaussian process. I'm fitting the posterior using variational Bayes. One can, if needed, divide out the prior from the posterior to get some Gaussian pseudodata, which allows this to be viewed as a linear GP regression on the pseudodata (but now with a rather more complicated model for the observation errors).
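(For reference, the "divide out the prior" step is just Gaussian division: assuming a zero-mean prior $\mathcal N(\mathbf 0,\mathbf K)$ over the latent values and a variational posterior $\mathcal N(\mathbf m,\mathbf S)$,

$$
q(\mathbf f)/p(\mathbf f) \;\propto\; \exp\!\Big[-\tfrac12 \mathbf f^\top \boldsymbol\Lambda \mathbf f + \mathbf f^\top \boldsymbol\eta\Big],
\qquad
\boldsymbol\Lambda = \mathbf S^{-1}-\mathbf K^{-1},\quad
\boldsymbol\eta = \mathbf S^{-1}\mathbf m,
$$

so the pseudodata are $\mathbf z = \boldsymbol\Lambda^{-1}\boldsymbol\eta$ with effective noise covariance $\boldsymbol\Lambda^{-1}$.)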

A possibly equivalent question:

GP regression returns a distribution over possible functions $y=f(x)$. However, inference is often approximate, and the kernel typically has a restrictive parameterization in terms of hyperparameters which might not be quite correct.

Under what conditions is it possible to interpret the GP posterior on $f(x)$ as a "probably true" distribution of $f(x)$, such that it is meaningful to ask: what is the probability that $\|f(x)-g(x)\|<\epsilon$, for some other known and specified function $g(x)$?
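To make the pointwise version of this concrete: if one does take the posterior marginals at face value, the probability that $f(x)$ lies within $\epsilon$ of $g(x)$ at a single point is just a Gaussian interval probability. The sketch below assumes the `mu1`, `sd1`, `Xs` names from the earlier sketch; a joint statement about $\|f-g\|$ over the whole domain would instead need the full posterior covariance or posterior samples.

```python
# Pointwise only, taking the GP posterior at face value (which is exactly what is in doubt here).
import numpy as np
from scipy.stats import norm

def prob_within_eps(mu, sd, g_vals, eps):
    """P(|f(x) - g(x)| < eps) at each grid point, with marginal posterior f(x) ~ N(mu, sd^2)."""
    d = mu - g_vals
    return norm.cdf((eps - d) / sd) - norm.cdf((-eps - d) / sd)

# e.g. with mu1, sd1 on the grid Xs from the earlier sketch, and a candidate g(x) = sin(6x):
# p = prob_within_eps(mu1, sd1, np.sin(6 * Xs[:, 0]), eps=0.1)
```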

It seems to me that the outcome of GP regression can't be interpreted in this way in general, since a mis-specified prior or observation model can lead to wildly inaccurate estimates of $f(x)$ and the uncertainty therein.

But very few models can achieve a complete and accurate specification of the prior covariance. It would seem, then, that what we need are suitable bounds that let us make a reasonable guess as to how wrong our posterior could be, and how confident we should be regarding questions about $\|f(x)-g(x)\|$.

Edit 1:

Since one comment asked me to explain the null hypothesis, I'll restate it here more compactly:

I have two datasets, $\mathbf U_1$ and $\mathbf U_2$; these consist of paired observations $(x,y)$. I'm interested in modelling $y = f(x)$. I use Gaussian process regression to estimate $f_1 \sim \mathcal{GP}[\mu_1(x),\Sigma_1(x,x')]$ and $f_2 \sim \mathcal{GP}[\mu_2(x),\Sigma_2(x,x')]$ from $\mathbf U_1$ and $\mathbf U_2$, respectively.

I've observed that the inferred GP posterior distributions for $f_1$ and $f_2$ are substantially different. However, I'm not sure if this is due to chance. It could be that there is significant uncertainty in the GP regression itself.

My null hypothesis is that both datasets are sampled from the same function, $f_1=f_2=f$, and that the observed differences in the inferred posterior distributions are due to noise. What tests/procedures are available for me to reject this null hypothesis? If additional assumptions or modifications to my approach are required to make this concrete, what are they?
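One naive baseline, ignoring the scalability and dependence caveats above, is a permutation test: pool the two datasets, repeatedly reshuffle the group labels, refit, and see how often the reshuffled fits differ as much as the real ones. The test statistic below (mean squared difference of the posterior means on a grid) is an arbitrary but reasonable choice, and the test assumes the points are exchangeable across groups, which the dependencies I mentioned may violate.

```python
# Naive permutation-test sketch for the null hypothesis f1 = f2.
# Refits a GP for every permutation, so it ignores the practical constraints above.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_mean(X, y, Xs):
    k = RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01)   # assumed kernel family
    return GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y).predict(Xs)

def diff_stat(X1, y1, X2, y2, Xs):
    # mean squared difference of the two posterior means on a common grid
    return np.mean((fit_mean(X1, y1, Xs) - fit_mean(X2, y2, Xs)) ** 2)

def permutation_pvalue(X1, y1, X2, y2, Xs, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    n1 = len(y1)
    observed = diff_stat(X1, y1, X2, y2, Xs)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(y))
        a, b = idx[:n1], idx[n1:]
        if diff_stat(X[a], y[a], X[b], y[b], Xs) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # permutation p-value with add-one correction
```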

Best Answer

This seems to be an open problem. I did not find any literature on hypothesis testing per se, but I did find two recent works that obtain confidence bounds on the inferred Gaussian process posterior.

Christian Fiedler, Carsten W. Scherer, and Sebastian Trimpe (2021). "Practical and Rigorous Uncertainty Bounds for Gaussian Process Regression."

Armin Lederer, Jonas Umlauft, and Sandra Hirche (2019). "Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control."

These allow one to state confidence bounds on the identified kernel and other model parameters, and translate them into confidence bounds on the inferred posterior. However, I ended up not using this approach, as it is somewhat complicated.

These papers are quite technical (for me, anyway), so I would not be able to summarize them here any more clearly or succinctly than the original authors do.
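If one only wants a crude, heuristic stand-in rather than the rigorous bounds from those papers, one possibility is to inflate the two posterior bands by some factor and check whether they separate anywhere on a grid. The inflation factor below is an arbitrary assumption, and this is emphatically not the construction used by Fiedler et al. or Lederer et al.:

```python
# Heuristic only: widen each posterior band by a factor beta and check for separation.
# beta is a user-chosen inflation factor, not a calibrated quantity from the cited papers.
import numpy as np

def bands_separate(mu1, sd1, mu2, sd2, beta=2.0):
    """True if the inflated bands [mu - beta*sd, mu + beta*sd] are disjoint at some grid point."""
    lo1, hi1 = mu1 - beta * sd1, mu1 + beta * sd1
    lo2, hi2 = mu2 - beta * sd2, mu2 + beta * sd2
    return bool(np.any((hi1 < lo2) | (hi2 < lo1)))
```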
