Solved – How to calibrate the point-by-point variances for Gaussian process regression

gaussian process, python, regression, time series

I have a large-ish set of unevenly-spaced time-series data from instruments around the world, for which I'm using Gaussian process regression to do interpolation and short-term future prediction. My kernel is a combination of squared-exponential and periodic components to model the trends at different time-scales, and it's working reasonably well. At this point I'm working on optimizing the model and, in particular, making sure that the variances that come out of it are reasonable.

The measurements I'm using as input come with a "confidence score" on an arbitrary scale from 0 (worst) to 100 (best), and I would like to pass these as per-point variances into the GPR (the alpha parameter of sklearn's GaussianProcessRegressor). However, I have no guidance on how much error a given confidence score implies, so I need to work out for myself a function that maps CS to variance. What's the best way to go about this?
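For concreteness, here's roughly how I'd wire per-point variances into scikit-learn once I have such a mapping; the `cs_to_variance` below is just a made-up placeholder, and calibrating it is exactly what I'm asking about:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

def cs_to_variance(cs):
    # Placeholder guess (higher confidence -> lower variance); this is the
    # mapping I actually need to calibrate.
    return np.exp(-cs / 25.0)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)      # toy, unevenly spaced timestamps
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)   # toy measurements
cs = rng.integers(0, 101, 200)                           # toy confidence scores

kernel = RBF(length_scale=1.0) + ExpSineSquared(length_scale=1.0, periodicity=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=cs_to_variance(cs))  # per-point noise variances
gpr.fit(X, y)
```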

So far I've had the fairly obvious idea of taking batches of the data, fitting the GPR model to each batch, recording for each point in the batch its CS and the difference between the measurement and the predicted mean at the same point, and then doing curve fitting to map CS to mean squared error. However, the devil is in the details. Am I better off including all points, or using a held-out set? Should I iterate, doing successive runs with variances learned from the previous epoch, or use fixed variances throughout? (I'm worried that the former approach could diverge, because larger input variances lead to larger output variances, which in turn lead to learning larger input variances.) Should I be doing something more Bayesian with log-likelihood, or otherwise take into account where a point falls in the predicted distribution rather than just its distance from the predicted mean? I'm afraid I've gone a bit beyond my mathematical knowledge here.

In case it's relevant, I'm using scikit-learn GaussianProcessRegressor in my actual prediction, but am open to pretty much anything.

Best Answer

Standard Gaussian process (GP) regression assumes constant noise variance, whereas it seems you want to allow it to vary. So, this is a heteroscedastic GP regression problem. Similar problems have been addressed in the literature (see references below). For example, Goldberg et al. (1998) treat the noise variance as an unknown function of the input, modeled with a second GP.

What distinguishes the problem here from those papers is that the noise variance here is a function of given confidence scores, rather than a direct function of the inputs. A principled way of attacking the problem is to simultaneously learn a function mapping confidence scores to noise variance, together with a GP regression model using those noise variances. Below, I'll describe one way to do this, using an empirical Bayes approach.

Model

Let $X=\{x_1, \dots, x_n\}$ denote the training inputs, with corresponding outputs in the vector $y \in \mathbb{R}^n$ and confidence scores in the vector $c = [c_1, \dots, c_n]^T$. Let's use the model:

$$y_i = f(x_i) + \epsilon_i$$

$$f \sim \mathcal{GP}(m, k)$$

$$\epsilon_i \sim \mathcal{N}(0, \sigma^2_i)$$

Outputs are related to inputs by an unknown, nonlinear function $f$. A GP prior is placed on $f$ with mean function $m$ and covariance function $k$ (with parameters $\theta$). I'll assume that the mean function is zero, as is common practice. But, an arbitrary mean function could be incorporated following the same steps as in standard GP regression. Each observed output $y_i$ is produced by adding Gaussian noise $\epsilon_i$ to the function output $f(x_i)$. Note that the noise is not identically distributed over data points; each has its own noise variance $\sigma^2_i$, which is assumed to be a function of the confidence score $c_i$:

$$\sigma^2_i = \exp g(c_i; \eta)$$

I'll call $g$ the 'confidence function', which is parameterized by some vector $\eta$. Given a confidence score, it outputs the log noise variance; this constrains the noise variance to be positive. If a good parametric form for $g$ is known, this can be used. Otherwise, let $g$ be a member of some flexible class of function approximators. For example, it could be a linear combination of basis functions, neural net, or spline (as in the example below). One could also constrain $g$ to be monotonically decreasing, so higher confidence scores always correspond to lower noise variance. Enforcing this assumption might make learning more efficient, but may or may not be necessary in practice.
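As a minimal sketch of the spline option, a piecewise-linear $g$ takes only a few lines of NumPy; the names `knots`, `eta`, and `noise_variance` below are purely illustrative:

```python
import numpy as np

# Piecewise-linear confidence function: eta holds the log noise variances at
# fixed knot locations spread over the confidence-score range (here 0-100).
knots = np.linspace(0.0, 100.0, 10)

def log_noise_variance(c, eta):
    """g(c; eta): linearly interpolate the log noise variance between knots."""
    return np.interp(c, knots, eta)

def noise_variance(c, eta):
    """sigma_i^2 = exp g(c_i; eta); the exponential keeps variances positive."""
    return np.exp(log_noise_variance(c, eta))
```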

Note that everything here is similar to standard GP regression. I've simply replaced the typical, constant noise variance with noise variance that's a function of the confidence score.

Learning

Fitting the model consists of finding values for the covariance function parameters $\theta$ and the confidence function parameters $\eta$. This can be done using empirical Bayes, which is a common strategy for GP regression. That is, maximize the (log) marginal likelihood of the observed outputs (a.k.a. the evidence):

$$\max_{\theta, \eta} \ \log p(y \mid X, \theta, \eta)$$

The marginal likelihood $p(y \mid X, \theta, \eta)$ is obtained by integrating over the noiseless function outputs, which are treated as latent variables since we can't directly observe them. I won't derive it here for space reasons. The steps are similar to the derivation for ordinary GP regression (see Rasmussen & Williams chapter 2), but with the noise variance modified as above. The marginal likelihood turns out to be Gaussian with mean zero and covariance matrix $C_y$:

$$\log p(y \mid X, \theta, \eta) = -\frac{n}{2} \log(2 \pi) - \frac{1}{2} \log \det(C_y) - \frac{1}{2} y^T C_y^{-1} y$$

$$C_y = K(X,X) + \text{diag}(\sigma_1^2, \dots, \sigma_n^2)$$

I use the notation $K(\cdot, \cdot)$ to denote a matrix obtained by evaluating the covariance function for all pairs of elements in two sets of points. So entry $(i,j)$ of $K(A, B)$ is equal to $k(a_i, b_j)$. Also, recall that the noise variances are a function of the confidence scores, so:

$$\text{diag}(\sigma_1^2, \dots, \sigma_n^2) = \text{diag} \Big( \exp g(c_1; \eta), \dots, \exp g(c_n; \eta) \Big)$$
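Here's a minimal sketch of this log marginal likelihood, assuming 1-D inputs, a squared-exponential covariance function, and the piecewise-linear confidence function from the sketch above (names are illustrative, not from any library):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

knots = np.linspace(0.0, 100.0, 10)

def noise_variance(c, eta):
    return np.exp(np.interp(c, knots, eta))          # sigma_i^2 = exp g(c_i; eta)

def sq_exp_kernel(a, b, length_scale, signal_var):
    # Squared-exponential covariance for 1-D input arrays a and b.
    return signal_var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def log_marginal_likelihood(theta, eta, X, y, c):
    length_scale, signal_var = theta
    C_y = sq_exp_kernel(X, X, length_scale, signal_var) + np.diag(noise_variance(c, eta))
    L, lower = cho_factor(C_y, lower=True)           # Cholesky factor of C_y, for stability
    alpha = cho_solve((L, lower), y)                 # C_y^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))       # log det(C_y)
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + log_det + y @ alpha)
```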

Predictive distribution

The posterior predictive distribution is similar to standard GP regression (again, see Rasmussen & Williams for the steps involved in the derivation). Suppose we've fit the model to the training data. Given $\tilde{n}$ new input points in $\tilde{X} = \{\tilde{x}_1, \dots, \tilde{x}_{\tilde{n}}\}$, we'd like to predict the corresponding noiseless function outputs $\tilde{f} = [f(\tilde{x}_1), \dots, f(\tilde{x}_{\tilde{n}})]^T$. These have a joint Gaussian posterior distribution:

$$p(\tilde{f} \mid \tilde{X}, X, y, \theta, \eta) \ = \ \mathcal{N}(\tilde{f} \mid \mu_{\tilde{f}}, C_{\tilde{f}})$$

with mean and covariance matrix:

$$\mu_{\tilde{f}} = K(\tilde{X},X) C_y^{-1} y$$

$$C_{\tilde{f}} = K(\tilde{X},\tilde{X}) - K(\tilde{X}, X) C_y^{-1} K(X, \tilde{X})$$

The new noisy outputs $\tilde{y} = [\tilde{y}_1, \dots, \tilde{y}_{\tilde{n}}]^T$ are assumed to be produced by adding Gaussian noise to the corresponding function outputs. So, they also have a joint Gaussian posterior distribution, with mean and covariance matrix:

$$\mu_\tilde{y} = \mu_{\tilde{f}}$$

$$C_{\tilde{y}} = C_{\tilde{f}} + \text{diag}(\tilde{\sigma}^2_1, \dots, \tilde{\sigma}^2_{\tilde{n}})$$

Note that computing the posterior predictive distribution for $\tilde{y}$ requires confidence scores for the new points in order to compute the new noise variances $\tilde{\sigma}^2_i$. But, confidence scores aren't required for the predictive distribution of $\tilde{f}$.
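A sketch of these predictive equations, reusing `sq_exp_kernel` and `noise_variance` from the sketches above (1-D inputs again assumed):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predict(X_new, c_new, X, y, c, theta, eta):
    length_scale, signal_var = theta
    C_y = sq_exp_kernel(X, X, length_scale, signal_var) + np.diag(noise_variance(c, eta))
    L, lower = cho_factor(C_y, lower=True)
    K_star = sq_exp_kernel(X_new, X, length_scale, signal_var)    # K(X~, X)
    mu_f = K_star @ cho_solve((L, lower), y)                      # predictive mean of f~
    V = cho_solve((L, lower), K_star.T)                           # C_y^{-1} K(X, X~)
    C_f = sq_exp_kernel(X_new, X_new, length_scale, signal_var) - K_star @ V  # cov of f~
    C_y_new = C_f + np.diag(noise_variance(c_new, eta))           # cov of the noisy y~
    return mu_f, C_f, C_y_new
```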

Example

Data generation

I generated 500 data points with the true function $f(x) = \sin(2 \pi x)$ and noise variance increasing quadratically with $x$. I generated a confidence score for each data point (ranging from 0 to 100) such that noise variance was a decreasing cubic function of the confidence score. In this example, I chose the noise variance and confidence scores to vary with $x$ for ease of visualization. But, note that the approach described above only requires that the noise variance be a function of the confidence score; no dependence on $x$ is required. Here's the data:

[Figure: scatter plot of the simulated data]
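A rough sketch of a data-generating process along these lines (the specific constants are arbitrary and chosen only to give the qualitative behaviour described, not to reproduce the plot exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, n)
f = np.sin(2.0 * np.pi * x)                  # true function
noise_var = 1e-3 + 0.5 * x**2                # noise variance grows quadratically in x
# Confidence score defined so that noise variance is a decreasing cubic function of CS.
cs = 100.0 * (1.0 - (noise_var / noise_var.max()) ** (1.0 / 3.0))
y = f + rng.normal(0.0, np.sqrt(noise_var))  # noisy observations
```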

Standard GP regression

Standard GP regression (using the squared exponential covariance function) can capture the conditional mean well. But, the predictive distribution isn't a good fit to the data because the assumption of constant noise variance doesn't hold.

[Figure: standard GP regression fit, with a constant-width predictive interval]
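For comparison, the homoscedastic baseline in scikit-learn might look something like this (reusing `x` and `y` from the data sketch above; the kernel hyperparameters shown are just starting values, which scikit-learn then optimizes by maximum marginal likelihood):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Squared-exponential kernel plus a single, constant noise level.
kernel = 1.0 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(x.reshape(-1, 1), y)

x_grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
mean, std = gpr.predict(x_grid, return_std=True)   # constant-width predictive band
```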

New model

I then modeled the data as described above, using the squared exponential covariance function. For the confidence function, I used a piecewise linear function (spline) with 10 fixed knot points spread evenly over the range of the confidence scores. The parameter vector $\eta$ of the confidence function contains the log noise variances to output at the knot points. Adjusting the parameters simply moves the confidence function up and down at these points. Given any confidence score, the log noise variance is obtained by linearly interpolating between the knot points. When optimizing the marginal likelihood, I initialized $\eta$ such that the noise variance would be constant, and equal to 1/10th of the overall sample variance of the training outputs.
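A minimal sketch of this joint fit, reusing `log_marginal_likelihood` (and its 10 knots) from the earlier sketch; kernel parameters are optimized in log space to keep them positive, and $\eta$ is initialized to the constant value just described:

```python
import numpy as np
from scipy.optimize import minimize

def fit(X, y, c, n_knots=10):
    # Initialize eta so the noise variance is constant at 1/10 of the sample
    # variance of the training outputs; theta = (length scale, signal variance).
    eta0 = np.full(n_knots, np.log(0.1 * np.var(y)))
    log_theta0 = np.zeros(2)

    def objective(params):
        theta = np.exp(params[:2])       # back-transform to positive kernel parameters
        eta = params[2:]
        return -log_marginal_likelihood(theta, eta, X, y, c)

    res = minimize(objective, np.concatenate([log_theta0, eta0]), method="L-BFGS-B")
    return np.exp(res.x[:2]), res.x[2:]  # fitted theta and eta
```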

The model captures both the conditional mean and variance of the noisy outputs, and the estimated confidence function approximates the true confidence function reasonably well:

[Figure: heteroscedastic GP fit, with the estimated confidence function compared to the true one]

References

Goldberg, P. W., Williams, C. K. I., & Bishop, C. M. (1998). Regression with input-dependent noise: A Gaussian process treatment. Advances in Neural Information Processing Systems 10. MIT Press.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.