Solved – Specifying separable covariance functions for 2D Gaussian process regression

Tags: gaussian-process, machine-learning, regression, scikit-learn

I would like to fit a Gaussian process regression with two input variables, but I am not sure how to construct or interpret the covariance function when there are multiple input dimensions.

There are different covariance functions (called kernels in sklearn) that you can choose based on the expected relationship. For example, you might choose a periodic covariance function when using day of the year to predict temperature, since temperature rises in summer, falls in winter, and then the cycle repeats.

If we have multiple inputs and we expect them to have different relationships with the response variable how do we include this information in the covariance structure? Do we need to?

Toy example

Two input variables x and w are used to predict y. y follows a sine relationship with x whereas it follows a quadratic relationship with w. There is noise, denoted e.

import numpy as np
import pandas as pd

n = 500
w = pd.Series((np.random.random(n) * 5) - 2.5)   # uniform on [-2.5, 2.5]
x = pd.Series(20 * np.random.random(n))          # uniform on [0, 20]
e = pd.Series(np.random.normal(0, 0.5, n))       # Gaussian noise

y = pd.Series(w**2 + np.sin(x) + e)

data = pd.concat([w, x, e, y], axis=1)
data.columns = ["w", "x", "e", "y"]

For a 1D case I would use the ExpSineSquared() kernel in scikit-learn for x and the squared exponential kernel RBF() for w. This paper [1] says that to extend covariance functions to multiple inputs:

take a covariance function that is the product of one-dimensional covariances over each input.

So I believe I would create a covariance function like so:

kernel = RBF() * ExpSineSquared() + WhiteKernel()

Hopefully RBF() is looking for a smooth function of y with w, ExpSineSquared() is looking for a periodic function with x, and WhiteKernel() is just there to capture unexplained variance. But as I haven't linked them to specific inputs, is it looking for both functions across both inputs?
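
For context, this is a minimal sketch of fitting that kernel exactly as written (both component kernels see both input columns); variable names follow the toy example above:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

kernel = RBF() * ExpSineSquared() + WhiteKernel()

# Nothing here ties RBF() to w or ExpSineSquared() to x, which is the
# crux of the question: both kernels operate on both columns.
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(data[["w", "x"]], data["y"])
print(gpr.kernel_)  # fitted hyperparameters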

References:

[1] Gaussian Processes for Time-series Modelling, S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson & S. Aigrain.

Best Answer

If we have multiple inputs and we expect them to have different relationships with the response variable how do we include this information in the covariance structure?

A general strategy is to 1) specify multiple covariance functions, where each depends on a subset of the inputs, then 2) combine these into a single covariance function. See below for details. In your example, you combined multiple covariance functions, but each treats all input features symmetrically. So, this won't give the behavior you're looking for.

Do we need to?

If you have prior knowledge that different inputs contribute to the output in different ways, then incorporating this information into the structure of the covariance function should make learning more efficient. Without doing this, it may be possible to succeed anyway, for certain covariance functions. For example, the RBF kernel allows universal function approximation. But, this may require much more data.

Defining covariance functions for separate features

Commonly used covariance functions often depend on all input features. For example, the periodic (exp-sine-squared) kernel is radially symmetric, as you can see on the left. This is a plot of the covariance function evaluated between each point and the point $[0,0]$. What you want is for the covariance function to depend on individual input features (and therefore be invariant to the others), as on the right.

[Figure: left, the periodic kernel evaluated between each point and $[0,0]$, which is radially symmetric; right, the same kernel evaluated on a single input feature, invariant to the other.]

One way to do this is simply to evaluate the covariance function on a single input feature (or subset of features). This is how I produced the plot on the right. I don't believe scikit-learn provides a way to do this in the context of Gaussian process regression, but it's perfectly valid.
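
As a rough illustration of this idea (outside GaussianProcessRegressor itself): scikit-learn kernels are callable, so you can evaluate each one on a single column and combine the resulting matrices by hand. The column order and hyperparameter values below are just assumptions based on the toy example (column 0 = x for the periodic kernel, column 1 = w for the squared exponential):

import numpy as np
from sklearn.gaussian_process.kernels import ExpSineSquared, RBF

X = data[["x", "w"]].to_numpy()

# Evaluate each kernel on one feature only (keep a 2D shape with [:, [i]]).
# periodicity = 2*pi is an assumption matching the sin(x) term in the toy data.
K_x = ExpSineSquared(length_scale=1.0, periodicity=2 * np.pi)(X[:, [0]])
K_w = RBF(length_scale=1.0)(X[:, [1]])

# Combine into a single covariance matrix, e.g. by addition or multiplication.
K_sum = K_x + K_w
K_prod = K_x * K_w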

Another approach is to scale the input features separately. Some covariance functions have length scale parameters that can be set individually for each feature. Setting a feature's length scale to infinity (or a very large value) makes the covariance function effectively invariant to that feature. For example, scikit-learn allows this with the squared exponential covariance function, but it isn't implemented yet for exp-sine-squared.
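
For instance, RBF accepts one length scale per input dimension. Continuing the toy example with columns ordered [x, w], a huge length scale for x makes the kernel effectively depend on w only (the values here are arbitrary, just a sketch of the idea):

from sklearn.gaussian_process.kernels import RBF

# Anisotropic RBF: one length scale per column ([x, w]).
# A huge length scale for x makes the kernel effectively invariant to x,
# so this kernel "sees" only w. "fixed" keeps it out of hyperparameter tuning.
k_w_only = RBF(length_scale=[1e6, 1.0], length_scale_bounds="fixed")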

Combining covariance functions

Once we've defined multiple covariance functions that depend differently on each feature, we must combine them into a single covariance function. Multiplication and addition are two common ways to do this. This page gives a good overview.

Multiplying two covariance functions gives $k(\cdot,\cdot) = k_1(\cdot,\cdot) k_2(\cdot,\cdot)$. In this case, $k$ will only take high values when both $k_1$ and $k_2$ take high values. Suppose we have two input features, and that $k_1$ depends only on feature $x$ and $k_2$ depends only on feature $y$. Then $k$ encodes the assumption that the function values at two points $(x,y)$ and $(x',y')$ will only be similar when $x$ is similar to $x'$ (according to $k_1$) and $y$ is similar to $y'$ (according to $k_2$).

Adding two covariance functions gives $k(\cdot,\cdot) = k_1(\cdot,\cdot) + k_2(\cdot,\cdot)$. In this case, $k$ will take high values when either $k_1$ or $k_2$ (or both) take high values. If $k_1$ and $k_2$ each depend on a single feature (as above), then a Gaussian process using the combined kernel $k$ (and constant mean function) is a distribution over additive functions of the individual features. That is: $f(x,y) = f_X(x) + f_Y(y)$. Your example function has this structure, so adding kernels would be a good choice there.
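
In scikit-learn this kernel algebra is written directly with * and +. Note that these composite kernels still pass all input columns to each component; restricting each component to one feature requires one of the approaches from the previous section. A small sketch of what the operators compute:

from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

k1 = ExpSineSquared()
k2 = RBF()

k_prod = k1 * k2   # Product kernel: k(a, b) = k1(a, b) * k2(a, b)
k_sum = k1 + k2    # Sum kernel:     k(a, b) = k1(a, b) + k2(a, b)

# Evaluated on data X, these equal the elementwise product / sum:
# np.allclose(k_prod(X), k1(X) * k2(X))  -> True
# np.allclose(k_sum(X),  k1(X) + k2(X))  -> True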

Example

I defined two covariance functions: a periodic function depending only on $x$, and a squared exponential function depending only on $y$. I combined them by either multiplying or adding. I then drew a sample function $f(x,y)$ from the resulting Gaussian process (with zero mean function). Multiplication is shown on the left and addition on the right:

[Figure: sample functions $f(x,y)$ drawn from the combined kernels; left, multiplication; right, addition.]

On the right, you can see that the function decomposes into additive functions of $x$ and $y$, as mentioned above. In contrast, there are local interaction effects on the left.
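
A minimal sketch of how such samples could be drawn; the grid size, length scales, and periodicity below are arbitrary choices, not necessarily the settings used for the figure:

import numpy as np
from sklearn.gaussian_process.kernels import ExpSineSquared, RBF

# 2D grid of (x, y) points.
g = np.linspace(-3, 3, 40)
xx, yy = np.meshgrid(g, g)
XY = np.column_stack([xx.ravel(), yy.ravel()])

# Per-feature covariance matrices: periodic in x, squared exponential in y.
K_x = ExpSineSquared(length_scale=1.0, periodicity=2.0)(XY[:, [0]])
K_y = RBF(length_scale=1.0)(XY[:, [1]])

rng = np.random.default_rng(0)
jitter = 1e-8 * np.eye(XY.shape[0])  # for numerical stability

# One sample each from the zero-mean GP with product / sum covariance.
f_prod = rng.multivariate_normal(np.zeros(XY.shape[0]), K_x * K_y + jitter)
f_sum = rng.multivariate_normal(np.zeros(XY.shape[0]), K_x + K_y + jitter)

# Reshape to the grid for plotting, e.g. with plt.pcolormesh.
f_prod = f_prod.reshape(xx.shape)
f_sum = f_sum.reshape(xx.shape)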
