Solved – Visualizing model fit for multidimensional data

Tags: data visualization, gaussian process

I am trying to use Gaussian processes to fit smooth functions to some data points. I am using the scikit-learn library for Python; in my case the inputs are two-dimensional spatial coordinates and the outputs are a transformed version of those, also 2-D spatial coordinates. I generated some dummy test data and tried to fit a GP model to it. The code I used is as follows:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import numpy as np

# Some dummy data: 2-D inputs, 2-D outputs (element-wise sine)
X = np.random.rand(10, 2)
Y = np.sin(X)

# Use the squared exponential kernel
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, Y)
# Evaluate on a single test point
test = np.array([[1.56, 0.92]])
y_pred, sigma = gp.predict(test, return_std=True)
print(test, np.sin(test))  # The true value
print(y_pred, sigma)  # The predicted value and the STD

I was wondering if there is a good way to visualize the model fit. As my input and output dimensions are both 2-D, I am not sure how to visualize it quickly to get an idea of the fit (in particular, I want to see the smoothness and variance of the model prediction between the training points). Most examples online are, of course, for the 1-D case.

Best Answer

I think a good approach in your case could be to

  1. Fit the multivariate GP model on a few training points, as you do now.
  2. Take advantage of the fact that you have the ground-truth function, so you can generate true and predicted values over a range of inputs.
  3. Plot comparisons of the "marginal" and "joint" outputs over these ranges.

Preparing 2-D inputs as a MATLAB-style meshgrid:

delta = 0.025
x = np.arange(-1, +1, delta)
y = np.arange(-1, +1, delta)
X, Y = np.meshgrid(x, y)

Generating predictions from the fitted GP model for all combinations of the 2-D inputs, then separating the two output dimensions into individual arrays for later use:

test = np.stack([np.ravel(X), np.ravel(Y)], axis=1)  # (n_points, 2) input grid
y_pred, sigma = gp.predict(test, return_std=True)
y_pred_fromX = y_pred[:, 0].reshape(X.shape)  # 1st output dimension on the grid
y_pred_fromY = y_pred[:, 1].reshape(X.shape)  # 2nd output dimension on the grid
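
As a quick numerical complement to the plots, you could also compare the predictions against the known ground truth over the whole grid. A minimal sketch (the `err_*` names are mine):

# Pointwise errors against the ground truth, per output dimension
err_1st = y_pred_fromX - np.sin(X)
err_2nd = y_pred_fromY - np.sin(Y)
print('RMSE 1st dim:', np.sqrt(np.mean(err_1st ** 2)))
print('RMSE 2nd dim:', np.sqrt(np.mean(err_2nd ** 2)))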

For the first dimension of the 2-D output, we plot the actual & predicted values as contours, with the axes representing the 2-D inputs:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.subplot(121)
plt.contour(X, Y, np.sin(X), 20)
plt.title('1st dim: True')
plt.subplot(122)
plt.contour(X, Y, y_pred_fromX, 20)
plt.title('1st dim: Predicted')

[Figure: contour plots of the 1st output dimension, true vs. predicted]

Same for the second dimension of the 2-D output:

plt.figure(figsize=(10,5))
plt.subplot(121)
plt.contour(X, Y, np.sin(Y), 20)
plt.title('2nd dim: True')
plt.subplot(122)
plt.contour(X, Y, y_pred_fromY, 20)
plt.title('2nd dim: Predicted')

[Figure: contour plots of the 2nd output dimension, true vs. predicted]
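
To see where the fit deviates spatially, you could also plot the pointwise errors computed above as filled contours (a sketch using matplotlib's `contourf`; `err_1st` and `err_2nd` are the helper arrays defined earlier):

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.contourf(X, Y, err_1st, 20)
plt.colorbar()
plt.title('1st dim: Predicted - True')
plt.subplot(122)
plt.contourf(X, Y, err_2nd, 20)
plt.colorbar()
plt.title('2nd dim: Predicted - True')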

Focusing on the 2-D output alone, scatterplots of joint occurrences are not particularly helpful. Here the axes are the 2-D output values:

plt.figure(figsize=(10,5))
plt.subplot(121)
plt.scatter(np.sin(X), np.sin(Y))
plt.title('True: scatterplot')
plt.subplot(122)
plt.scatter(y_pred_fromX, y_pred_fromY)
plt.title('Predicted: scatterplot')

[Figure: scatterplots of the joint outputs, true vs. predicted]

But Seaborn's jointplots are much more useful. Once again, the axes are the 2-D output values, and the plot shows a kernel density estimate:

import seaborn as sns

# jointplot expects 1-D arrays and creates its own figure,
# so flatten the grids and set the title on the returned grid
g = sns.jointplot(x=np.sin(X).ravel(), y=np.sin(Y).ravel(), kind='kde')
g.fig.suptitle('True: jointplot')
g = sns.jointplot(x=y_pred_fromX.ravel(), y=y_pred_fromY.ravel(), kind='kde')
g.fig.suptitle('Predicted: jointplot')

[Figures: joint KDE plots of the outputs, true vs. predicted]
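
Finally, since you specifically asked about the variance of the prediction between points: the `sigma` returned by `predict(..., return_std=True)` can be plotted over the same input grid. A minimal sketch, with the caveat that depending on your scikit-learn version `sigma` may have shape `(n_samples,)` or one column per output dimension:

# Predictive standard deviation over the input grid: low near training
# points, higher in between and outside the training range
S = np.asarray(sigma)
if S.ndim == 2:  # newer scikit-learn: one column per output dimension
    S = S[:, 0]
plt.figure(figsize=(5, 5))
plt.contourf(X, Y, S.reshape(X.shape), 20)
plt.colorbar()
plt.title('Predictive std of the GP')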
