Nonparametric Models – Understanding Why Gaussian Process Models Are Nonparametric

gaussian-process, nonparametric

I am a bit confused. Why are Gaussian processes called nonparametric models?

They do assume that the function values, or a subset of them, have a Gaussian prior with mean 0 and covariance given by a kernel function. These kernel functions themselves have some parameters (i.e., hyperparameters).

So why are they called nonparametric models?

Best Answer

I'll preface this by saying that it isn't always clear what one means by "nonparametric" or "semiparametric" etc. In the comments, it seems likely that whuber has some formal definition in mind (maybe something like choosing a model $M_\theta$ from some family $\{M_\theta: \theta \in \Theta\}$ where $\Theta$ is infinite dimensional), but I'm going to be pretty informal. Some might argue that a nonparametric method is one where the effective number of parameters you use increases with the data. I think there is a video on videolectures.net where (I think) Peter Orbanz gives four or five different takes on how we can define "nonparametric."

Since I think I know what sorts of things you have in mind, for simplicity I'll assume that you are talking about using Gaussian processes for regression, in a typical way: we have training data $(Y_i, X_i), i = 1, ..., n$ and we are interested in modeling the conditional mean $E(Y|X = x) := f(x)$. We write $$ Y_i = f(X_i) + \epsilon_i $$ and perhaps we are so bold as to assume that the $\epsilon_i$ are iid and normally distributed, $\epsilon_i \sim N(0, \sigma^2)$. We'll take $X_i$ to be one-dimensional, but everything carries over to higher dimensions.
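
To make this concrete, here is a minimal numerical sketch of exact GP regression in this setup. The squared-exponential kernel and all of the specific numbers below are illustrative choices of mine, not something the model dictates:

```python
import numpy as np

# Minimal sketch of exact GP regression: Y_i = f(X_i) + eps_i, eps_i ~ N(0, sigma^2).
# The squared-exponential kernel and the hyperparameter values are illustrative assumptions.

def sq_exp_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))"""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
n = 50
X = np.sort(rng.uniform(-3, 3, n))                 # training inputs X_i
sigma2 = 0.1                                       # noise variance sigma^2
Y = np.sin(X) + rng.normal(0, np.sqrt(sigma2), n)  # Y_i = f(X_i) + eps_i, with f = sin only for simulation

# Posterior mean of f at test points x*:  k(x*, X) (K + sigma^2 I)^{-1} Y
X_star = np.linspace(-3, 3, 200)
K = sq_exp_kernel(X, X)
K_star = sq_exp_kernel(X_star, X)
post_mean = K_star @ np.linalg.solve(K + sigma2 * np.eye(n), Y)
```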

If our $X_i$ can take values in a continuum then $f(\cdot)$ can be thought of as a parameter of (uncountably) infinite dimension. So, in the sense that we are estimating a parameter of infinite dimension, our problem is a nonparametric one. It is true that the Bayesian approach has some parameters floating about here and there. But really, it is called nonparametric because we are estimating something of infinite dimension. The GP priors we use assign mass to every neighborhood of every continuous function, so they can estimate any continuous function arbitrarily well.

The hyperparameters of the covariance function play a role similar to the smoothing parameters in the usual frequentist estimators: for the problem not to be completely hopeless, we have to assume there is some structure that we expect $f$ to exhibit. Bayesians accomplish this by using a prior on the space of continuous functions in the form of a Gaussian process. From a Bayesian perspective, we are encoding beliefs about $f$ by assuming $f$ is drawn from a GP with such-and-such covariance function. The prior effectively penalizes estimates of $f$ for being too complicated.
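
To see that smoothing role, here is a small sketch (again with arbitrary values of my own choosing) that draws prior samples from a GP with a squared-exponential kernel at two lengthscales; a short lengthscale gives wiggly sample paths and a long one gives smooth, slowly varying paths:

```python
import numpy as np

# Sketch: prior draws f ~ GP(0, k) for two lengthscales, to illustrate how the kernel
# hyperparameters act like smoothing parameters. All values here are arbitrary.

def sq_exp_kernel(x1, x2, lengthscale):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 200)
jitter = 1e-6 * np.eye(len(x))  # small diagonal term for numerical stability of the Cholesky factor

for ell in (0.1, 1.0):
    L = np.linalg.cholesky(sq_exp_kernel(x, x, ell) + jitter)
    sample = L @ rng.standard_normal(len(x))  # one draw from N(0, K)
    # Rougher draws have larger typical increments between neighbouring grid points.
    print(f"lengthscale={ell}: mean |increment| = {np.mean(np.abs(np.diff(sample))):.3f}")
```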

Edit for computational issues

Most (all?) of this material is in Gaussian Processes for Machine Learning by Rasmussen and Williams.

Computational issues are tricky for GPs. If we proceed naively, we will need $O(N^2)$ memory just to hold the covariance matrix and (it turns out) $O(N^3)$ operations to invert it. There are a few things we can do to make this more feasible. One option is to note that the quantity we really need is $v$, the solution to $(K + \sigma^2 I)v = Y$, where $K$ is the covariance matrix. The method of conjugate gradients solves this exactly in $O(N^3)$ computations, but if we satisfy ourselves with an approximate solution we can terminate the conjugate gradient algorithm after $k$ steps and do it in $O(kN^2)$ computations. We also don't necessarily need to store the whole matrix $K$ at once.
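
Here is a rough sketch of that conjugate-gradient idea. This is just a textbook CG loop (in practice you would reach for something like scipy.sparse.linalg.cg, possibly preconditioned), and the random test matrix at the bottom is only there to exercise it:

```python
import numpy as np

# Approximate solve of (K + sigma^2 I) v = Y by stopping conjugate gradients after k steps.
# Plain textbook CG, shown only to illustrate the O(k N^2) idea.

def cg_solve(A_mv, b, k, tol=1e-10):
    """Approximately solve A v = b with k CG steps; A_mv(x) computes A @ x."""
    v = np.zeros_like(b)
    r = b - A_mv(v)                     # residual
    p = r.copy()
    rs_old = r @ r
    for _ in range(k):
        Ap = A_mv(p)
        alpha = rs_old / (p @ Ap)
        v += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v

# Toy check on a random SPD stand-in for K + sigma^2 I:
rng = np.random.default_rng(4)
B = rng.standard_normal((100, 100))
K, sigma2, Y = B @ B.T, 1.0, rng.standard_normal(100)
A_mv = lambda x: K @ x + sigma2 * x        # each step only needs a matrix-vector product
v_approx = cg_solve(A_mv, Y, k=30)
print(np.linalg.norm(A_mv(v_approx) - Y))  # residual shrinks as k grows
```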

So we've moved from $O(N^3)$ to $O(kN^2)$, but this still scales quadratically in $N$, so we might not be happy. The next best thing is to work instead with a subset of the data, say of size $m$ where inverting and storing an $m \times m$ matrix isn't so bad. Of course, we don't want to just throw away the remaining data. The subset of regressors approach notes that we can derive the posterior mean of our GP as a regression of our data $Y$ on $N$ data-dependent basis functions determined by our covariance function; so we throw all but $m$ of these away and we are down to $O(m^2 N)$ computations.
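
A minimal sketch of the subset-of-regressors posterior mean, assuming a squared-exponential kernel and a random choice of the $m$ retained points (smarter selection schemes exist); the $m \times m$ solve is the only expensive linear algebra left:

```python
import numpy as np

# Subset of regressors: keep only m of the N basis functions k(., X_j).
# Posterior mean at x*: k_m(x*)^T (sigma^2 K_mm + K_mN K_Nm)^{-1} K_mN Y, at O(m^2 N) cost.
# The kernel, hyperparameters, and the random subset are illustrative assumptions.

def sq_exp_kernel(x1, x2, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sor_posterior_mean(X, Y, X_star, m, sigma2, rng):
    idx = rng.choice(len(X), size=m, replace=False)  # random subset of the data
    Xm = X[idx]
    K_mN = sq_exp_kernel(Xm, X)                      # m x N
    K_mm = sq_exp_kernel(Xm, Xm)                     # m x m
    A = sigma2 * K_mm + K_mN @ K_mN.T                # only an m x m system to solve
    weights = np.linalg.solve(A, K_mN @ Y)
    return sq_exp_kernel(X_star, Xm) @ weights

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, 1000))
Y = np.sin(X) + 0.3 * rng.standard_normal(len(X))
X_star = np.linspace(-3, 3, 100)
mean = sor_posterior_mean(X, Y, X_star, m=50, sigma2=0.09, rng=rng)
```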

A couple of other options exist. We could construct a low-rank approximation to $K$, setting $K = QQ^T$ where $Q$ is $N \times q$ and of rank $q$; it turns out that inverting $K + \sigma^2 I$ in this case can be done by instead inverting the $q \times q$ matrix $Q^TQ + \sigma^2 I$ (via the Woodbury matrix identity). Another option is to choose the covariance function to be sparse and use conjugate gradient methods - if the covariance matrix is very sparse then this can speed up computations substantially.
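
For the low-rank option, a quick sketch of the Woodbury-style solve; $Q$ below is just a random matrix standing in for whatever low-rank factor you obtained (e.g. from a Nystrom or truncated eigendecomposition):

```python
import numpy as np

# Low-rank trick: if K ~= Q Q^T with Q of shape (N, q), the Woodbury identity gives
#   (Q Q^T + sigma^2 I)^{-1} y = (y - Q (sigma^2 I_q + Q^T Q)^{-1} Q^T y) / sigma^2,
# so only a q x q matrix is ever inverted. Q here is a random stand-in.

def low_rank_solve(Q, sigma2, y):
    """Return (Q Q^T + sigma2 * I)^{-1} y using only a q x q solve."""
    q = Q.shape[1]
    inner = sigma2 * np.eye(q) + Q.T @ Q             # q x q
    return (y - Q @ np.linalg.solve(inner, Q.T @ y)) / sigma2

# Check against the direct O(N^3) solve on a small problem:
rng = np.random.default_rng(3)
N, q, sigma2 = 200, 10, 0.5
Q = rng.standard_normal((N, q))
y = rng.standard_normal(N)
direct = np.linalg.solve(Q @ Q.T + sigma2 * np.eye(N), y)
assert np.allclose(low_rank_solve(Q, sigma2, y), direct)
```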
