Solved – Gaussian process regression for large datasets

gaussian process, inference, machine learning, multivariate regression, probability

I've been learning about Gaussian process regression from online videos and lecture notes. My understanding is that if we have a dataset with $n$ points, we assume the data is sampled from an $n$-dimensional multivariate Gaussian. So my question is: does Gaussian process regression still work when $n$ is in the tens of millions? Won't the kernel matrix be huge, rendering the process completely inefficient? If so, are there techniques to deal with this, such as repeatedly sampling from the dataset? What are some good methods for handling such cases?

Best Answer

There is a wide range of approaches for scaling GPs to large datasets, for example:

Low Rank Approaches: these endeavor to construct a low-rank approximation to the covariance matrix. Perhaps the most famous is the Nyström method, which projects the data onto a subset of the points. Building on that, FITC and PITC were developed, which use pseudo (inducing) points rather than the observed points; these are implemented in, for example, the GPy Python library. Other approaches include random Fourier features.
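To make the low-rank idea concrete, here is a minimal NumPy sketch of a Nyström-style approximation of an RBF kernel matrix. The kernel, lengthscale, subset size, and random subset selection are all illustrative assumptions rather than any particular library's method; the point is only that an $n \times m$ and an $m \times m$ matrix stand in for the full $n \times n$ one.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
n, m = 2000, 100                      # n data points, m subset ("landmark") points
X = rng.uniform(-3, 3, size=(n, 1))

# Nystrom: pick a subset of m points and build two small kernel matrices.
idx = rng.choice(n, size=m, replace=False)
Z = X[idx]
K_nm = rbf_kernel(X, Z)               # n x m cross-covariance
K_mm = rbf_kernel(Z, Z)               # m x m covariance of the subset

# Rank-m approximation K ~= K_nm K_mm^{-1} K_nm^T. Downstream computations work
# with the factors; the full n x n matrix is only formed here to check the error.
K_approx = K_nm @ np.linalg.solve(K_mm + 1e-8 * np.eye(m), K_nm.T)
K_full = rbf_kernel(X, X)
print("relative error:", np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full))
```

FITC and PITC follow the same low-rank pattern but treat the pseudo-point locations as free parameters and handle the residual variance differently (diagonal vs. block-diagonal corrections).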

H-matrices: these use a hierarchical structuring of the covariance matrix and apply low-rank approximations to the submatrices in that structure. This is less commonly implemented in popular libraries.
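As a toy illustration of the hierarchical idea (only a two-level, non-recursive version of it), the sketch below keeps the diagonal blocks of the covariance matrix dense and compresses the off-diagonal block between two well-separated groups of points with a truncated SVD. Real H-matrix implementations do this recursively with admissibility criteria; the 1-D sorting, retained rank, and kernel here are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, size=(1000, 1)), axis=0)   # sorted 1-D inputs
half = len(X) // 2
X1, X2 = X[:half], X[half:]                               # two spatially separated groups

# Diagonal blocks stay dense; the off-diagonal block is compressed by a rank-r SVD.
K11 = rbf_kernel(X1, X1)
K22 = rbf_kernel(X2, X2)
K12 = rbf_kernel(X1, X2)
U, s, Vt = np.linalg.svd(K12, full_matrices=False)
r = 10                                                     # retained rank (illustrative)
K12_lr = (U[:, :r] * s[:r]) @ Vt[:r, :]                    # rank-r factorisation

print("off-diagonal relative error:",
      np.linalg.norm(K12 - K12_lr) / np.linalg.norm(K12))
print("dense entries:", K12.size,
      "vs low-rank entries:", U[:, :r].size + r + Vt[:r, :].size)
```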

Kronecker Methods: these exploit Kronecker products of covariance matrices in order to speed up the computational bottleneck.
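The sketch below only verifies the structural fact these methods rely on: for inputs on a Cartesian grid with a product kernel, the full covariance matrix equals the Kronecker product of small per-dimension covariance matrices, so the big matrix never needs to be built or inverted directly. The grid sizes and kernel are arbitrary assumptions for the example.

```python
import numpy as np

def rbf_kernel_1d(x, lengthscale=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d ** 2 / lengthscale ** 2)

# Inputs on a Cartesian grid: every combination of an x1 value and an x2 value.
x1 = np.linspace(0, 1, 20)
x2 = np.linspace(0, 1, 30)
grid = np.array([[a, b] for a in x1 for b in x2])      # 600 x 2 inputs

# A product kernel over the grid factorises as a Kronecker product of the
# per-dimension covariances: K = K1 (x) K2, i.e. 20x20 and 30x30 instead of 600x600.
K1 = rbf_kernel_1d(x1)
K2 = rbf_kernel_1d(x2)
K_kron = np.kron(K1, K2)

# Check against the directly built 600 x 600 covariance matrix.
sq = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
K_direct = np.exp(-0.5 * sq)
print("max abs difference:", np.abs(K_kron - K_direct).max())
```

In practice the per-dimension matrices are then eigendecomposed separately, so solves against $K + \sigma^2 I$ cost far less than the naive $\mathcal{O}(n^3)$.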

Bayesian Committee Machines: this involves splitting your data into subsets and modeling each one with a GP. You then combine the predictions using the optimal Bayesian combination of the outputs. This is quite easy to implement yourself and is fast, but it somewhat breaks your kernel, if you care about that. Mark Deisenroth's paper should be easy enough to follow here.
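Below is a rough NumPy sketch of the committee idea under simple assumptions (RBF kernel, fixed noise level, random equal-size subsets): fit an exact GP to each subset and combine the experts' predictive distributions by precision weighting, with the BCM correction that subtracts the prior precision counted $M-1$ times too many. It is meant to show the combination rule, not to be a faithful reimplementation of the paper.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_predict(X, y, Xs, noise=0.1):
    """Exact GP regression: posterior mean and variance of f at test points Xs."""
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(len(X))
    Ks = rbf_kernel(Xs, X)
    Kss = rbf_kernel(Xs, Xs)
    mean = Ks @ np.linalg.solve(K, y)
    var = np.diag(Kss) - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(3000, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(3000)
Xs = np.linspace(-3, 3, 50)[:, None]

# Split the data into M subsets and train one "expert" GP on each.
M = 6
subsets = np.array_split(rng.permutation(len(X)), M)
means, variances = zip(*(gp_predict(X[idx], y[idx], Xs) for idx in subsets))

# BCM combination: precision-weighted sum of the experts, correcting for the
# prior precision that would otherwise be counted M times instead of once.
prior_var = np.diag(rbf_kernel(Xs, Xs))            # prior variance at test points
precision = sum(1.0 / v for v in variances) - (M - 1) / prior_var
bcm_var = 1.0 / precision
bcm_mean = bcm_var * sum(m / v for m, v in zip(means, variances))
print(bcm_mean[:5], bcm_var[:5])
```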
