Solved – Learning of hyperparameters for Gaussian process

bayesian, gaussian-process, hyperparameter

Following the paper Practical Bayesian Optimization of Machine Learning Algorithms (Snoek, Larochelle, and Adams, 2012).

It's not clear to me how the hyperparameters of the Gaussian process (GP) itself (as distinct from the target hyperparameters of the method being optimized) are learned.

The paper describes it as follows (p. 4, Section 3.1):

After choosing the form of the covariance, we must also manage the hyperparameters that govern its behavior (note that these "hyperparameters" are distinct from those being subjected to the overall Bayesian optimization), as well as that of the mean function. For our problems of interest, typically we would have $D + 3$ Gaussian process hyperparameters: $D$ length scales $\theta_{1:D}$, the covariance amplitude $\theta_0$, the observation noise $\nu$, and a constant mean $m$. The most commonly advocated approach is to use a point estimate of these parameters by optimizing the marginal likelihood under the Gaussian process, $p(y \mid \{x_n\}_{n=1}^{N}, \theta, \nu, m) = \mathcal{N}(y \mid m\mathbf{1}, \Sigma_\theta + \nu I)$, where $y = [y_1, y_2, \cdots, y_N]^{T}$, and $\Sigma_\theta$ is the covariance matrix resulting from the $N$ input points under the hyperparameters $\theta$.

Can anyone elaborate, in layman's terms, on the suggested approach of using a point estimate, and explain how it is done in practice?

Best Answer

They claim to "use a point estimate of these parameters by optimizing the marginal likelihood under the Gaussian process."

Now the expression given indicates they believe the marginal likelihood has a Gaussian distribution, $\mathcal{N}(y \mid m\mathbf{1}, \Sigma_\theta + \nu I)$. At this point you can write out the marginal likelihood (usually on the log scale), since it has the fixed form implied by the Gaussian, and then optimise it with respect to the unknown parameters.
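
Concretely, using the standard multivariate Gaussian density and the notation of the quoted passage, the quantity being maximized is

$$\log p(y \mid \{x_n\}_{n=1}^{N}, \theta, \nu, m) = -\tfrac{1}{2}(y - m\mathbf{1})^{T}(\Sigma_\theta + \nu I)^{-1}(y - m\mathbf{1}) - \tfrac{1}{2}\log\bigl|\Sigma_\theta + \nu I\bigr| - \tfrac{N}{2}\log 2\pi.$$

The gradients of this expression with respect to $\theta$, $\nu$, and $m$ are available in closed form, which is why gradient-based optimisers such as L-BFGS are the usual tool.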

The reason people use a point estimate in these circumstances is almost always the speed and tractability of computing the estimate. The downside is that you lose any notion of uncertainty about the parameters: if, say, $\nu$ has a shallow maximum, you might underestimate the variability of other quantities of interest.
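To make "optimizing the marginal likelihood" concrete, here is a minimal sketch in Python. It assumes a squared-exponential (ARD) kernel and hands the negative log marginal likelihood to scipy's L-BFGS-B; the kernel choice, the log-parameterization, and the variable names (`ell`, `amp`, `nu`, `m`, mirroring the paper's $\theta_{1:D}$, $\theta_0$, $\nu$, $m$) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_factor, cho_solve

def neg_log_marginal_likelihood(params, X, y):
    """Negative log marginal likelihood of a GP with a squared-exponential
    (ARD) kernel, constant mean m, and observation noise nu.

    params packs [log length scales (D), log amplitude, log noise, mean m]
    so the optimizer can work in an unconstrained space.
    """
    N, D = X.shape
    ell = np.exp(params[:D])       # length scales theta_{1:D}
    amp = np.exp(params[D])        # covariance amplitude theta_0
    nu = np.exp(params[D + 1])     # observation noise variance
    m = params[D + 2]              # constant mean

    # Squared-exponential kernel: Sigma_theta + nu * I
    diff = (X[:, None, :] - X[None, :, :]) / ell
    K = amp * np.exp(-0.5 * np.sum(diff ** 2, axis=-1)) + nu * np.eye(N)

    # Gaussian log density via a Cholesky factorization for stability
    L, lower = cho_factor(K, lower=True)
    r = y - m
    alpha = cho_solve((L, lower), r)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * r @ alpha + 0.5 * log_det + 0.5 * N * np.log(2 * np.pi)

# Usage on toy data: X is (N, D), y is (N,)
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

D = X.shape[1]
x0 = np.zeros(D + 3)  # initial guess: log ell = log amp = log nu = 0, m = 0
res = minimize(neg_log_marginal_likelihood, x0, args=(X, y), method="L-BFGS-B")
ell_hat, amp_hat = np.exp(res.x[:D]), np.exp(res.x[D])
nu_hat, m_hat = np.exp(res.x[D + 1]), res.x[D + 2]
print(ell_hat, amp_hat, nu_hat, m_hat)  # the point estimates
```

Working with log-transformed length scales, amplitude, and noise keeps those quantities positive without constrained optimisation, and the Cholesky factorization yields both the quadratic form and the log-determinant in a numerically stable way.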
