A Gaussian Process doesn't have to perfectly interpolate between points, as that Wikipedia link shows; it all depends on the covariance function that you use.
For example, consider the GP of the form $X \sim \mathcal N(0, \Sigma_{k_t})$, where $X$ is a vector of "dependent variable" values and $\Sigma_{k_t}$ is a covariance matrix whose every element is $\Sigma_{ij} = k(t_i, t_j)$ for some kernel function $k$ and a set of points of the "independent variable" $t$.
If you specify a kernel with the property that $\operatorname{Cor}(x_i, x_j) \to 1$ as $\|t_i - t_j\| \to 0$, notice that you are enforcing continuity. Hence, if you simply use such a kernel, for example the RBF, the GP must pass through all the points, as there's no "noise" here at all.
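As a concrete sketch (the data and hyperparameters here are made up purely for illustration), a noise-free RBF GP in scikit-learn interpolates its training points exactly:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
t_train = rng.uniform(0, 10, size=(8, 1))   # the "independent variable" t
x_train = np.sin(t_train).ravel()           # the "dependent variable" x

# alpha ~ 0 means no noise on the diagonal, so Cor -> 1 as points coincide
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-10)
gp.fit(t_train, x_train)

# the posterior mean passes through every training point
print(np.allclose(gp.predict(t_train), x_train))  # True
```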
Instead, if you decide to specify a kernel that does account for noise, for example $k(t_i, t_j) = \mathrm{RBF}(t_i, t_j) + \sigma^2 \mathcal I(t_i = t_j)$ (the WhiteKernel in scikit-learn, also known as the white noise kernel), then notice that, even if the two $t$s are close, their correlation isn't 1, i.e. there's some noise here. So the function is not expected to be continuous, and the GP no longer has to pass through the observed points.
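Here's the same sketch with the WhiteKernel added (again with toy data; the noise level is arbitrary); the fit now smooths over the points rather than interpolating them:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t_train = rng.uniform(0, 10, size=(8, 1))
x_train = (np.sin(t_train) + 0.1 * rng.standard_normal((8, 1))).ravel()

# RBF + WhiteKernel: the correlation between nearby points no longer tends to 1
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp_noisy = GaussianProcessRegressor(kernel=kernel)
gp_noisy.fit(t_train, x_train)

# the posterior mean smooths over the noise instead of interpolating it
print(np.allclose(gp_noisy.predict(t_train), x_train))  # False (in general)
```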
In fact, you can interpret a GP with such a kernel as the traditional smooth RBF GP with a noise term added on top:
$$X \sim \mathcal N(0, \Sigma_{RBF} + \sigma^2 \mathcal I) \stackrel{d}{=} \mathcal N(0, \Sigma_{RBF}) + \mathcal N(0, \sigma^2 \mathcal I) \quad \Rightarrow \quad X = \bar X + \epsilon$$
... where $\bar X$ is now a continuous GP and $\epsilon \sim \mathcal N(0, \sigma^2 \mathcal I)$ is independent noise. Notice how similar this is to the linear regression equation; the only real difference is that you're replacing the mean of the linear regression (which is a parametric line) with a non-parametric GP.
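A minimal numpy sketch of this decomposition (the grid and $\sigma$ are chosen for illustration): sampling $X$ directly from $\mathcal N(0, \Sigma_{RBF} + \sigma^2 \mathcal I)$ is equivalent in distribution to drawing a smooth path $\bar X$ and adding independent white noise $\epsilon$:

```python
import numpy as np

def rbf(t_i, t_j, ell=1.0):
    """Squared-exponential (RBF) kernel on scalar inputs."""
    return np.exp(-0.5 * (t_i - t_j) ** 2 / ell ** 2)

t = np.linspace(0, 10, 50)
sigma = 0.3
Sigma_rbf = rbf(t[:, None], t[None, :])
n = len(t)
rng = np.random.default_rng(0)

# route 1: draw X directly from N(0, Sigma_RBF + sigma^2 I)
X_direct = rng.multivariate_normal(np.zeros(n), Sigma_rbf + sigma ** 2 * np.eye(n))

# route 2: draw the smooth part X_bar, then add independent white noise eps
jitter = 1e-9 * np.eye(n)  # numerical stabiliser; Sigma_RBF alone is near-singular
X_bar = rng.multivariate_normal(np.zeros(n), Sigma_rbf + jitter)
eps = sigma * rng.standard_normal(n)
X = X_bar + eps  # equal in distribution to X_direct
```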
Best Answer
I believe you mean Gaussian processes rather than Bayesian optimisation. Bayesian optimisation is the use of Gaussian processes for global optimisation. Essentially, you use the mean and variance of your posterior Gaussian process to balance the exploration-exploitation trade-off in global optimisation (i.e. you want to find the highest point without falling into local extrema).
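For a sense of the mechanics, here's a toy sketch using one common acquisition function, the upper confidence bound (the black-box objective and all hyperparameters below are made up for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(t):  # hypothetical black-box function we want to maximise
    return np.sin(3 * t) - t ** 2 + 0.7 * t

rng = np.random.default_rng(0)
t_obs = rng.uniform(-1, 2, size=(3, 1))
y_obs = objective(t_obs).ravel()
t_grid = np.linspace(-1, 2, 200).reshape(-1, 1)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
for _ in range(10):
    gp.fit(t_obs, y_obs)
    mean, std = gp.predict(t_grid, return_std=True)
    ucb = mean + 2.0 * std  # high mean = exploitation, high std = exploration
    t_next = t_grid[np.argmax(ucb)].reshape(1, 1)
    t_obs = np.vstack([t_obs, t_next])
    y_obs = np.append(y_obs, objective(t_next).ravel())

print(t_obs[np.argmax(y_obs)], y_obs.max())  # best point found
```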
Gaussian processes have been around since the 1960s, as far as I'm aware, and maybe even earlier than that. As such, they have been used and modified in different fields. In geostatistics, which at one stage was dominated by the French research community, they became 'kriging', and many scaling approximations and kernels were derived specifically for low-dimensional geospatial data.
In statistics, and later machine learning, they continued to be referred to as Gaussian processes. Unfortunately, this occasionally led to people reinventing the wheel when it came to approximate approaches and theoretical analysis.
Now it's not uncommon for people to use the terms interchangeably. For example, Tom Nickson created a Kronecker-product sparse variational approach to Gaussian processes which he called Blitz-Kriging.