The OP's Eq. $(1)$ is the slope of the regression line when we have $N$ pairs $(x_i, y_i)$ of real numbers and ask for the "best" straight line that fits these $N$ data points. In general, it is not the slope of the regression line when we have a pair of random variables $(X, Y)$ and ask for the random variable $\hat{Y} = \alpha + \beta X$ such that $E[(Y-\hat{Y})^2]$ is as small as possible. The answer to the latter question is indeed that $\beta$ must have the value $\frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)}$, as the OP states in $(2)$, but this result applies to all random variables with finite variances, not just discrete random variables. Indeed, if $X$ and $Y$ are discrete random variables taking on values $x_1, x_2, \ldots, x_M$ and $y_1,y_2,\ldots, y_N$ respectively, then the covariance $\operatorname{cov}(X,Y)$ is given by
\begin{align}\operatorname{cov}(X,Y) &= \sum_{m=1}^M \sum_{n=1}^N P(X=x_m, Y = y_n)(x_m-\bar{x})(y_n-\bar{y})\\&= \sum_{m=1}^M \sum_{n=1}^N p_{X,Y}(x_m, y_n)(x_m-\bar{x})(y_n-\bar{y})\end{align}
where $\bar{x}$ and $\bar{y}$ are the means $E[X]$ and $E[Y]$ respectively and $p_{X,Y}(x_m, y_n)$ is the joint probability mass function (joint pmf) of $(X,Y)$. This is a slightly more general version of the numerator of $(4)$ in the OP's question. As the OP correctly asserts, if $M=N$ and the joint pmf has value $\frac 1N$ for exactly $N$ points $(x_i,y_i)$, then it is indeed the case that $\operatorname{cov}(X,Y)$ is (proportional to) the numerator of $(1)$ in the OP's question.
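To make the equivalence concrete, here is a minimal numerical sketch (the data are made up): when the joint pmf puts mass $1/N$ on each observed pair, the slope $\operatorname{cov}(X,Y)/\operatorname{var}(X)$ computed from the pmf coincides with the least-squares slope computed directly from the $N$ data points.

```python
import numpy as np

# N observed pairs (x_i, y_i); the joint pmf puts mass 1/N on each pair
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
N = len(x)
p = np.full(N, 1.0 / N)          # p_{X,Y}(x_i, y_i) = 1/N

x_bar = np.sum(p * x)            # E[X]
y_bar = np.sum(p * y)            # E[Y]

cov_xy = np.sum(p * (x - x_bar) * (y - y_bar))   # cov(X, Y) from the joint pmf
var_x  = np.sum(p * (x - x_bar) ** 2)            # var(X)
beta = cov_xy / var_x                            # slope beta = cov(X,Y) / var(X)

# Same slope from the least-squares formula applied to the N data points
beta_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(beta, beta_ls)             # the two values agree
```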
The key is the assumption of additive independent identically distributed Gaussian noise $\epsilon_n$, i.e. the assumption that the observations are given by $\textbf{y} = \textbf{f} + \epsilon_{n}$ where $\epsilon_n \sim \mathcal{N}(0,\sigma_{n}^{2}I)$ is independent of $\textbf{f}$. It should be intuitively clear that if you know the noise-free value $\textbf{f}$, then you should expect the observation $\textbf{y}$ to be Gaussian, centered on $\textbf{f}$ and with the covariance matrix of the noise.
We can show this more rigorously by deriving the conditional distribution $\textbf{y|f}$. First, note the distribution of $\textbf{f}$ and of $\epsilon_n$:
$$
\textbf{f} \sim \mathcal{N}(0, K) \\
\epsilon_n \sim \mathcal{N}(0, \sigma_n^2 I).
$$
We need to know the joint distribution of $\textbf{f}$ and $\textbf{y}$, so we will calculate the covariance matrices of $\textbf{y}$ and of $\textbf{y}$ with $\textbf{f}$. Since $\textbf{y} = \textbf{f} + \epsilon_n$ we have
$$
\textbf{y} \sim \mathcal{N}(0, K + \sigma_n^2 I)
$$
by the properties of independent Gaussian distributions. Also,
$$
\mathrm{Cov}(\textbf{y}, \textbf{f}) = \mathrm{Cov}(\textbf{f} + \epsilon_n, \textbf{f}) = \mathrm{Cov}(\textbf{f}, \textbf{f}) + \mathrm{Cov}(\epsilon_n, \textbf{f}) = K + 0 = K.
$$
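Both facts are easy to verify by Monte Carlo. The sketch below uses an arbitrary squared-exponential kernel to build $K$ and an arbitrary $\sigma_n^2$ (illustrative choices only), draws many samples of $\textbf{f}$ and $\epsilon_n$, and checks the sample covariances against $K + \sigma_n^2 I$ and $K$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative prior covariance K from a squared-exponential kernel
# (kernel, lengthscale, and sigma_n^2 are arbitrary choices here)
X = np.linspace(0.0, 1.0, 5)[:, None]
K = np.exp(-((X - X.T) ** 2) / (2 * 0.2 ** 2))
sigma_n2 = 0.1
n = K.shape[0]

# Draw samples of f ~ N(0, K) and eps ~ N(0, sigma_n^2 I), form y = f + eps
n_samples = 200_000
L = np.linalg.cholesky(K + 1e-10 * np.eye(n))       # small jitter for stability
f = L @ rng.standard_normal((n, n_samples))          # columns are samples of f
eps = np.sqrt(sigma_n2) * rng.standard_normal((n, n_samples))
y = f + eps

# Sample covariances (all means are zero) approximate the analytic results
cov_yy = y @ y.T / n_samples      # should be close to K + sigma_n^2 I
cov_yf = y @ f.T / n_samples      # should be close to K
print(np.abs(cov_yy - (K + sigma_n2 * np.eye(n))).max())
print(np.abs(cov_yf - K).max())
```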
Collecting the results above into a statement of the joint distribution of $\textbf{y}$ and $\textbf{f}$:
$$
\begin{bmatrix}
\textbf{y} \\
\textbf{f}
\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}
0 \\
0
\end{bmatrix}, \begin{bmatrix}
K + \sigma_n^2 I & K \\
K & K
\end{bmatrix}
\right)
$$
where in the bottom left corner we used the fact that $K = K^T$.
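In code, assembling this joint covariance is a one-liner with `np.block`; the sketch below simply mirrors the block matrix above (nothing here is specific to any particular $K$).

```python
import numpy as np

def joint_cov(K, sigma_n2):
    """Covariance of the stacked vector [y; f] for y = f + eps, eps ~ N(0, sigma_n2 I)."""
    n = K.shape[0]
    return np.block([
        [K + sigma_n2 * np.eye(n), K],   # Cov(y, y)  Cov(y, f)
        [K,                        K],   # Cov(f, y)  Cov(f, f)
    ])
```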
Finally, we find the conditional distribution $\textbf{y} \mid \textbf{f}$ using identity (A.6) on page 200 in appendix A.2 of Rasmussen (also to be found here):
$$
\textbf{y} | \textbf{f} \sim \mathcal{N}(0 + K K^{-1}(\textbf{f} - 0), K + \sigma_n^2 I - KK^{-1}K^T) \\
\textbf{y} | \textbf{f} \sim \mathcal{N}(\textbf{f}, \sigma_n^2 I)
$$
as expected.
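We can also check this numerically. Below is a minimal sketch of the conditioning identity as a function (the squared-exponential $K$, its lengthscale, and $\sigma_n^2$ are arbitrary choices for illustration, not from the book); applied with $a = \textbf{y}$ and $b = \textbf{f}$, it recovers exactly $\mathcal{N}(\textbf{f}, \sigma_n^2 I)$.

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, b):
    """Mean and covariance of a | b when [a; b] is jointly Gaussian (identity A.6)."""
    mean = mu_a + S_ab @ np.linalg.solve(S_bb, b - mu_b)
    cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
    return mean, cov

# Illustrative K (squared-exponential kernel) and noise variance
X = np.linspace(0.0, 1.0, 5)[:, None]
K = np.exp(-((X - X.T) ** 2) / (2 * 0.2 ** 2))
sigma_n2 = 0.1
n = K.shape[0]

f_obs = np.sin(3 * X).ravel()   # some fixed value of f to condition on

# a = y, b = f:  S_aa = K + sigma_n^2 I,  S_ab = Cov(y, f) = K,  S_bb = K
mean, cov = condition_gaussian(np.zeros(n), np.zeros(n),
                               K + sigma_n2 * np.eye(n), K, K, f_obs)

print(np.allclose(mean, f_obs))                  # mean of y | f is f
print(np.allclose(cov, sigma_n2 * np.eye(n)))    # covariance of y | f is sigma_n^2 I
```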
Gaussian Process Regression and Kriging are the same method. The Wikipedia article on "Kriging" you linked to even mentions both names in the introduction.