Rewrite the numerator
\begin{equation}
\sum_i(y_i-\bar y)(x_i-\bar x)
=
\sum_i(y_i-\frac{1}{n}\sum_j y_j)(x_i-\frac{1}{n}\sum_j x_j)
\end{equation}
that is
\begin{equation}
\sum_i(y_i-\bar y)(x_i-\bar x)
=
\sum_i\Bigl(\frac{1}{n}\sum_j (y_i-y_j)\Bigr)\Bigl(\frac{1}{n}\sum_j (x_i-x_j)\Bigr).
\end{equation}
The product of the two inner sums is itself a double sum, so it is easier to expand the symmetric pairwise sum directly:
\begin{equation}
\sum_{i,\,j}(y_i-y_j)(x_i-x_j)
=
2n\sum_i x_i y_i-2\Bigl(\sum_i x_i\Bigr)\Bigl(\sum_j y_j\Bigr)
=
2n\sum_i(y_i-\bar y)(x_i-\bar x),
\end{equation}
or
\begin{equation}
\sum_i(y_i-\bar y)(x_i-\bar x)
=
\frac{1}{2n}
\sum_{i,\,j}(y_i-y_j)(x_i-x_j)
\end{equation}
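As a quick numerical sanity check of the $\frac{1}{2n}$ factor, here is a minimal sketch (assuming NumPy; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7
x = rng.normal(size=n)
y = rng.normal(size=n)

# Left side: sum of centered cross-products.
lhs = np.sum((y - y.mean()) * (x - x.mean()))

# Right side: 1/(2n) times the double sum over all pairs (i, j).
rhs = np.sum((y[:, None] - y[None, :]) * (x[:, None] - x[None, :])) / (2 * n)

assert np.isclose(lhs, rhs)
```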
Rewrite the denominator
\begin{equation}
\sum_i(x_i-\bar x)^2
=
\sum_i(x_i-\bar x)(x_i-\bar x)
=
\sum_i(x_i-\frac{1}{n}\sum_j x_j )(x_i-\frac{1}{n}\sum_j x_j )
\end{equation}
that is, by the same expansion with $y_i$ replaced by $x_i$,
\begin{equation}
\sum_i(x_i-\bar x)^2
=
\frac{1}{2n}
\sum_{i,\,j}(x_i-x_j)^2
\end{equation}
Substituting both identities into the ratio, the factors of $\frac{1}{2n}$ cancel:
\begin{equation}
\frac{\sum_i(y_i-\bar y)(x_i-\bar x)}{\sum_i(x_i-\bar x)^2}
=
\frac{\frac{1}{2n}
\sum_{i,\,j}(y_i-y_j)(x_i-x_j)}{\frac{1}{2n}
\sum_{i,\,j}(x_i-x_j)^2}
=
\frac{\sum_{i,\,j}(y_i-y_j)(x_i-x_j)}{\sum_{i,\,j}(x_i-x_j)^2}
\end{equation}
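The cancellation can likewise be checked numerically; this sketch (again assuming NumPy) compares the classical slope estimate with the pairwise form:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.5 * x + rng.normal(size=n)

# Classical form: centered cross-products over centered squares.
slope_classical = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

# Pairwise form: double sums over all (i, j); the 1/(2n) factors cancel.
dy = y[:, None] - y[None, :]
dx = x[:, None] - x[None, :]
slope_pairwise = np.sum(dy * dx) / np.sum(dx ** 2)

assert np.isclose(slope_classical, slope_pairwise)
```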
In what follows we use matrix-vector products, i.e. column vectors multiplied on the left by matrices (a convention that differs from the reference).
Let $S = \{ (x_{i}, y_{i})\}_{i=1}^{n}$ denote the data, where $x_{i} \in \mathbb{R}^{p}$ and $y_{i} \in \mathbb{R}$ for $i=1,\dots,n.$ In the first case, we consider a mean-centered linear model:
$$y_{i} = a_{0} + \sum_{j=1}^{p} a_{j} (x_{ij} - \bar{x}_{j}) + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n,$$
where $\bar{x}_{j} = \frac{\sum_{i=1}^{n} x_{ij}}{n}$ is the center of mass for the $j$th feature.
Now, we express it in matrix notation (for convenience). Namely, let
$$X-\bar{X} =
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{np} \\
\end{pmatrix}
-
\begin{pmatrix}
0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\
0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\
\vdots & \vdots & \ddots & \vdots \\
0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\
\end{pmatrix},
$$
$\epsilon =
\begin{pmatrix}
\epsilon_{1} & \epsilon_{2} & \cdots & \epsilon_{n}
\end{pmatrix}^{t},
$
and
$Y =
\begin{pmatrix}
y_{1} & y_{2} & \cdots & y_{n}
\end{pmatrix}^{t}.
$
Then, we have $Y = (X - \bar{X})a + \epsilon,$ where $a =
\begin{pmatrix}
a_{0} & a_{1} & \cdots & a_{p}
\end{pmatrix}^{t},
$
and the right hand side expands to $Xa - \bar{X}a + \epsilon.$
In the second case, we consider a linear model:
$$y_{i} = b_{0} + \sum_{j=1}^{p} b_{j} x_{ij} + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n.$$
Subsequently, we express it in matrix notation such that $Y = Xb + \epsilon,$ where $X$ and $\epsilon$ are as above and $b$ must be deduced. To find the elements of $b,$ we rearrange the right hand side of the mean-centered linear model to obtain:
$$y_{i} = a_{0} - \sum_{j=1}^{p} a_{j} \bar{x}_{j} + \sum_{j=1}^{p}a_{j}x_{ij} + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n.$$
This suggests $b_{0} = a_{0} - \sum_{j=1}^{p} a_{j} \bar{x}_{j}$ and $b_{j} = a_{j}$ for $j=1,\dots,p,$ so we have found all of the elements of $b.$ Hence the two linear models are identical, and the estimates can be derived from either parameterization.
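A brief numerical sketch of this equivalence (assuming NumPy; the coefficients and sizes are illustrative): fitting both parameterizations by least squares gives $b_{j} = a_{j}$ for $j \geq 1$ and $b_{0} = a_{0} - \sum_{j} a_{j}\bar{x}_{j}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
feats = rng.normal(size=(n, p))
y = 1.0 + feats @ np.array([0.5, -2.0, 3.0]) + 0.1 * rng.normal(size=n)

ones = np.ones((n, 1))
xbar = feats.mean(axis=0)

# Uncentered design [1 | X] and mean-centered design [1 | X - xbar].
X_unc = np.hstack([ones, feats])
X_cen = np.hstack([ones, feats - xbar])

b = np.linalg.lstsq(X_unc, y, rcond=None)[0]  # (b_0, b_1, ..., b_p)
a = np.linalg.lstsq(X_cen, y, rcond=None)[0]  # (a_0, a_1, ..., a_p)

# Slopes agree; intercepts differ by the centering offset.
assert np.allclose(b[1:], a[1:])
assert np.isclose(b[0], a[0] - a[1:] @ xbar)
```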
A note of caution: in (1.3) of the reference, they do not apply the inverse, whereas your expression contains it. You have implicitly assumed that $X^{t}X$ is invertible, which is equivalent to the columns of $X$ being linearly independent.
Best Answer
If we define the following vector/matrix variables
$$\eqalign{
r &= X\beta - y \qquad&({\rm residual/error\,vector}) \\
R &= {\rm Diag}(r) \qquad&({\rm as\,a\,diagonal\,matrix}) \\
W &= {\rm Diag}(w) \qquad&({\rm weights\,as\,a\,matrix}) \\
}$$
then the objective function can be written without explicit summations as
$$\phi = W:R^2,$$
where the colon is a convenient product notation for the trace, i.e.
$$A:B = {\rm Tr}(A^TB) = {\rm Tr}(B^TA) = B:A.$$
Calculate the gradient of the function:
$$\eqalign{
d\phi &= W:(2R\,dR) \\
&= 2WR:dR \\
&= 2\,{\rm diag}(WR):{\rm diag}(dR) \\
&= 2Wr:dr \\
&= 2W(X\beta - y):(X\,d\beta) \\
&= 2X^TW(X\beta - y):d\beta \\
\frac{\partial\phi}{\partial\beta} &= 2X^TW(X\beta - y) \\
}$$
Set the gradient to zero and solve for the optimal $\beta$:
$$\eqalign{
X^TWX\beta &= X^TWy \\
\beta &= (X^TWX)^{-1}X^TWy \\
}$$
This confirms the Wikipedia result.
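As a final sanity check (a sketch assuming NumPy; data and weights are arbitrary), the closed-form solution does zero the gradient $2X^TW(X\beta-y)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)  # positive weights
W = np.diag(w)

# Normal equations: X^T W X beta = X^T W y.
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# The gradient 2 X^T W (X beta - y) vanishes at the optimum.
grad = 2 * X.T @ W @ (X @ beta - y)
assert np.allclose(grad, 0)
```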