We can prove this for the more general case of $p$ variables by using the "hat matrix" and some of its useful properties. These results are usually much harder to state in non-matrix terms because of the use of the spectral decomposition.
Now, in the matrix version of least squares, the hat matrix is $H=X(X^TX)^{-1}X^T$, where $X$ has $n$ rows and $p+1$ columns (a column of ones for $\beta_0$). Assume full column rank for convenience; otherwise you could replace $p+1$ by the column rank of $X$ in the following. We can write the fitted values as $\hat{Y}_i=\sum_{j=1}^nH_{ij}Y_j$, or in matrix notation $\hat{Y}=HY$. Using this, we can write the sum of squares as:
$$\frac{\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{\sigma^2}=\frac{(Y-\hat{Y})^T(Y-\hat{Y})}{\sigma^2}=\frac{(Y-HY)^T(Y-HY)}{\sigma^2}$$
$$=\frac{Y^T(I_n-H)Y}{\sigma^2}$$
Where $I_n$ is the identity matrix of order $n$. The last step follows from the fact that $H$ is a symmetric, idempotent matrix, as $$H^2=[X(X^TX)^{-1}X^T][X(X^TX)^{-1}X^T]=X(X^TX)^{-1}X^T=H=HH^T=H^TH$$
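As a quick numerical sanity check (my addition, not part of the original derivation), the hat matrix, the fitted values $\hat{Y}=HY$, and the symmetry/idempotence properties are easy to verify with NumPy on a random full-rank design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix: a column of ones for beta_0 plus p random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Fitted values via H agree with explicit least squares
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(H @ Y, X @ beta_hat)

# H is symmetric and idempotent, hence so is I_n - H
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
M = np.eye(n) - H
assert np.allclose(M @ M, M)
```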
Now a neat property of idempotent matrices is that all of their eigenvalues must be equal to zero or one. Letting $e$ denote a normalised eigenvector of $H$ with eigenvalue $l$, we can prove this as follows:
$$He=le\implies H(He)=H(le)$$
$$LHS=H^2e=He=le\;\;\; RHS=lHe=l^2e$$
$$\implies le=l^2e\implies l=0\text{ or }1$$
(Note that $e$ cannot be zero, as it must satisfy $e^Te=1$.) Now, because $H$ is idempotent, $I_n-H$ is also idempotent, because
$$(I_n-H)(I_n-H)=I-IH-HI+H^2=I_n-H$$
We also have the property that the sum of the eigenvalues equals the trace of the matrix, and
$$tr(I_n-H)=tr(I_n)-tr(H)=n-tr(X(X^TX)^{-1}X^T)=n-tr((X^TX)^{-1}X^TX)$$
$$=n-tr(I_{p+1})=n-p-1$$
Hence $I-H$ must have $n-p-1$ eigenvalues equal to $1$ and $p+1$ eigenvalues equal to $0$.
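The trace and eigenvalue bookkeeping above can also be confirmed numerically; here is a small sketch of my own, assuming NumPy and a random full-rank design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H  # I_n - H

# trace(I_n - H) = n - (p + 1)
assert np.isclose(np.trace(M), n - p - 1)

# M is symmetric and idempotent: p+1 eigenvalues near 0, n-p-1 near 1
eig = np.sort(np.linalg.eigvalsh(M))
assert np.allclose(eig[: p + 1], 0, atol=1e-8)
assert np.allclose(eig[p + 1 :], 1)
```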
Now we can use the spectral decomposition $I-H=ADA^T$, where $D=\begin{pmatrix}I_{n-p-1} & 0_{[n-p-1]\times[p+1]}\\0_{[p+1]\times [n-p-1]} & 0_{[p+1]\times [p+1]}\end{pmatrix}$ and $A$ is orthogonal (because $I-H$ is symmetric). A further useful property is that $HX=X$. This helps narrow down the $A$ matrix:
$$HX=X\implies(I-H)X=0\implies ADA^TX=0\implies DA^TX=0$$
(the last implication follows by premultiplying by $A^T$, since $A$ is orthogonal)
$$\implies (A^TX)_{ij}=0\;\;\;i=1,\dots,n-p-1\;\;\; j=1,\dots,p+1$$
and we get:
$$\frac{\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{\sigma^2}=\frac{Y^TADA^TY}{\sigma^2}=\frac{\sum_{i=1}^{n-p-1}(A^TY)_i^2}{\sigma^2}$$
Now, under the model we have $Y\sim N(X\beta,\sigma^2I)$, and using standard normal theory we have $A^TY\sim N(A^TX\beta,\sigma^2A^TA)= N(A^TX\beta,\sigma^2I)$, showing that the components of $A^TY$ are independent. Now, using the result above that $(A^TX)_{ij}=0$ for $i=1,\dots,n-p-1$, we have $(A^TY)_i\sim N(0,\sigma^2)$ for $i=1,\dots,n-p-1$. The chi-square distribution with $n-p-1$ degrees of freedom for the sum of squared errors follows immediately.
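To illustrate the conclusion, a Monte Carlo sketch (my own addition, not from the answer) can confirm that the scaled residual sum of squares behaves like a $\chi^2_{n-p-1}$ variable, which has mean $n-p-1$ and variance $2(n-p-1)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 25, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -2.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Simulate Y ~ N(X beta, sigma^2 I) and record sum((Y_i - Yhat_i)^2) / sigma^2
reps = 20000
stats = np.empty(reps)
for r in range(reps):
    Y = X @ beta + sigma * rng.normal(size=n)
    resid = Y - H @ Y
    stats[r] = resid @ resid / sigma**2

df = n - p - 1  # = 22 here
assert abs(stats.mean() - df) < 0.5
assert abs(stats.var() - 2 * df) < 5
```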
The derivation in matrix notation
Starting from $y= Xb +\epsilon$, which is really just the same as
$\begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{N}
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1K} \\
x_{21} & x_{22} & \cdots & x_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{NK}
\end{bmatrix}
\begin{bmatrix}
b_{1} \\
b_{2} \\
\vdots \\
b_{K}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_{1} \\
\epsilon_{2} \\
\vdots \\
\epsilon_{N}
\end{bmatrix} $
it all comes down to minimizing $e'e$, where $e = y - Xb$ is the vector of residuals:
$e'e = \begin{bmatrix}
e_{1} & e_{2} & \cdots & e_{N} \\
\end{bmatrix}
\begin{bmatrix}
e_{1} \\
e_{2} \\
\vdots \\
e_{N}
\end{bmatrix} = \sum_{i=1}^{N}e_{i}^{2}
$
So minimizing $e'e$ gives us:
$\min_{b}\; e'e = (y-Xb)'(y-Xb)$
$e'e = y'y - 2b'X'y + b'X'Xb$
$\frac{\partial(e'e)}{\partial b} = -2X'y + 2X'Xb \stackrel{!}{=} 0$
$X'Xb=X'y$
$b=(X'X)^{-1}X'y$
One last mathematical point: the second-order condition for a minimum requires that the matrix $X'X$ be positive definite. This requirement is fulfilled when $X$ has full column rank.
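The closed-form solution and the positive-definiteness condition are straightforward to check against a library solver; a minimal sketch of my own, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 100, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

# b = (X'X)^{-1} X'y, computed via solve() rather than an explicit inverse
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b, b_lstsq)

# Second-order condition: X'X is positive definite when X has full column rank
assert np.all(np.linalg.eigvalsh(X.T @ X) > 0)
```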
A more detailed derivation, which goes through all the steps in greater depth, can be found at http://economictheoryblog.com/2015/02/19/ols_estimator/
Because the regression line fit by ordinary least squares will necessarily go through the mean of your data (i.e., $(\bar x, \bar y)$)—at least as long as you don't suppress the intercept—uncertainty about the true value of the slope has no effect on the vertical position of the line at the mean of $x$ (i.e., at $\hat y_{\bar x}$). This means there is less vertical uncertainty at $\bar x$ than at points further away from $\bar x$. If $x=0$ coincides with $\bar x$ (that is, if your $x$ data are centered), then your uncertainty about the true value of $\beta_0$ is minimized. In mathematical terms, this translates into the smallest possible value of the standard error for $\hat\beta_0$.
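For simple regression, $\operatorname{Var}(\hat\beta_0)=\sigma^2\left(\frac1n+\frac{\bar x^2}{\sum_i (x_i-\bar x)^2}\right)$, so the standard error is smallest when $\bar x = 0$. A quick numerical sketch of my own (in NumPy rather than R, with a known $\sigma$ for simplicity) verifying this:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 50, 1.0

def se_intercept(x):
    """Standard error of the OLS intercept: sqrt of the (0,0) entry of
    sigma^2 (X'X)^{-1} for the design [1, x]."""
    X = np.column_stack([np.ones_like(x), x])
    cov = sigma**2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[0, 0])

x = rng.normal(size=n)
x_centered = x - x.mean()        # mean exactly 0
x_shifted = x_centered + 10.0    # same spread, mean shifted to 10

# Centering the x data minimizes the SE of the intercept estimate
assert se_intercept(x_centered) < se_intercept(x_shifted)
```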
Here is a quick example in R:

This figure is a bit busy, but you can see the data from several different studies where the distribution of $x$ was closer to or further from $0$. The slopes differ a little from study to study, but are largely similar. (Notice they all go through the circled X that I used to mark $(\bar x, \bar y)$.) Nonetheless, the uncertainty about the true value of those slopes causes the uncertainty about $\hat y$ to expand the further you get from $\bar x$, meaning that $SE(\hat\beta_0)$ is very wide for the data that were sampled in the neighborhood of $x=10$, and very narrow for the study in which the data were sampled near $x=0$.
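The widening uncertainty away from $\bar x$ follows from $\operatorname{Var}(\hat y_{x_0}) = v'\,\operatorname{Cov}(\hat\beta)\,v$ with $v=(1, x_0)'$; here is a small numerical sketch of my own (assuming NumPy, with a known $\sigma$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma = 50, 1.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
cov = sigma**2 * np.linalg.inv(X.T @ X)  # Cov(beta_hat)

def se_fit(x0):
    """SE of the fitted mean at x0: sqrt(v' Cov(beta_hat) v), v = (1, x0)."""
    v = np.array([1.0, x0])
    return float(np.sqrt(v @ cov @ v))

xbar = x.mean()
# Uncertainty about y-hat grows the further x0 is from x-bar
assert se_fit(xbar) < se_fit(xbar + 5.0) < se_fit(xbar + 10.0)
```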
Edit in response to comment: Unfortunately, centering your data after you have them will not help you if you want to know the likely $y$ value at some $x$ value $x_\text{new}$. Instead, you need to center your data collection on the point you care about in the first place. To understand these issues more fully, it may help you to read my answer here: Linear regression prediction interval.