Actually, $\hat{\beta}_0$ isn't constant: it depends on the $y_i$ (which follow a normal distribution). Note that in most cases this variance would likely be computed for a prediction interval, since you're working with a new $x$-value.
Using properties of variances and covariances,
$$\begin{align}
\newcommand{\Var}[1]{\text{Var}\left(#1\right)}\newcommand{\Cov}[2]{\text{Cov}\left(#1, #2\right)}\Var{y_0} &= \Var{\hat{\beta}_0}+\Var{\hat{\beta}_1x_0} + 2\Cov{\hat{\beta}_0}{\hat{\beta}_1x_0} \\
&= \Var{\hat{\beta}_0}+x_0^2\Var{\hat{\beta}_1}+2x_0\Cov{\hat{\beta}_0}{\hat{\beta}_1}
\end{align}$$
Now recall that the $y_i$ (not the predicted values) are based on
$$y_i = \beta_0+\beta_1x_i+\epsilon_i\text{, } \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$$
You can derive $\Var{\hat{\beta}_0}$ and $\Var{\hat{\beta}_1}$ using the above; both derivations have been worked out elsewhere.
I haven't been able to find a derivation of the covariance. One way you could do this is by using
$$\Cov{\hat{\beta}_0}{\hat{\beta}_1} = \Cov{\bar{y}-\hat{\beta}_1\bar{x}}{\hat{\beta}_1} = \Cov{\bar{y}}{\hat{\beta}_1} - \bar{x}\Var{\hat{\beta}_1}$$
and writing $\hat{\beta}_1$ as a linear combination of the $y_i$ might be helpful.
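Following that hint, the covariance piece at least falls out fairly quickly. A sketch, writing $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$ and using the standard form $\hat{\beta}_1 = \sum_{i=1}^n (x_i - \bar{x})y_i / S_{xx}$:
$$\Cov{\bar{y}}{\hat{\beta}_1} = \frac{1}{n S_{xx}}\sum_{i=1}^n\sum_{j=1}^n (x_j - \bar{x})\Cov{y_i}{y_j} = \frac{\sigma^2}{n S_{xx}}\sum_{i=1}^n (x_i - \bar{x}) = 0,$$
since the $y_i$ are independent with common variance $\sigma^2$. Hence $\Cov{\hat{\beta}_0}{\hat{\beta}_1} = -\bar{x}\Var{\hat{\beta}_1} = -\bar{x}\sigma^2/S_{xx}$.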
The full derivation, plugging in all of the variance and covariance pieces, is very, very tedious and long, so I wouldn't expect to see this on an exam.
Unfortunately, it's been a long time since I've done these derivations, and I'm used to doing them using matrices (which, quite frankly, is a lot cleaner).
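For reference, if you plug in the standard results $\Var{\hat{\beta}_0} = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$ and $\Var{\hat{\beta}_1} = \frac{\sigma^2}{S_{xx}}$ together with the covariance above, the pieces do collapse nicely:
$$\Var{y_0} = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) + \frac{x_0^2\sigma^2}{S_{xx}} - \frac{2x_0\bar{x}\sigma^2}{S_{xx}} = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right).$$
(For a prediction interval at a new $x_0$ you would add one more $\sigma^2$, for the error of the new observation.)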
What follows is a complete explanation, with more detail than might strictly be necessary, in the interest of a full understanding.
In the linear regression model $$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$ the only random variable on the right-hand side of this equation is $$\epsilon_i \sim \operatorname{Normal}(0, \sigma^2).$$ Everything else is either a parameter ($\beta_0$, $\beta_1)$ or a covariate ($x_i$). The left-hand side $y_i$ is therefore a random variable, whose randomness is attributed to the error term. As the errors are independent, so are the responses.
Note there is no parameter estimation mentioned here. $\beta_0$ and $\beta_1$ represent the true parameters for the model, in the sense that if you were to make numerous observations of the response for a given value of $x_i$, you would find that these would be normally distributed with mean $\mu_i = \beta_0 + \beta_1 x_i$ and variance $\sigma^2$.
The best way to understand the random variable $$\sum_{i=1}^n (y_i - \bar y)^2$$ is to first ask, if a single $y_i$ has expectation $\mu_i$, then what is the expectation of the sample mean $\bar y$? This follows easily from the linearity of expectation: $$\operatorname{E}[\bar y] = \frac{1}{n} \sum_{i=1}^n \operatorname{E}[y_i] = \frac{1}{n} \sum_{i=1}^n (\beta_0 + \beta_1 x_i) = \beta_0 + \beta_1 \bar x,$$ which is simply the mean response at the mean value of the covariate. Let's call this value $\mu$.
Now it is easy to partition the sum of squares, using the fact that $\sum_{i=1}^n (y_i - \mu) = n(\bar y - \mu)$:
$$\begin{align*}
\sum_{i=1}^n (y_i - \bar y)^2 &= \sum_{i=1}^n (y_i - \mu + \mu - \bar y)^2 \\
&= \sum_{i=1}^n \left( (y_i - \mu)^2 + 2(y_i - \mu)(\mu - \bar y) + (\mu - \bar y)^2 \right) \\
&= \sum_{i=1}^n (y_i - \mu)^2 + 2 (\mu - \bar y) \sum_{i=1}^n (y_i - \mu) + n(\mu - \bar y)^2 \\
&= \sum_{i=1}^n (y_i - \mu)^2 + 2 (\mu - \bar y)(n \bar y - n \mu) + n(\mu - \bar y)^2 \\
&= \sum_{i=1}^n (y_i - \mu)^2 - 2n(\mu - \bar y)^2 + n(\mu - \bar y)^2 \\
&= \sum_{i=1}^n (y_i - \mu)^2 - n(\mu - \bar y)^2.
\end{align*}$$
The expected value $\operatorname{E}[(\bar y - \mu)^2] = \operatorname{Var}[\bar y]$ by definition, and by the independence of responses, $$\operatorname{Var}[\bar y] \overset{\text{ind}}{=} \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}[y_i] = \frac{\sigma^2}{n}.$$ So all that remains is to compute the expectation of the first term. But
$$\operatorname{E}\left[\sum_{i=1}^n (y_i - \mu)^2\right] = \sum_{i=1}^n \operatorname{E}[(y_i - \mu)^2],$$ and since
$$\begin{align*}\operatorname{E}[(y_i - \mu)^2]
&= \operatorname{E}[(y_i - \mu_i + \mu_i - \mu)^2] \\
&= \operatorname{E}[(y_i - \mu_i)^2] + 2(\mu_i - \mu) \operatorname{E}[y_i - \mu_i] + (\mu_i - \mu)^2 \\
&= \operatorname{Var}[y_i] + (\mu_i - \mu)^2 \\
&= \sigma^2 + (\mu_i - \mu)^2,
\end{align*}$$
putting everything together, we obtain
$$\begin{align*}
\operatorname{E}\left[\sum_{i=1}^n (y_i - \bar y)^2\right]
&= n \sigma^2 + \sum_{i=1}^n (\mu_i - \mu)^2 - \sigma^2 \\
&= (n-1)\sigma^2 + \sum_{i=1}^n (\beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 \bar x))^2 \\
&= (n-1)\sigma^2 + \beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2
\end{align*}$$
as claimed.
Throughout our discussion, the only random variables here have been $\epsilon_i$ and any functions of these, such as $y_i$ and $\bar y$. The quantities $\mu_i$ and $\mu$ are not random, being functions of the parameters and covariate. It is important to keep this in mind.
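If you want a numerical sanity check of this result, here is a minimal simulation sketch (the values of $n$, $\beta_0$, $\beta_1$, $\sigma$ and the $x_i$ below are made up purely for illustration) comparing the Monte Carlo average of $\sum_{i=1}^n (y_i - \bar y)^2$ against $(n-1)\sigma^2 + \beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) model settings
n, beta0, beta1, sigma = 20, 2.0, 0.5, 1.5
x = np.linspace(0.0, 10.0, n)

# Simulate many replicates of y_i = beta0 + beta1*x_i + eps_i, eps_i ~ N(0, sigma^2)
reps = 200_000
eps = rng.normal(0.0, sigma, size=(reps, n))
y = beta0 + beta1 * x + eps

# Monte Carlo estimate of E[sum_i (y_i - ybar)^2]
ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
print("simulated:  ", ss.mean())

# Theoretical value: (n - 1)*sigma^2 + beta1^2 * sum_i (x_i - xbar)^2
print("theoretical:", (n - 1) * sigma**2 + beta1**2 * ((x - x.mean()) ** 2).sum())
```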
Best Answer
In what follows, we work with matrix-vector products, which means we multiply column vectors on the left by matrices (thus differing from the reference).
Let $S = \{ (x_{i}, y_{i})\}_{i=1}^{n}$ denote the collection of data, where $x_{i} \in \mathbb{R}^{p}$ and $y_{i} \in \mathbb{R}$ for $i=1,\dots,n.$ In the first case, we consider a mean-centered linear model: $$y_{i} = a_{0} + \sum_{j=1}^{p} a_{j} (x_{ij} - \bar{x}_{j}) + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n,$$ where $\bar{x}_{j} = \frac{\sum_{i=1}^{n} x_{ij}}{n}$ is the center of mass for the $j$th feature.
Now, we express it in matrix notation (for convenience). Namely, let $$X-\bar{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \\ \end{pmatrix} - \begin{pmatrix} 0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\ 0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\ \end{pmatrix}, $$ $\epsilon = \begin{pmatrix} \epsilon_{1} & \epsilon_{2} & \cdots & \epsilon_{n} \end{pmatrix}^{t}, $ and $Y = \begin{pmatrix} y_{1} & y_{2} & \cdots & y_{n} \end{pmatrix}^{t}. $ Then, we have $Y = (X - \bar{X})a + \epsilon,$ where $a = \begin{pmatrix} a_{0} & a_{1} & \cdots & a_{p} \end{pmatrix}^{t}, $ and the right-hand side expands to $Xa - \bar{X}a + \epsilon.$
In the second case, we consider a linear model: $$y_{i} = b_{0} + \sum_{j=1}^{p} b_{j} x_{ij} + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n.$$ Subsequently, we express it in matrix notation such that $Y = Xb + \epsilon,$ where $X$ and $\epsilon$ are as above and $b$ must be deduced. To find the elements of $b,$ we rearrange the right-hand side of the mean-centered linear model to obtain: $$y_{i} = a_{0} - \sum_{j=1}^{p} a_{j} \bar{x}_{j} + \sum_{j=1}^{p}a_{j}x_{ij} + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n.$$ This suggests $b_{0} = a_{0} - \sum_{j=1}^{p} a_{j} \bar{x}_{j}$ and $b_{j} = a_{j}$ for $j=1,\dots,p,$ so we have found all of the elements in $b.$ Hence, the two linear models are equivalent: one is simply a reparametrization of the other. (From here, derive the estimates.)
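As a quick numerical illustration of that equivalence, here is a sketch with made-up data; `numpy.linalg.lstsq` stands in for whatever fitting routine you actually use:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: n observations, p features (values chosen only for illustration)
n, p = 50, 3
X_feats = rng.normal(size=(n, p))
y = 1.0 + X_feats @ np.array([0.5, -2.0, 3.0]) + rng.normal(scale=0.1, size=n)

xbar = X_feats.mean(axis=0)

# Uncentered design: column of ones followed by the raw features
X = np.column_stack([np.ones(n), X_feats])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean-centered design: column of ones followed by the centered features
Xc = np.column_stack([np.ones(n), X_feats - xbar])
a, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# Slopes agree, and b_0 = a_0 - sum_j a_j * xbar_j, as derived above
print(np.allclose(b[1:], a[1:]))               # True
print(np.allclose(b[0], a[0] - a[1:] @ xbar))  # True
```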
A note of caution: in (1.3) of the reference, they do not apply the inverse, whereas your expression contains the inverse. In your case, you have assumed that $X^{t}X$ is invertible, which is equivalent to the columns of $X$ (the intercept column together with the $p$ features) being linearly independent.
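To make the role of the inverse explicit (a sketch under that full-rank assumption): minimizing $\lVert Y - Xb \rVert^{2}$ yields the normal equations, and the inverse only appears once you solve them explicitly, $$X^{t}X\hat{b} = X^{t}Y \qquad\Longrightarrow\qquad \hat{b} = \left(X^{t}X\right)^{-1}X^{t}Y.$$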