From the least squares estimation method, we know that $$\hat{\beta}=(X'X)^{-1}X'Y$$ and that $\hat{\beta}$ is an unbiased estimator of $\beta$, i.e. $E[\hat{\beta}]=\beta$. Moreover, the linear model $$\begin{equation} Y=X\beta +u \end{equation}$$ carries the assumption that $$Y\sim N(\mu=\beta_0+\beta_1x,\sigma^2)$$ or, equivalently, that $u \sim N(0,\sigma^2)$. Based on the above we can prove all three results (simultaneously) by calculating the variance-covariance matrix of $\hat{\beta}$, which is equal to: $$Var(\hat{\beta}):=\sigma^2(\hat{\beta})=\left( \begin{array}{cc}
Var(\hat{\beta_0}) & Cov(\hat{\beta_0},\hat{\beta_1}) \\
Cov(\hat{\beta_0},\hat{\beta_1}) & Var(\hat{\beta_1}) \end{array} \right)$$ By the properties of variance we have that
$$Var(\hat{\beta})=E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'\right].$$ Since $$\hat{\beta}=(X'X)^{-1}X'Y=(X'X)^{-1}X'(X\beta +u)=(X'X)^{-1}X'X\beta +(X'X)^{-1}X'u=\beta+(X'X)^{-1}X'u,$$ we have $\hat{\beta}-\beta=(X'X)^{-1}X'u$ (which, since $E[u]=0$, also confirms $E[\hat{\beta}]=\beta$), and therefore $$\begin{align*}Var(\hat{\beta})&=E\left[(X'X)^{-1}X'uu'X(X'X)^{-1}\right]=(X'X)^{-1}X'\,E[uu']\,X(X'X)^{-1}.\end{align*}$$ But, since $E[u]=0$, we have that $E[uu']=Var(u)=\sigma^2 I$, and by substituting in the above equation we find that $$\begin{align*}Var(\hat{\beta})&=(X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1}=\sigma^2(X'X)^{-1}X'X(X'X)^{-1}=\sigma^2(X'X)^{-1}.\end{align*}$$
Now, since $$(X'X)^{-1}=\left( \begin{array}{cc}
\frac{\sum x_i^2}{n\sum (x_i-\bar{x})^2} & \frac{-\sum x_i}{n\sum (x_i-\bar{x})^2} \\
\frac{-\sum x_i}{n\sum (x_i-\bar{x})^2} & \frac{1}{\sum (x_i-\bar{x})^2} \end{array} \right)$$ (which is also known or can be easily derived algebraically) you have the result that: $$\begin{align*} Var(\hat{\beta})&=\left( \begin{array}{cc}
Var(\hat{\beta_0}) & Cov(\hat{\beta_0},\hat{\beta_1}) \\
Cov(\hat{\beta_0},\hat{\beta_1}) & Var(\hat{\beta_1}) \end{array} \right)=\sigma^2\left(X'X\right)^{-1}=\\&=\left( \begin{array}{cc}
\frac{\sigma^2 \sum x_i^2}{n\sum (x_i-\bar{x})^2} & \frac{-\sigma^2 \sum x_i}{n\sum (x_i-\bar{x})^2} \\
\frac{-\sigma^2 \sum x_i}{n\sum (x_i-\bar{x})^2} & \frac{\sigma^2}{\sum (x_i-\bar{x})^2} \end{array} \right) \end{align*}$$
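As a sanity check, here is a short numerical sketch (using numpy, with an arbitrary toy design and an assumed $\sigma^2=2$) confirming that the entries of $\sigma^2(X'X)^{-1}$ match the scalar formulas above:

```python
import numpy as np

# Toy design points and an assumed error variance sigma^2 (both arbitrary).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)
sigma2 = 2.0

X = np.column_stack([np.ones(n), x])      # design matrix with intercept column
V = sigma2 * np.linalg.inv(X.T @ X)       # matrix form of Var(beta_hat)

# Scalar closed forms for the individual entries, as derived above.
Sxx = np.sum((x - x.mean()) ** 2)
var_b0 = sigma2 * np.sum(x**2) / (n * Sxx)
cov_b0b1 = -sigma2 * np.sum(x) / (n * Sxx)
var_b1 = sigma2 / Sxx

print(np.allclose(V, [[var_b0, cov_b0b1], [cov_b0b1, var_b1]]))  # True
```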
One has
$$
\hat\beta_1 = \frac{\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}
$$
where $\bar y = (y_1+\cdots+y_n)/n$ and $\bar x = (x_1+\cdots+x_n)/n$.
This is nonlinear as a function of $x_1,\ldots,x_n$ since there is division by a function of the $x$s and there is squaring. But it is linear as a function of $y_1,\ldots,y_n$. To see that, first observe that the denominator does not depend on $y_1,\ldots,y_n$, so we need only look at the numerator.
So look at
$$
y_i-\bar y = y_i - \frac{y_1 + \cdots + y_i + \cdots + y_n}{n} = \frac{-y_1 - y_2 - \cdots+(n-1)y_i-\cdots - y_n}{n} $$
This is linear in $y_1,\ldots,y_n$. Since the quantities $x_i-\bar x$, $i=1,\ldots,n$ do not depend on $y_1,\ldots,y_n$, the expression
$$
\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x)
$$
is a linear combination of expressions each of which we just said is linear in $y_1,\ldots,y_n$. It is therefore itself a linear combination of $y_1,\ldots,y_n$.
Next, we have $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$, so $\hat\beta_0 = \bar y - \hat\beta_1\bar x$. Since $\bar y$ is a linear combination of $y_1,\ldots,y_n$ and we just got done showing that $\hat\beta_1$ is a linear combination of $y_1,\ldots,y_n$, and $\bar x$ does not depend on $y_1,\ldots,y_n$, it follows that $\hat\beta_0$ is a linear combination of $y_1,\ldots,y_n$.
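The weights in these linear combinations can be written out explicitly. A small sketch (numpy, with made-up data) checking that $\hat\beta_1=\sum_i w_i y_i$ with $w_i=(x_i-\bar x)/\sum_j(x_j-\bar x)^2$, and $\hat\beta_0=\sum_i(1/n-\bar x\,w_i)y_i$:

```python
import numpy as np

# Made-up data; the weights w and v depend only on the x's, not on y.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
n = len(x)

w = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # weights for beta1_hat
b1 = w @ y                                          # beta1_hat = sum_i w_i y_i
v = 1.0 / n - x.mean() * w                          # weights for beta0_hat
b0 = v @ y                                          # beta0_hat = ybar - b1*xbar

# Cross-check against the usual formulas.
b1_direct = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0_direct = y.mean() - b1_direct * x.mean()
print(np.isclose(b1, b1_direct), np.isclose(b0, b0_direct))  # True True
```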
MATRIX VERSION:
$$
\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
$$
Y = M\beta + \varepsilon
$$
$$
\varepsilon \sim N_n( 0_n, \sigma^2 I_n)
$$
where $0_n\in\mathbb R^{n\times 1}$ and $I_n\in\mathbb R^{n\times n}$ is the identity matrix. Consequently
$$
Y\sim N_n(M\beta,\sigma^2 I_n).
$$
One can show (and I show further down below) that
$$
\hat\beta = (M^\top M)^{-1}M^\top Y. \tag 1
$$
Therefore
$$
\hat\beta \sim N_2(\Big((M^\top M)^{-1}M^\top\Big) M\beta,\quad (M^\top M)^{-1}M^\top\Big(\sigma^2 I_n\Big)M(M^\top M)^{-1})
$$
$$
= N_2( \beta,\quad \sigma^2 (M^\top M)^{-1}).
$$
Here I have used the fact that when one multiplies a normally distributed column vector on the left by a constant (i.e. non-random) matrix, the expected value gets multiplied by the same matrix on the left and the variance gets multiplied on the left by that matrix and on the right by its transpose.
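A quick Monte Carlo sketch of this conclusion (numpy; the design, $\beta$, and $\sigma$ below are arbitrary illustrative choices): simulate many data sets from the model and compare the empirical mean and covariance of $\hat\beta$ with $\beta$ and $\sigma^2(M^\top M)^{-1}$.

```python
import numpy as np

# Arbitrary illustrative design, true beta, and sigma.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
M = np.column_stack([np.ones_like(x), x])
beta = np.array([1.0, 2.0])
sigma = 0.5

H = np.linalg.inv(M.T @ M) @ M.T                             # (M'M)^{-1}M'
Y = M @ beta + sigma * rng.standard_normal((20000, len(x)))  # 20000 data sets
draws = Y @ H.T                                              # one beta_hat per row

theory_cov = sigma**2 * np.linalg.inv(M.T @ M)
print(np.allclose(draws.mean(axis=0), beta, atol=0.05))      # mean near beta
print(np.allclose(np.cov(draws.T), theory_cov, atol=0.05))   # cov near theory
```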
So how does one prove $(1)$?
"Least squares" means the vector $\hat Y$ of fitted values is the orthogonal projection of $Y$ onto the column space of $M$. That projection is
$$
\hat Y = M(M^\top M)^{-1}M^\top Y. \tag 2
$$
To see that this is the orthogonal projection, consider two things. First, suppose $Y$ were orthogonal to the column space of $M$. Then the product $(2)$ must be $0$, since the product of the last two factors, $M^\top Y$, would be $0$. Next, suppose $Y$ is actually in the column space of $M$. Then $Y=M\gamma$ for some $\gamma\in \mathbb R^{2\times 1}$. Put $M\gamma$ into $(2)$ and simplify, and the product will be $M\gamma=Y$, so vectors in the column space are mapped to themselves.
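These two properties are easy to check numerically. A toy sketch (numpy, with an arbitrary small design) for $P = M(M^\top M)^{-1}M^\top$:

```python
import numpy as np

# Arbitrary small design with intercept and slope columns.
x = np.array([1.0, 2.0, 3.0, 4.0])
M = np.column_stack([np.ones_like(x), x])
P = M @ np.linalg.inv(M.T @ M) @ M.T

# Vectors in the column space of M are mapped to themselves: P(M gamma) = M gamma.
gamma = np.array([2.0, -1.0])
print(np.allclose(P @ (M @ gamma), M @ gamma))  # True

# v = (1, -1, -1, 1) is orthogonal to both columns of M, and P sends it to 0.
v = np.array([1.0, -1.0, -1.0, 1.0])
print(np.allclose(M.T @ v, 0), np.allclose(P @ v, 0))  # True True
```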
Now we have
$$
M\hat\beta=\hat Y = M(M^\top M)^{-1} M^\top Y. \tag 3
$$
If we could multiply both sides of $(3)$ on the left by an inverse of $M$, we'd get $(1)$. But $M$ is not a square matrix, so it has no inverse. However, $M$ has linearly independent columns and therefore has a left inverse, and that does the job. Its left inverse is
$$
(M^\top M)^{-1}M^\top.
$$
The left inverse is not unique, but this is the one that people use in this context.
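A short sketch (numpy, arbitrary small design) showing that $(M^\top M)^{-1}M^\top$ really is a left inverse of $M$, but not a right inverse:

```python
import numpy as np

# Arbitrary 3x2 design with linearly independent columns.
x = np.array([0.0, 1.0, 2.0])
M = np.column_stack([np.ones_like(x), x])
L = np.linalg.inv(M.T @ M) @ M.T

print(np.allclose(L @ M, np.eye(2)))  # True: L M = I_2, so L cancels M on the left
print(np.allclose(M @ L, np.eye(3)))  # False: M L is the 3x3 projection P, not I_3
```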
Best Answer
Some hints, but not quite the full answer:
There is a difference between a parameter $\mu$ and an estimator of that parameter. So if we call the estimator $\hat{\mu}$ then you want to minimise $$\sum_i (y_i - \hat{\mu})^2$$ which is $$\sum_i y_i^2 - \sum_i 2 y_i \hat{\mu} +\sum_i \hat{\mu} ^2$$ and (as you suggest) this will be when its derivative with respect to $\hat{\mu}$ is zero. Strictly speaking you should check this is a minimum, but since the derivative is monotone increasing that is obvious.
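The minimization can also be checked numerically. A quick sketch (numpy, made-up data) confirming that $\hat{\mu}=\bar y$ gives a smaller sum of squares than nearby values:

```python
import numpy as np

# Made-up data; sse(m) is the objective sum_i (y_i - m)^2 from above.
y = np.array([1.0, 4.0, 2.0, 7.0])
sse = lambda m: np.sum((y - m) ** 2)

mu_hat = y.mean()
# The objective is strictly larger at every other point we try.
print(all(sse(mu_hat) < sse(mu_hat + d) for d in (-1.0, -0.1, 0.1, 1.0)))  # True
```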
Since $y_i = \mu + \epsilon_i$, you know $E[y_i] = \mu + E[\epsilon_i]$ (as $\mu$ is a constant), so it will be easy to find $E[\hat{\mu}]$.
As for $Var(\hat{\mu})$, you again have to multiply out a square, looking at $$E\left[\left(\hat{\mu}-E[\hat{\mu}]\right)^2\right].$$ You might want to use the fact that $y_i^2 = \mu^2 + 2 \mu \epsilon_i +\epsilon_i^2$ implies $E[y_i^2] = \mu^2 + \sigma^2$.
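A simulation sketch of the two facts these hints point to, namely $E[\hat{\mu}]=\mu$ and $Var(\hat{\mu})=\sigma^2/n$ (numpy; the values of $\mu$, $\sigma$, and $n$ are arbitrary):

```python
import numpy as np

# Arbitrary true parameters and sample size; reps independent data sets.
rng = np.random.default_rng(1)
mu, sigma, n, reps = 3.0, 2.0, 25, 40000

samples = mu + sigma * rng.standard_normal((reps, n))
mu_hats = samples.mean(axis=1)          # one mu_hat = ybar per data set

print(abs(mu_hats.mean() - mu) < 0.05)           # empirical mean near mu
print(abs(mu_hats.var() - sigma**2 / n) < 0.02)  # empirical variance near 0.16
```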