One has
$$
\hat\beta_1 = \frac{\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}
$$
where $\bar y = (y_1+\cdots+y_n)/n$ and $\bar x = (x_1+\cdots+x_n)/n$.
This is nonlinear as a function of $x_1,\ldots,x_n$ since there is division by a function of the $x$s and there is squaring. But it is linear as a function of $y_1,\ldots,y_n$. To see that, first observe that the denominator does not depend on $y_1,\ldots,y_n$, so we need only look at the numerator.
So look at
$$
y_i-\bar y = y_i - \frac{y_1 + \cdots + y_i + \cdots + y_n}{n} = \frac{-y_1 - y_2 - \cdots + (n-1)y_i - \cdots - y_n}{n}.
$$
This is linear in $y_1,\ldots,y_n$. Since the quantities $x_i-\bar x$, $i=1,\ldots,n$ do not depend on $y_1,\ldots,y_n$, the expression
$$
\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x)
$$
is a linear combination of expressions each of which we just said is linear in $y_1,\ldots,y_n$. It is therefore itself a linear combination of $y_1,\ldots,y_n$.
Next, we have $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$, so $\hat\beta_0 = \bar y - \hat\beta_1\bar x$. Since $\bar y$ is a linear combination of $y_1,\ldots,y_n$, since we just got done showing that $\hat\beta_1$ is a linear combination of $y_1,\ldots,y_n$, and since $\bar x$ does not depend on $y_1,\ldots,y_n$, it follows that $\hat\beta_0$ is also a linear combination of $y_1,\ldots,y_n$.
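As a quick numerical sanity check of both claims, here is a minimal numpy sketch (the data values are made up purely for illustration) confirming that $\hat\beta_1$ and $\hat\beta_0$ are linear combinations of $y_1,\ldots,y_n$ with weights that depend only on the $x$'s:

```python
import numpy as np

# made-up data, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xbar, ybar = x.mean(), y.mean()

# slope and intercept from the usual formulas
beta1 = np.sum((y - ybar) * (x - xbar)) / np.sum((x - xbar) ** 2)
beta0 = ybar - beta1 * xbar

# weights that depend only on the x's
w1 = (x - xbar) / np.sum((x - xbar) ** 2)   # beta1 = w1 . y
w0 = 1.0 / n - xbar * w1                    # beta0 = w0 . y

print(np.allclose(beta1, w1 @ y))   # True: beta1 is a linear combination of the y's
print(np.allclose(beta0, w0 @ y))   # True: so is beta0
```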
MATRIX VERSION:
$$
\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
$$
Y = M\beta + \varepsilon
$$
$$
\varepsilon \sim N_n( 0_n, \sigma^2 I_n)
$$
where $0_n\in\mathbb R^{n\times 1}$ is the zero vector and $I_n\in\mathbb R^{n\times n}$ is the identity matrix. Consequently
$$
Y\sim N_n(M\beta,\sigma^2 I_n).
$$
One can show (and I do show further down) that
$$
\hat\beta = (M^\top M)^{-1}M^\top Y. \tag 1
$$
Therefore
$$
\hat\beta \sim N_2(\Big((M^\top M)^{-1}M^\top\Big) M\beta,\quad (M^\top M)^{-1}M^\top\Big(\sigma^2 I_n\Big)M(M^\top M)^{-1})
$$
$$
= N_2( \beta,\quad \sigma^2 (M^\top M)^{-1}).
$$
Here I have used the fact that when one multiplies a normally distributed column vector on the left by a constant (i.e. non-random) matrix $A$, the expected value gets multiplied on the left by $A$ and the covariance matrix gets multiplied on the left by $A$ and on the right by $A^\top$. The mean simplifies to $\beta$ because $(M^\top M)^{-1}M^\top M = I_2$.
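The stated distribution can also be checked by simulation. Below is a minimal numpy sketch (the design, $\beta$, and $\sigma$ are arbitrary choices made up for the illustration) that draws many data sets from the model and compares the empirical mean and covariance of $\hat\beta$ with $\beta$ and $\sigma^2(M^\top M)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# arbitrary design and parameters, just to illustrate the claim
n = 50
M = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])   # n x 2 design matrix
beta = np.array([1.0, 2.0])
sigma = 0.5

# theoretical covariance of beta-hat: sigma^2 (M^T M)^{-1}
cov_theory = sigma ** 2 * np.linalg.inv(M.T @ M)

# simulate many data sets and compute beta-hat for each via formula (1)
reps = 20000
betas = np.empty((reps, 2))
for r in range(reps):
    Y = M @ beta + sigma * rng.standard_normal(n)
    betas[r] = np.linalg.inv(M.T @ M) @ M.T @ Y

print(betas.mean(axis=0))   # should be close to beta = [1, 2]
print(np.cov(betas.T))      # should be close to cov_theory
print(cov_theory)
```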
So how does one prove $(1)$?
"Least squares" means the vector $\hat Y$ of fitted values is the orthogonal projection of $Y$ onto the column space of $M$. That projection is
$$
\hat Y = M(M^\top M)^{-1}M^\top Y. \tag 2
$$
To see that this is the orthogonal projection, consider two things. First, suppose $Y$ were orthogonal to the column space of $M$. Then the product $(2)$ must be $0$, since the product of the last two factors, $M^\top Y$, would be $0$. Next, suppose $Y$ is actually in the column space of $M$. Then $Y=M\gamma$ for some $\gamma\in \mathbb R^{2\times 1}$. Put $M\gamma$ into $(2)$ and simplify, and the product will be $M\gamma=Y$, so vectors in the column space are mapped to themselves.
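Both observations, along with the resulting idempotence of the projection, are easy to verify numerically for a concrete design matrix. The sketch below uses numpy with an arbitrary made-up $M$:

```python
import numpy as np

rng = np.random.default_rng(0)

# small arbitrary design matrix with linearly independent columns
M = np.column_stack([np.ones(6), np.arange(6.0)])
H = M @ np.linalg.inv(M.T @ M) @ M.T            # the projection in (2)

# 1) a vector orthogonal to the column space of M is sent to 0
v = rng.standard_normal(6)
v_perp = v - H @ v                              # remove the part lying in col(M)
print(np.allclose(H @ v_perp, 0.0))             # True

# 2) a vector already in the column space is mapped to itself
gamma = np.array([3.0, -1.0])
print(np.allclose(H @ (M @ gamma), M @ gamma))  # True

# consequence: the projection is idempotent, H @ H = H
print(np.allclose(H @ H, H))                    # True
```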
Now we have
$$
M\hat\beta=\hat Y = M(M^\top M)^{-1} M^\top Y. \tag 3
$$
If we could multiply both sides of $(3)$ on the left by an inverse of $M$, we'd get $(1)$. But $M$ is not a square matrix and so has no inverse. However, $M$ has linearly independent columns and therefore has a left inverse, and that does the job. The left inverse we use is
$$
(M^\top M)^{-1}M^\top.
$$
The left inverse is not unique, but this is the one that people use in this context.
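A short numpy check (with an arbitrary made-up $M$) makes the left-inverse property concrete:

```python
import numpy as np

# any n x 2 matrix with linearly independent columns will do
M = np.column_stack([np.ones(6), np.arange(6.0)])

left_inv = np.linalg.inv(M.T @ M) @ M.T       # the left inverse used above
print(np.allclose(left_inv @ M, np.eye(2)))   # True: (M^T M)^{-1} M^T M = I_2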
Here is another way to arrive at $(1)$, this time in the general multiple-regression setting, using partial differentiation.
Consider the assumed linear model
$$y_i = \mathbf{x}_i^{T}\boldsymbol\beta + \epsilon_i$$
where $y_i, \epsilon_i \in \mathbb{R}$ and $\mathbf{x}_i=\begin{bmatrix}
x_{i0} \\
x_{i1} \\
\vdots \\
x_{ip}
\end{bmatrix}, \boldsymbol\beta = \begin{bmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_p
\end{bmatrix} \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$, with $x_{i0} = 1$.
Our aim is to solve for $\hat{\boldsymbol\beta}$ by minimizing the residual sum of squares, or minimizing $$\text{RSS}(\boldsymbol\beta) = \sum_{i=1}^{n}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)^2\text{.}$$
To compute this sum, consider the vector of residuals
$$\mathbf{e}=\begin{bmatrix}
y_1 - \mathbf{x}_1^{T}\boldsymbol\beta \\
y_2 - \mathbf{x}_2^{T}\boldsymbol\beta \\
\vdots \\
y_n - \mathbf{x}_n^{T}\boldsymbol\beta
\end{bmatrix}$$
Then $\text{RSS}(\boldsymbol\beta) = \mathbf{e}^{T}\mathbf{e}$. Our next step is to find the partial derivatives of $\text{RSS}(\boldsymbol\beta)$ with respect to the components of $\boldsymbol\beta$.
To do this, note that for $k = 0, 1, \dots, p$,
$$\dfrac{\partial \text{RSS}}{\partial \beta_k}=\dfrac{\partial}{\partial\beta_k}\left\{\sum_{i=1}^{n}\left[y_i- \sum_{j=0}^{p}\beta_jx_{ij}\right]^2 \right\}=-2\sum_{i=1}^{n}x_{ik}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right)\text{.}$$
"Stacking" these, we obtain
$$\begin{align}
\dfrac{\partial \text{RSS}}{\partial \boldsymbol\beta}&=\begin{bmatrix}
\dfrac{\partial \text{RSS}}{\partial \beta_0} \\
\dfrac{\partial \text{RSS}}{\partial \beta_1} \\
\vdots \\
\dfrac{\partial \text{RSS}}{\partial \beta_p}
\end{bmatrix} \\
&= \begin{bmatrix}
-2\sum_{i=1}^{n}x_{i0}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \\
-2\sum_{i=1}^{n}x_{i1}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \\
\vdots \\
-2\sum_{i=1}^{n}x_{ip}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right)
\end{bmatrix} \\
&= -2\begin{bmatrix}
\sum_{i=1}^{n}x_{i0}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)\\
\sum_{i=1}^{n}x_{i1}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta) \\
\vdots \\
\sum_{i=1}^{n}x_{ip}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)
\end{bmatrix} \\
&= -2\left(\begin{bmatrix}
\sum_{i=1}^{n}x_{i0}y_i\\
\sum_{i=1}^{n}x_{i1}y_i \\
\vdots \\
\sum_{i=1}^{n}x_{ip}y_i
\end{bmatrix} - \begin{bmatrix}
\sum_{i=1}^{n}x_{i0}\mathbf{x}_i^{T}\boldsymbol\beta\\
\sum_{i=1}^{n}x_{i1}\mathbf{x}_i^{T}\boldsymbol\beta \\
\vdots \\
\sum_{i=1}^{n}x_{ip}\mathbf{x}_i^{T}\boldsymbol\beta
\end{bmatrix}\right)\\
&= -2(\mathbf{X}^{T}\mathbf{y}-\mathbf{X}^{T}\mathbf{X}\boldsymbol\beta)\text{.}
\end{align}$$
where $$\mathbf{X} = \begin{bmatrix}
\mathbf{x}_1^{T} \\
\mathbf{x}_2^{T} \\
\vdots \\
\mathbf{x}_n^{T}
\end{bmatrix}\text{.}$$
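Since $\text{RSS}$ is an ordinary scalar function of $\boldsymbol\beta$, the matrix form of the gradient can be sanity-checked against finite differences. Here is a minimal numpy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: n observations, p + 1 = 3 coefficients (including intercept)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = rng.standard_normal(n)
beta = rng.standard_normal(p + 1)     # an arbitrary point at which to differentiate

def rss(b):
    e = y - X @ b
    return e @ e

# closed-form gradient derived above: -2 (X^T y - X^T X beta)
grad_formula = -2.0 * (X.T @ y - X.T @ X @ beta)

# central finite differences, one coordinate at a time
h = 1e-6
grad_numeric = np.array([
    (rss(beta + h * np.eye(p + 1)[k]) - rss(beta - h * np.eye(p + 1)[k])) / (2 * h)
    for k in range(p + 1)
])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))   # True
```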
Setting $\dfrac{\partial \text{RSS}}{\partial \boldsymbol\beta} = \mathbf{0}$, we obtain the normal equations $$\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}^{T}\mathbf{y}$$
and, assuming $\mathbf{X}^{T}\mathbf{X}$ is invertible,
$$\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
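As a sketch of how this formula is used (the data below are made up, and in practice one would use a dedicated solver rather than inverting $\mathbf{X}^{T}\mathbf{X}$ explicitly), it can be compared with numpy's built-in least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up data with an intercept column
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

beta_formula = np.linalg.inv(X.T @ X) @ X.T @ y       # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # library least-squares solver

print(np.allclose(beta_formula, beta_lstsq))          # True
```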
In the simple-regression notation $y_i = a + bx_i + e_i$, you find $\hat{a}$ and $\hat{b}$ by looking for the minimum of the function (in $a$ and $b$)
$$\sum_{i=1}^n e_i^2=\sum_{i=1}^n(y_i-a-bx_i)^2$$
so taking partial derivatives with respect to $a$ and $b$ and setting them equal to $0$ yields:
$$0=\left[\frac{\partial}{\partial a}\sum_{i=1}^n(y_i-a-bx_i)^2\right]_{a=\hat{a},b=\hat{b}}=-2\sum_{i=1}^n(y_i-\hat{a}-\hat{b}x_i)=-2\sum_{i=1}^n \hat{e}_i$$
$$0=\left[\frac{\partial}{\partial b}\sum_{i=1}^n(y_i-a-bx_i)^2\right]_{a=\hat{a},b=\hat{b}}=-2\sum_{i=1}^n(y_i-\hat{a}-\hat{b}x_i)x_i=-2\sum_{i=1}^n \hat{e}_ix_i$$
which give exactly the desired properties $\sum_{i=1}^n \hat{e}_i = 0$ and $\sum_{i=1}^n \hat{e}_i x_i = 0$.
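Both identities are easy to confirm numerically for any data set; here is a minimal numpy sketch with made-up data (the seed and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# made-up data for a simple regression
x = rng.standard_normal(100)
y = 3.0 + 1.5 * x + rng.standard_normal(100)

# fitted slope b-hat and intercept a-hat from the usual formulas
b_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

e_hat = y - a_hat - b_hat * x   # fitted residuals

print(np.isclose(e_hat.sum(), 0.0, atol=1e-9))          # sum of residuals is 0
print(np.isclose(np.sum(e_hat * x), 0.0, atol=1e-9))    # residuals orthogonal to x
```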