Supposing the (unobservable) errors, as opposed to the observable residuals, are independent and identically distributed, normal with expectation $0$ and variance $\sigma^2$, then the least-squares estimate $\hat a$ of the slope $a$ satisfies
$$
\hat a \sim N\left(a, \frac{\sigma^2}{\sum_i (x_i-\bar x)^2} \right)
$$
where $\bar x=(x_1+\cdots+x_n)/n$.
Now let $\hat\sigma^2$ be the estimate of $\sigma^2$ that comes from dividing the sum of squares of observed residuals by the degrees of freedom, in this case $n-2$. Then
$$
(n-2) \frac{\hat\sigma^2}{\sigma^2}\sim \chi^2_{n-2},
$$
and (believe it or not!) this random variable is independent of $\hat a$.
Therefore
$$
t = \frac{\hat a - a}{\hat\sigma/\sqrt{\sum_i(x_i-\bar x)^2}}
$$
has a $t$-distribution with $n-2$ degrees of freedom.
So
$$
\hat a \pm A \frac{\hat\sigma}{\sqrt{\sum_i(x_i-\bar x)^2}}
$$
are the endpoints of a confidence interval for $a$. The number $A$ you get from one of the standard tables for the $t$ distribution, and depends on what confidence level you want.
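As a sketch of that computation in Python (the data points and the table value $A\approx 3.182$, for $n-2=3$ degrees of freedom at the 95% level, are illustrative assumptions, not part of the question):

```python
import math

# Illustrative data, assumed for this sketch
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least-squares slope and intercept
a_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / s_xx
b_hat = y_bar - a_hat * x_bar

# Estimate sigma^2 from the residuals, dividing by n - 2 degrees of freedom
ss_res = sum((yi - (a_hat * xi + b_hat)) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(ss_res / (n - 2))

# 95% critical value A for the t distribution with n - 2 = 3 degrees of
# freedom, read from a standard table
A = 3.182
half_width = A * sigma_hat / math.sqrt(s_xx)
print(a_hat - half_width, a_hat + half_width)  # endpoints of the interval
```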
Two thoughts on the updated version of the question:
- The estimates of slope and intercept are correlated with each other (except when the mean of the $x$-values is $0$). If the line is $y=ax+b$, then we'd have a confidence region for the pair $(a,b)$ that would be an ellipse whose axes are not parallel to the $x$- and $y$-axes. So it would be somewhat misleading to express them separately as $\hat a\pm\text{something}$ and $\hat b\pm\text{something}$. However, if the line is expressed in point-slope form, as $y-\bar y=a(x-\bar x)$, where $\bar y$ and $\bar x$ are the mean values, then $\bar y$ would be uncorrelated with $\hat a$.
- All this assumes the $y$s are random variables and the $x$s are fixed. That can make some sense if all results are regarded as conditional expected values, given the $x$ values. It's pretty conventional to do things that way, but possibly in some situations it's not appropriate.
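The first point can be checked directly: with fixed design points, the covariance matrix of $(\hat a,\hat b)$ is $\sigma^2(X^TX)^{-1}$, and the off-diagonal entry of $(X^TX)^{-1}$ is proportional to $-\sum_i x_i$, so it vanishes exactly when the $x$-values are centered. A minimal sketch (the design points are an arbitrary illustration):

```python
# Columns of the design matrix X are (x_i, 1), so X^T X = [[sum x^2, sum x],
# [sum x, n]]; Cov(a_hat, b_hat) is proportional to the off-diagonal of its
# inverse.
def xtx_inverse(x):
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sx = sum(x)
    det = n * sxx - sx * sx
    return [[n / det, -sx / det], [-sx / det, sxx / det]]

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # arbitrary design points
x_bar = sum(x) / len(x)
centered = [xi - x_bar for xi in x]

off_diag_raw = xtx_inverse(x)[0][1]             # nonzero: correlated
off_diag_centered = xtx_inverse(centered)[0][1]  # zero: uncorrelated
print(off_diag_raw, off_diag_centered)
```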
Least squares
You want to find the parameters for a model which best describes the data. Furthermore, you have specified that you want the best fit with respect to the $l_{2}$ norm. Let's look at a simpler case which allows us to explore the consequences of these choices.
Find the average
Computing the average is computing a least squares solution. Mathematical details follow.
Input data
Start with a sequence of $m$ measurements $\left\{ x_{k} \right\}^{m}_{k=1}$. Perhaps these numbers are test scores for a class.
Model
How would you characterize the performance of the class? Your model is simple:
$$
y(x) = \mu
$$
We know this number $\mu$ will be the average. The free parameter in the least squares fit is the constant $\mu$.
Least squares problem
The least squares problem minimizes the sum of the squares of the differences between the measurement and the prediction. Formally,
$$
\mu_{LS} = \left\{
\mu \in \mathbb{R} \colon
\sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} \text{ is minimized}
\right\}
$$
The function
$$
\sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2}
$$
is called a merit function. This is the target of minimization.
Least squares solution
We know how to find extrema for functions: we look for the points where the derivatives are $0$. Remember, the parameter of variation here is $\mu$.
$$
\frac{d}{d\mu} \sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} = 0
\tag{1}
$$
Sticklers may protest that this finds extrema, yet we need minima. Those fears are allayed by the separate question "How do we know that least squares solutions form a convex set?".
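In fact, the stickler can also be answered on the spot: the second derivative of the merit function is constant and positive,
$$
\frac{d^{2}}{d\mu^{2}} \sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} = 2m > 0,
$$
so the merit function is strictly convex and its lone extremum is a minimum.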
The derivative is
$$
\begin{align}
\frac{d}{d\mu} \sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} &= - 2 \sum_{k=1}^{m} \left( x_{k} - \mu \right)
\\ &= -2 \left ( \sum_{k=1}^{m} x_{k} - \mu \sum_{k=1}^{m} 1 \right )
\\ &= -2 \left ( \sum_{k=1}^{m} x_{k} - m \mu \right )
\end{align}
\tag{2}
$$
Using the results of $(2)$ in $(1)$ produces the answer
$$
m \mu = \sum_{k=1}^{m} x_{k}
\qquad \Rightarrow \qquad
\boxed{
\mu = \frac{1}{m} \sum_{k=1}^{m} x_{k}
}
$$
The answer is that the average best typifies a set of test scores.
Not surprising, but revealing.
Example
Sample data
$$
\begin{array}{cc}
k & x\\\hline
1 & 81 \\
2 & 11 \\
3 & 78 \\
4 & 18 \\
5 & 24 \\
\end{array}
$$
Solution
The merit function, the target of minimization, is
$$
\begin{align}
\sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} &= (11-\mu )^2+(18-\mu )^2+(24-\mu )^2+(78-\mu )^2+(81-\mu )^2
\\
&= 5 \mu ^2-424 \mu +13666
\end{align}
$$
Minimizing this function of $\mu$ would not give you a moment's hesitation.
$$
\frac{d}{d\mu}\left(5 \mu ^2-424 \mu +13666\right) = -424 + 10 \mu = 0
$$
The answer is the average
$$
\mu = \frac{\sum_{k=1}^{m} x_{k}}{m} = \frac{212}{5} = 42.4
$$
Visualization
The figure on the left shows the scores for students $1-5$, with the average a dashed line. The right panel shows equation $(1)$ and how it varies with $\mu$. Hopefully, this panel illustrates why you are looking for $0$s of the first derivative.
Notice that the sum of the squares of the errors is not $0$. The sum of the squares of the errors takes the minimum value of $4677.2$ when $\mu = 42.4$.
In summary, step back from the linear regression case, and look at this example as a problem in calculus.
Final question
Your final question
Why does a coefficient like a or b become so important? Why and how is a coefficient so prominently related to error? How does it affect anything?
opens another door to deep insight. Let's defer that answer to a new question like "How stable are least squares solutions against variations in the data?"
Best Answer
It seems to me that this is rather a question of "what method your instructor will accept" rather than "how to do the problem". So you should ask him/her.
However, if I can assume from the tag that you are doing a linear algebra course, and if the least squares method has been suggested, then perhaps the following is what is required.
First try to make the line $y=kx+b$ go through all the given points: we need $$k+b=4\,,\ -2k+b=5\,,\ 3k+b=-1\,,\ 4k+b=1\,.$$ Writing this in matrix form, $$\pmatrix{1&1\cr-2&1\cr3&1\cr4&1\cr}\pmatrix{k\cr b\cr} =\pmatrix{4\cr5\cr-1\cr1\cr}\ .$$ However the system has no solution (check this for yourself) and so there is no line going through the four points. To find the line which fits the points as nearly as possible, in the least squares sense, we write the above as $$A\pmatrix{k\cr b\cr}={\bf y}$$ and solve the normal equations $$A^TA\pmatrix{k\cr b\cr}=A^T{\bf y}\ .$$ It can be shown that if $A$ has linearly independent columns, then the normal equations will always have a unique solution. In this case we have $$\pmatrix{30&6\cr6&4\cr}\pmatrix{k\cr b\cr}=\pmatrix{-5\cr9\cr}\ ,$$ and you can solve this to find the "best" values of $k$ and $b$. They should be the same as those found by the method in your question, but I think you may have an error in your arithmetic.
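As a sketch, the same computation can be carried out in Python with a hand-rolled $2\times2$ solve (Cramer's rule), just to confirm the normal equations above:

```python
# Data points (x, y) from the question
pts = [(1, 4), (-2, 5), (3, -1), (4, 1)]

# Entries of the normal equations A^T A (k, b)^T = A^T y for the line y = kx + b:
# A^T A = [[sxx, sx], [sx, n]], A^T y = [sxy, sy]
sxx = sum(x * x for x, _ in pts)
sx = sum(x for x, _ in pts)
n = len(pts)
sxy = sum(x * y for x, y in pts)
sy = sum(y for _, y in pts)

# Solve the 2x2 system by Cramer's rule
det = sxx * n - sx * sx
k = (sxy * n - sx * sy) / det
b = (sxx * sy - sx * sxy) / det
print(k, b)  # best-fit slope and intercept
```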
Comment. This method is superior to the one you have suggested in your post because it is much more flexible. For example you can use exactly the same ideas to find the quadratic $y=a+bx+cx^2$ which best fits the data points, or the curve $y=a\cos x+b\sin x$ which best fits them, and so on.
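To illustrate that flexibility, the same four points can be fitted with a quadratic $y=a+bx+cx^2$: the design matrix simply gains a column of $x_i^2$, and the normal equations become a $3\times3$ system. A sketch (the small Gaussian-elimination solver is written from scratch here; any linear solver would do):

```python
pts = [(1, 4), (-2, 5), (3, -1), (4, 1)]

# Design matrix for the quadratic y = a + b x + c x^2
A = [[1.0, x, x * x] for x, _ in pts]
y = [float(yi) for _, yi in pts]

# Normal equations: (A^T A) p = A^T y
AtA = [[sum(A[i][r] * A[i][c] for i in range(len(A))) for c in range(3)]
       for r in range(3)]
Aty = [sum(A[i][r] * y[i] for i in range(len(A))) for r in range(3)]

def solve3(M, v):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [vi] for row, vi in zip(M, v)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

a, b, c = solve3(AtA, Aty)

# At the least squares solution, the residual is orthogonal to the columns of A
residual = [a + b * x + c * x * x - yi for (x, _), yi in zip(pts, y)]
print(a, b, c)
```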