The present answer is based on data $(x_1,y_1),(x_2,y_2),...,(x_k,y_k),...,(x_n,y_n)$ coming from a scan of the graph published in Morten Bakkedal's question. A graphical scan is not an accurate technique, so the results below will be less accurate than if the data came from a numerical table.
In a preliminary inspection, roughly two distinct alignments of points can be observed. This leads us to separate the set of points into two subsets and perform a linear regression on each one.
Even with such a simplistic method, the result is surprisingly good. See Figure 1. The RMSD $\simeq 0.00115$ is not bad.
In fact, this is the fitting of a piecewise function:
$$y(x)=\begin{cases} px+q \qquad\text{if } x\leq X \\ rx+s \qquad\text{if } x>X \end{cases}$$
$(X,Y)$ is the intersection point between the two rays (as defined on Figure 1).
The method for computing the values of the adjustable parameters $p,q,r,s$ is outlined below.
Of course, this supposes an initial guess of $K$, the index which separates the two sets of points. This is an easy guess just by visual inspection. Moreover, if the guess is not the best, it is easy to check it afterwards: the value of $X$ is computed and must lie between $x_K$ and $x_{K+1}$. If not, there is no difficulty in correcting the guessed $K$.
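A minimal sketch of this two-segment regression (assuming the data are held in numpy arrays `x`, `y` sorted by abscissa, with `K` the guessed split index; these names are illustrative, not the author's original computation):

```python
import numpy as np

def two_segment_fit(x, y, K):
    """Fit y = p*x + q to the first K points and y = r*x + s to the remaining ones."""
    p, q = np.polyfit(x[:K], y[:K], 1)   # left ray
    r, s = np.polyfit(x[K:], y[K:], 1)   # right ray
    X = (s - q) / (p - r)                # abscissa of the intersection of the two rays
    Y = p * X + q
    return p, q, r, s, X, Y
```

As noted above, one then checks that the returned $X$ falls between the last point of the first subset and the first point of the second, and adjusts $K$ otherwise.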
An equivalent form of the equation to be fitted is :
$$y(x)= (px+q)\text{H}(X-x)+(rx+s)\text{H}(x-X)$$
where H is the Heaviside step function.
Again this supposes an initial guess, now $X$. Again, this is an easy guess just by visual inspection. Again, if the guess is not the best, it is easy to check it afterwards and correct it if necessary.
One can use an approximation of the Heaviside function. A constant $c$ appears in it, but this is not a parameter to be adjusted. One can set $c=10$, $20$, $50$ or more; it doesn't matter. The resulting fit remains the same and the curve on Figure 1 doesn't change.
The fitted function is:
$$y(x)= \frac{px+q}{1+e^{c(x-X)}}+\frac{rx+s}{1+e^{c(X-x)}}$$
where $c=10$, for example. The value of $c$ doesn't matter as long as $c$ is large.
The result of the linear regression for $p,q,r,s$ is the same as above and leads to the same Figure 1.
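For instance, with $X$ and $c$ held fixed, the model is linear in $p,q,r,s$ and an ordinary least squares solve suffices. A hedged sketch (the array and function names are illustrative):

```python
import numpy as np
from scipy.special import expit   # expit(t) = 1 / (1 + exp(-t)), a numerically safe sigmoid

def fit_fixed_transition(x, y, X, c=10.0):
    """Linear least squares for p, q, r, s with the transition abscissa X and c held fixed."""
    wL = expit(c * (X - x))                       # smooth stand-in for H(X - x)
    wR = expit(c * (x - X))                       # smooth stand-in for H(x - X)
    A = np.column_stack([x * wL, wL, x * wR, wR])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    p, q, r, s = coef
    return p, q, r, s
```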
The only imperfection is that the transition between the two straight lines isn't smooth. The advantage of the above approach is that smoothness can be obtained very easily by using a coarser approximation of the Heaviside function.
This consists of using a smaller value of $c$. Of course, this introduces one more adjustable parameter, but there is no need for an accurate value of $c$: taking it with an order of magnitude around $1$ is sufficient to achieve a smooth transition.
For example, see Figure 3. The fitting is very good: RMSD $\simeq 0.00045$.
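One possible way to carry out this last fit, with $c$ now adjustable, is scipy's `curve_fit`, seeded with the parameters found in the piecewise step above. This is a sketch under those assumptions, not the author's actual computation:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit

def model(x, p, q, r, s, X, c):
    """Two blended straight lines with a smooth transition whose width is controlled by c."""
    return (p * x + q) * expit(c * (X - x)) + (r * x + s) * expit(c * (x - X))

# p0 comes from the two-segment fit above, with c started near 1 for a smooth transition:
# params, _ = curve_fit(model, x, y, p0=[p, q, r, s, X, 1.0])
```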
Case of the second graph published by Morten Bakkedal
In this case, the range where the function changes slope isn't reached, so one linear regression of the form $y(x)=px+q$ is sufficient to provide a significant answer.
Least squares
You want to find the parameters for a model which best describes the data. Furthermore, you have specified that you want the best fit with respect to the $l_{2}$ norm. Let's look at a simpler case which allows us to explore the consequences of these choices.
Find the average
Computing the average is computing a least squares solution. Mathematical details follow.
Input data
Start with a sequence of $m$ measurements $\left\{ x_{k} \right\}^{m}_{k=1}$. Perhaps these numbers are test scores for a class.
Model
How would you characterize the performance of the class? Your model is simple:
$$
y(x) = \mu
$$
We know this number $\mu$ will be the average. The free parameter in the least squares fit is the constant $\mu$.
Least squares problem
The least squares problem minimizes the sum of the squares of the differences between the measurement and the prediction. Formally,
$$
\mu_{LS} = \left\{
\mu \in \mathbb{R} \colon
\sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} \text{ is minimized}
\right\}
$$
The function
$$
\sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2}
$$
is called a merit function. This is the target of minimization.
Least squares solution
We know how to find extrema for functions: we look for the points where the derivatives are $0$. Remember, the parameter of variation here is $\mu$.
$$
\frac{d}{d\mu} \sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} = 0
\tag{1}
$$
Sticklers may protest that this finds extrema, yet we need minima. These fears will be allayed by posting the question "How do we know that least squares solutions form a convex set?".
The derivative is
$$
\begin{align}
\frac{d}{d\mu} \sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} &= - 2 \sum_{k=1}^{m} \left( x_{k} - \mu \right)
\\ &= -2 \left ( \sum_{k=1}^{m} x_{k} - \mu \sum_{k=1}^{m} 1 \right )
\\ &= -2 \left ( \sum_{k=1}^{m} x_{k} - m \mu \right )
\end{align}
\tag{2}
$$
Using the results of $(2)$ in $(1)$ produces the answer
$$
m \mu = \sum_{k=1}^{m} x_{k}
\qquad \Rightarrow \qquad
\boxed{
\mu = \frac{1}{m} \sum_{k=1}^{m} x_{k}
}
$$
The answer is that the average best typifies a set of test scores.
Not surprising, but revealing.
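A quick symbolic check of this result for a small $m$, using sympy (purely illustrative):

```python
import sympy as sp

mu = sp.symbols('mu')
xs = sp.symbols('x1:6')                     # x1, ..., x5, i.e. m = 5
merit = sum((xk - mu) ** 2 for xk in xs)    # the merit function
print(sp.solve(sp.diff(merit, mu), mu))     # -> the average (x1 + ... + x5)/5
```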
Example
Sample data
$$
\begin{array}{cc}
k & x\\\hline
1 & 81 \\
2 & 11 \\
3 & 78 \\
4 & 18 \\
5 & 24 \\
\end{array}
$$
Solution
The merit function, the target of minimization, is
$$
\begin{align}
\sum_{k=1}^{m} \left( x_{k} - \mu \right)^{2} &= (11-\mu )^2+(18-\mu )^2+(24-\mu )^2+(78-\mu )^2+(81-\mu )^2
\\
&= 5 \mu ^2-424 \mu +13666
\end{align}
$$
Minimizing this function of $\mu$ would not give you a moment's hesitation.
$$
\frac{d}{d\mu}\left(5 \mu ^2-424 \mu +13666\right) = -424 + 10 \mu = 0
$$
The answer is the average
$$
\mu = \frac{\sum_{k=1}^{m} x_{k}}{m} = \frac{212}{5} = 42.4
$$
Visualization
The figure on the left shows the scores for students $1-5$, with the average a dashed line. The right panel shows equation $(1)$ and how it varies with $\mu$. Hopefully, this panel illustrates why you are looking for $0$s of the first derivative.
Notice that the sum of the squares of the errors is not $0$. The sum of the squares of the errors takes the minimum value of $4677.2$ when $\mu = 42.4$.
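A short numerical check of the worked example (the variable names are illustrative):

```python
import numpy as np

scores = np.array([81, 11, 78, 18, 24])
mu = scores.mean()                   # least squares solution: the average
sse = np.sum((scores - mu) ** 2)     # minimized value of the merit function
print(mu, sse)                       # 42.4 and 4677.2, up to floating point rounding
```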
In summary, step back from the linear regression case, and look at this example as a problem in calculus.
Final question
Your final question
Why does a coefficient like a or b become so important? Why and how is a coefficient so prominently related to error? How does it affect anything?
opens another door to deep insight. Let's defer that answer to a new question like "How stable are least squares solutions against variations in the data?"
Best Answer
I think the problem is not precisely stated: it's unclear what "minimize the degree of the polynomial, while fitting fairly tightly to the data" means (what is "fairly tightly"?).
Anyway, for fixed degree, this may be formulated as a semidefinite program which can be solved approximately in polynomial time; there is a lot of software out there that can solve semidefinite programs very efficiently, e.g., sedumi.
Indeed, suppose your data points are $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$. Let $y$ be the vector that stacks up the $y_i$ and let $V$ be the Vandermonde matrix $$V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots& x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix}$$ Then, you'd like to minimize $||y-Vp||_2^2$ subject to the constraint that the entries of the vector $p$, which we will call $p_i$, correspond to a monotonic polynomial on your domain $[a,b]$, or, in other words, $p'(x) \geq 0$ on $[a,b]$: $$ p_1 + 2p_2 x + 3 p_3 x^2 + \cdots + n p_n x^{n-1} \geq 0 ~~~ \mbox{ for all } x \in [a,b].$$ This last constraint can be written as a semidefinite program, as outlined in these lecture notes. I will briefly outline the idea.
A univariate polynomial which is nonnegative has a representation as a sum of squares of polynomials. In particular, if $p'(x) \geq 0$ on $[a,b]$, and its degree $d$ is, say, even, then it can be written as $$p'(x) = s(x) + (x-a)(b-x) t(x), \tag{1}$$ where $s(x), t(x)$ are sums of squares of polynomials (this is Theorem 6 in the above-linked lecture notes; a similar formula is available for odd degree). The condition that a polynomial $s(x)$ is a sum of squares is equivalent to saying there is a nonnegative definite matrix $Q$ such that $$ s(x) = \begin{pmatrix} 1 & x & \cdots & x^{d/2} \end{pmatrix} Q \begin{pmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^{d/2} \end{pmatrix}.$$ This is Lemma 3 in the same lecture notes.
Putting it all together, what we optimize are the entries of the matrices $Q_1,Q_2$ that represent the polynomials $s(x)$ and $t(x)$ as sums of squares, which means imposing the constraints $Q_1 \succeq 0$, $Q_2 \succeq 0$. Then Eq. (1) is a linear constraint on the entries of the matrices $Q_1,Q_2$ (and on the coefficients of $p$). This gives you a semidefinite program you can input to your SDP solver.
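For concreteness, here is one way this construction could be set up with cvxpy. It is a sketch under the assumptions stated in the comments (even derivative degree, an SDP-capable default solver such as SCS), with illustrative names, not a reference implementation:

```python
import numpy as np
import cvxpy as cp

def fit_monotone_poly(x_data, y_data, deg, a, b):
    """Least squares polynomial fit of degree `deg`, constrained so that p'(x) >= 0 on [a, b].

    Assumes deg >= 3 and deg odd, so that d = deg - 1 (the degree of p') is even,
    matching the even-degree representation p'(x) = s(x) + (x - a)(b - x) t(x).
    """
    d = deg - 1
    assert d >= 2 and d % 2 == 0, "this sketch handles even derivative degree only"

    V = np.vander(np.asarray(x_data, dtype=float), deg + 1, increasing=True)
    y = np.asarray(y_data, dtype=float)
    p = cp.Variable(deg + 1)                       # coefficients p_0, ..., p_deg

    m1 = d // 2                                    # s(x) uses monomials 1, x, ..., x^m1
    m2 = (d - 2) // 2                              # t(x) uses monomials 1, x, ..., x^m2
    Q1 = cp.Variable((m1 + 1, m1 + 1), PSD=True)   # Gram matrix of s(x)
    Q2 = cp.Variable((m2 + 1, m2 + 1), PSD=True)   # Gram matrix of t(x)

    def gram_coeff(Q, size, k):
        # coefficient of x^k in z(x)^T Q z(x), with z(x) = (1, x, ..., x^size)
        return cp.sum(cp.hstack([Q[i, k - i]
                                 for i in range(max(0, k - size), min(size, k) + 1)]))

    s = [gram_coeff(Q1, m1, k) for k in range(d + 1)]   # coefficients of s(x), degree d
    t = [gram_coeff(Q2, m2, k) for k in range(d - 1)]   # coefficients of t(x), degree d - 2

    constraints = []
    for k in range(d + 1):
        # coefficient of x^k in p'(x)
        lhs = (k + 1) * p[k + 1]
        # coefficient of x^k in s(x) + (x - a)(b - x) t(x),
        # using (x - a)(b - x) = -x^2 + (a + b) x - a b
        rhs = s[k]
        if 0 <= k - 2 <= d - 2:
            rhs = rhs - t[k - 2]
        if 0 <= k - 1 <= d - 2:
            rhs = rhs + (a + b) * t[k - 1]
        if k <= d - 2:
            rhs = rhs - a * b * t[k]
        constraints.append(lhs == rhs)

    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - V @ p)), constraints)
    prob.solve()
    return p.value                                  # fitted coefficients, lowest degree first
```

The equality constraints simply match, power by power, the coefficients of $p'(x)$ against those of $s(x) + (x-a)(b-x)t(x)$, which is exactly the linear constraint (1) above.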