I think the problem is not precisely stated: it's unclear what "minimize the degree of the polynomial, while fitting fairly tightly to the data" means (what is "fairly tightly"?).
Anyway, for fixed degree, this can be formulated as a semidefinite program, which can be solved approximately in polynomial time; there is a lot of software out there that solves semidefinite programs very efficiently, e.g., SeDuMi.
Indeed, suppose your data points are $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$. Let $y$ be the vector that stacks up the $y_i$ and let $V$ be the Vandermonde matrix $$V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots& x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix}$$ Then, you'd like to minimize $||y-Vp||_2^2$ subject to the constraint that the entries of the vector $p$, which we will call $p_0, p_1, \ldots, p_n$, correspond to a monotonic polynomial on your domain $[a,b]$, or, in other words, $p'(x) \geq 0$ on $[a,b]$:
$$ p_1 + 2p_2 x + 3 p_3 x^2 + \cdots + n p_n x^{n-1} \geq 0 ~~~ \mbox{ for all } x \in [a,b].$$
This last constraint can be written as a semidefinite program, as outlined in these lecture notes. I will briefly outline the idea.
A univariate polynomial which is nonnegative on the whole real line can be written as a sum of squares of polynomials; on an interval, a weighted version of this holds. In particular, if $p'(x) \geq 0$ on $[a,b]$, and its degree $d$ is, say, even, then it can be written as $$p'(x) = s(x) + (x-a)(b-x) t(x),~~~~~~~~(1)$$ where $s(x), t(x)$ are sums of squares of polynomials (this is Theorem 6 in the above-linked lecture notes; a similar formula is available for odd degree). The condition that a polynomial $s(x)$ of degree $d$ is a sum of squares is equivalent to the existence of a positive semidefinite matrix $Q$ such that $$ s(x) = \begin{pmatrix} 1 & x & \cdots & x^{d/2} \end{pmatrix} Q \begin{pmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^{d/2} \end{pmatrix}.$$ This is Lemma 3 in the same lecture notes.
Putting it all together, what we optimize over are the entries of $p$ and of the matrices $Q_1, Q_2$ that represent $s(x) = v(x)^T Q_1 v(x)$ and $t(x) = v(x)^T Q_2 v(x)$ as sums of squares, where $v(x)$ is the vector of monomials above; this means imposing the constraints $Q_1 \succeq 0, Q_2 \succeq 0$. Eq. (1) then becomes a set of linear constraints relating the entries of $p$ to those of $Q_1, Q_2$. This gives you a semidefinite program you can feed to your SDP solver.
You want to fit the model
$$y=21-e^{a x}\tag 1$$
Certainly, you can get a preliminary estimate by writing
$$21-y=e^{a x}\implies \log(21-y)=ax\implies z=a x\tag 2$$ where $z=\log(21-y)$; a preliminary linear regression then gives $a=0.147233$ (just as you did).
In fact, you do not need to use regression since you can get $a$ directly from the normal equation
$$a=\frac{\sum_{i=1}^n x_iz_i } { \sum_{i=1}^n x_i^2 }=\frac{\sum_{i=1}^n x_i \log(21-y_i)} { \sum_{i=1}^n x_i^2 }$$
But this is only a preliminary step, since what is measured is $y$ and not $\log(21-y)$. So you need to continue with a nonlinear regression, using this estimate as the starting value. This leads to $a=0.149140$.
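Both steps can be sketched in a few lines, assuming SciPy's curve_fit is available; the array y is my transcription of the measurements from the table below.

```python
import numpy as np
from scipy.optimize import curve_fit

# Measurements (x = 0, 1, ..., 20), transcribed from the table
x = np.arange(21.0)
y = np.array([20, 20, 20, 20, 20, 20, 18, 18, 18, 18, 16,
              16, 14, 14, 12, 12, 10, 8, 6, 4, 2], dtype=float)

# Step 1: preliminary estimate from the linearized model z = log(21 - y) = a x
z = np.log(21 - y)
a0 = np.sum(x * z) / np.sum(x ** 2)      # normal equation of the line fit

# Step 2: nonlinear least squares on the original y scale, seeded with a0
(a1,), _ = curve_fit(lambda x, a: 21 - np.exp(a * x), x, y, p0=[a0])
```

Step 2 matters because least squares on $z$ weights the errors differently from least squares on the measured $y$.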
Let us compare the results for $y$ using both models
$$\left(
\begin{array}{cccc}
x & y & (2) & (1) \\
0 & 20 & 20.0000 & 20.0000 \\
1 & 20 & 19.8414 & 19.8392 \\
2 & 20 & 19.6576 & 19.6525 \\
3 & 20 & 19.4447 & 19.4357 \\
4 & 20 & 19.1979 & 19.1841 \\
5 & 20 & 18.9121 & 18.8921 \\
6 & 18 & 18.5809 & 18.5531 \\
7 & 18 & 18.1972 & 18.1595 \\
8 & 18 & 17.7526 & 17.7027 \\
9 & 18 & 17.2374 & 17.1723 \\
10 & 16 & 16.6406 & 16.5567 \\
11 & 16 & 15.9491 & 15.8421 \\
12 & 14 & 15.1479 & 15.0125 \\
13 & 14 & 14.2196 & 14.0495 \\
14 & 12 & 13.1441 & 12.9316 \\
15 & 12 & 11.8980 & 11.6339 \\
16 & 10 & 10.4542 & 10.1275 \\
17 & 8 & 8.78136 & 8.37881 \\
18 & 6 & 6.84319 & 6.34887 \\
19 & 4 & 4.59758 & 3.99245 \\
20 & 2 & 1.99576 & 1.25704
\end{array}
\right)$$
Using the estimate from model $(2)$ and transforming back to the $y$'s, the sum of squares is $8.28$, while model $(1)$ leads to a sum of squares of $6.66$, which is noticeably better.
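These two sums of squares are easy to check directly; a short sketch with numpy and the tabulated data:

```python
import numpy as np

x = np.arange(21.0)
y = np.array([20, 20, 20, 20, 20, 20, 18, 18, 18, 18, 16,
              16, 14, 14, 12, 12, 10, 8, 6, 4, 2], dtype=float)

# Sum of squared residuals of y = 21 - exp(a x) for both estimates of a
sse = {a: np.sum((y - (21 - np.exp(a * x))) ** 2)
       for a in (0.147233, 0.149140)}
```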
Moreover, it is interesting to look at the statistics.
For model $(2)$, we have
$$\begin{array}{clclclclc}
\text{} & \text{Estimate} & \text{Standard Error} & \text{Confidence Interval} \\
a & 0.147233 & 0.005034 & \{0.136698,0.157769\} \\
\end{array}$$ while for model $(1)$
$$\begin{array}{clclclclc}
\text{} & \text{Estimate} & \text{Standard Error} & \text{Confidence Interval} \\
a & 0.149140 & 0.000873 & \{0.147312,0.150967\} \\
\end{array}$$ showing that, with the "true" model, the standard error is reduced by a factor of almost $6$.
If you do not want to use nonlinear regression, you could use Excel to solve for $a$ the normal equation of the nonlinear least-squares problem,
$$f(a)=\sum_{i=1}^n x_i e^{ax_i}\left(21-e^{ax_i}-y_i \right)=0$$ starting from the preliminary guess. Even graphing the function could be sufficient.
For solving the equation, you could also use Newton's method with
$$f'(a)=\sum_{i=1}^n x_i^2 e^{ax_i}\left(21-2e^{ax_i}-y_i \right)$$ and use
$$a_{n+1}=a_n-\frac{f(a_n)}{f'(a_n)}$$ using for $a_0$ the value obtained from the preliminary step.
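A sketch of the iteration in Python, where $f$ is the least-squares normal equation $\sum_i x_i e^{ax_i}\left(21-e^{ax_i}-y_i\right)=0$ and $f'$ its derivative, and the data array is my transcription of the table above:

```python
import numpy as np

x = np.arange(21.0)
y = np.array([20, 20, 20, 20, 20, 20, 18, 18, 18, 18, 16,
              16, 14, 14, 12, 12, 10, 8, 6, 4, 2], dtype=float)

def f(a):
    # Gradient condition of the least-squares fit of y = 21 - exp(a x)
    e = np.exp(a * x)
    return np.sum(x * e * (21 - e - y))

def fprime(a):
    e = np.exp(a * x)
    return np.sum(x ** 2 * e * (21 - 2 * e - y))

a = 0.147233                      # preliminary estimate as a_0
for _ in range(6):                # a handful of Newton steps suffices
    a -= f(a) / fprime(a)
```

Quadratic convergence means a few iterations already reproduce the limit to many digits.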
For your problem, Newton iterates would be
$$\left(
\begin{array}{cc}
n & a_n \\
0 & 0.1472330000 \\
1 & 0.1492437955 \\
2 & 0.1491401458 \\
3 & 0.1491398530
\end{array}
\right)$$
Edit
If we consider the data set outside its specific context, we could have obtained a better fit using
$$y=a-b\, e^{cx}\tag 3$$ which leads to a sum of squares equal to $4.97$ with the following parameters
$$\begin{array}{clclclclc}
\text{} & \text{Estimate} & \text{Standard Error} & \text{Confidence Interval} \\
a & 22.1098 & 0.5276 & \{20.9968,23.2229\} \\
b & 1.57255 & 0.3101 & \{0.91830,2.22680\} \\
c & 0.12823 & 0.0092 & \{0.10875,0.14771\} \\
\end{array}$$ leading to the following results
$$\left(
\begin{array}{ccc}
x & y & (3) \\
0 & 20 & 20.5373 \\
1 & 20 & 20.3221 \\
2 & 20 & 20.0775 \\
3 & 20 & 19.7995 \\
4 & 20 & 19.4834 \\
5 & 20 & 19.1241 \\
6 & 18 & 18.7156 \\
7 & 18 & 18.2513 \\
8 & 18 & 17.7234 \\
9 & 18 & 17.1233 \\
10 & 16 & 16.4410 \\
11 & 16 & 15.6655 \\
12 & 14 & 14.7838 \\
13 & 14 & 13.7816 \\
14 & 12 & 12.6422 \\
15 & 12 & 11.3469 \\
16 & 10 & 9.87440 \\
17 & 8 & 8.20046 \\
18 & 6 & 6.29750 \\
19 & 4 & 4.13420 \\
20 & 2 & 1.67494
\end{array}
\right)$$
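For the three-parameter model, a similar sketch with SciPy; the starting guess is mine, chosen near the two-parameter fit.

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(21.0)
y = np.array([20, 20, 20, 20, 20, 20, 18, 18, 18, 18, 16,
              16, 14, 14, 12, 12, 10, 8, 6, 4, 2], dtype=float)

def model(x, a, b, c):
    return a - b * np.exp(c * x)

# Rough starting guess (mine) near the two-parameter fit
popt, _ = curve_fit(model, x, y, p0=[21.0, 1.5, 0.13])
sse = np.sum((y - model(x, *popt)) ** 2)
```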
Best Answer
AIC or BIC is the right path to follow. These criteria help you determine the model that best approximates the generating mechanism. Another, more basic, approach is the Ramsey RESET test for model misspecification. For example, see http://lipas.uwasa.fi/~sjp/Teaching/ecm/lectures/ecmc8.pdf
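As an illustration, using the least-squares form $AIC = n\log(SSE/n) + 2k$ (with $k$ the number of fitted parameters) and the sums of squares reported in the previous answer for models $(1)$ and $(3)$; this is a sketch of the comparison, not a full model-selection workflow.

```python
import math

n = 21                                   # number of observations
# (sum of squares, number of parameters) for models (1) and (3)
models = {"(1)": (6.66, 1), "(3)": (4.97, 3)}
aic = {name: n * math.log(sse / n) + 2 * k
       for name, (sse, k) in models.items()}
# Model (3) attains the lower AIC despite its extra parameters
```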