I think the problem is not precisely stated: it's unclear what "minimize the degree of the polynomial, while fitting fairly tightly to the data" means (what counts as "fairly tightly"?).
Anyway, for fixed degree, this may be formulated as a semidefinite program, which can be solved approximately in polynomial time; there is a lot of software out there that can solve semidefinite programs very efficiently, e.g., SeDuMi.
Indeed, suppose your data points are $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ and the fixed degree is $d$. Let $y$ be the vector that stacks up the $y_i$ and let $V$ be the $n \times (d+1)$ Vandermonde matrix $$V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^d \end{pmatrix}.$$ Then you'd like to minimize $\|y-Vp\|_2^2$ subject to the constraint that the entries of the coefficient vector $p$, which we will call $p_0, p_1, \ldots, p_d$, correspond to a polynomial that is monotonic on your domain $[a,b]$, or, in other words, $p'(x) \geq 0$ on $[a,b]$:
$$ p_1 + 2p_2 x + 3 p_3 x^2 + \cdots + d p_d x^{d-1} \geq 0 \quad \mbox{ for all } x \in [a,b].$$
This last constraint can be written as a semidefinite program, as outlined in these lecture notes. I will briefly outline the idea.
A univariate polynomial which is nonnegative on all of $\mathbb{R}$ has a representation as a sum of squares of polynomials. For nonnegativity on an interval there is a similar weighted representation: if $p'(x) \geq 0$ on $[a,b]$, and its degree $m = d-1$ is, say, even, then it can be written as $$p'(x) = s(x) + (x-a)(b-x)\, t(x), \qquad (1)$$ where $s(x), t(x)$ are sums of squares of polynomials, of degrees $m$ and $m-2$ respectively (this is Theorem 6 in the above-linked lecture notes; a similar formula is available for odd degree). The condition that a polynomial $s(x)$ of degree $m$ is a sum of squares is equivalent to saying there is a positive semidefinite matrix $Q$ such that $$ s(x) = \begin{pmatrix} 1 & x & \cdots & x^{m/2} \end{pmatrix} Q \begin{pmatrix} 1 \\ x \\ \vdots \\ x^{m/2} \end{pmatrix}.$$ This is Lemma 3 in the same lecture notes.
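For instance (a small example of my own choosing), $s(x) = 1 + 2x + 2x^2$ is a sum of squares, and correspondingly
$$ s(x) = \begin{pmatrix} 1 & x \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix}, \qquad \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \succeq 0, $$
which matches the explicit decomposition $s(x) = (1+x)^2 + x^2$.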
Putting it all together, what we optimize over are the entries of the matrices $Q_1,Q_2$ that make the polynomials $s(x)=z^T Q_1 z$ and $t(x)=z^T Q_2 z$ sums of squares, where $z$ is the vector of monomials $(1, x, x^2, \ldots)^T$ of the appropriate length; this means imposing the constraints $Q_1 \succeq 0, Q_2 \succeq 0$. Then Eq. (1) becomes a set of linear constraints on the entries of $Q_1,Q_2$ and the coefficients of $p$: just match the coefficient of each power of $x$ on both sides. This gives you a semidefinite program you can input to your SDP solver.
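Here is a minimal sketch of this program in Python using cvxpy (the library choice is my assumption; the same model could be passed to SeDuMi through YALMIP). It assumes $d$ is odd and at least 3, so that $m = \deg p' = d-1$ is even and Eq. (1) applies directly.

```python
import numpy as np
import cvxpy as cp

def fit_monotone_poly(x, y, d, a, b):
    """Least-squares fit of a degree-d polynomial that is nondecreasing on [a, b].
    Sketch only: assumes d odd and >= 3, so m = deg p' = d - 1 is even."""
    assert d % 2 == 1 and d >= 3
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = d - 1
    V = np.vander(x, d + 1, increasing=True)              # columns 1, x, ..., x^d
    p = cp.Variable(d + 1)                                # coefficients p_0, ..., p_d
    Q1 = cp.Variable((m // 2 + 1, m // 2 + 1), PSD=True)  # Gram matrix of s, deg m
    Q2 = cp.Variable((m // 2, m // 2), PSD=True)          # Gram matrix of t, deg m-2

    # coefficient of x^r in z^T Q z, where z = (1, x, ..., x^{size-1})
    def coeff(Q, size, r):
        return sum(Q[i, r - i] for i in range(size) if 0 <= r - i < size)

    g = [-a * b, a + b, -1.0]                # (x - a)(b - x) = -ab + (a+b)x - x^2
    constraints = []
    for r in range(m + 1):                   # Eq. (1), matched power by power
        s_r = coeff(Q1, m // 2 + 1, r)
        w_r = sum(g[j] * coeff(Q2, m // 2, r - j)
                  for j in range(3) if 0 <= r - j <= m - 2)
        constraints.append((r + 1) * p[r + 1] == s_r + w_r)

    prob = cp.Problem(cp.Minimize(cp.sum_squares(V @ p - y)), constraints)
    prob.solve()                             # needs an SDP-capable solver, e.g. SCS
    return p.value
```

For instance, `fit_monotone_poly(x, y, 3, 0.0, 1.0)` would fit a nondecreasing cubic on $[0,1]$.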
If you have the curve, then geometrically that is all you need to find the value of the derivative at a given point: estimate the direction of the tangent line at a given point $x=a$; the slope of that tangent line is the value of $f'(a)$.
If you have a table of values, let's say you know $f(2.9), f(3), f(3.1)$, etc., but have no information about $f(3.05)$, then you can still estimate $f'(3)$ by computing the average rate of change over the smallest interval available in the data. For example, $f'(3) \approx \frac{f(3.1) - f(3)}{0.1}$ or $f'(3) \approx \frac{f(3) - f(2.9)}{0.1}$. A better estimate can often be had by averaging those two, which gives $f'(3) \approx \frac{f(3.1) - f(2.9)}{0.2}$.
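As a quick sanity check (with $f(x) = x^2$ as my own example, so that $f'(3) = 6$ exactly):

```python
f = lambda x: x**2                    # hypothetical example function; f'(3) = 6

forward  = (f(3.1) - f(3.0)) / 0.1    # ~6.1
backward = (f(3.0) - f(2.9)) / 0.1    # ~5.9
central  = (f(3.1) - f(2.9)) / 0.2    # ~6.0 -- the averaged estimate is exact here
```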
Hope this helps!
Best Answer
Numerical differentiation looks reasonable to me. You can choose a one-sided finite difference, $y'(x)\approx \frac{y(x+h)-y(x)}{h}$, or a symmetric one, $y'(x)\approx \frac{y(x+h)-y(x-h)}{2h}$. The second one preserves symmetry and is better in most situations where you only need a derivative, as opposed to, say, stepping through an equation solver. The one-sided version can use either the left or the right neighbour.
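A minimal sketch of both on uniformly sampled data (NumPy is my choice of library):

```python
import numpy as np

def one_sided_diff(y, h):
    # forward difference; estimates y' at samples 0 .. n-2
    return (y[1:] - y[:-1]) / h

def symmetric_diff(y, h):
    # central difference; estimates y' at the interior samples 1 .. n-2
    return (y[2:] - y[:-2]) / (2 * h)
```

NumPy's built-in `np.gradient` does the symmetric version on the interior and falls back to one-sided differences at the endpoints.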
Differentiation always amplifies the noise, but it is misleading to interpolate first: interpolation only makes up information that isn't there. You can end up with data that no longer satisfies the correct statistical properties or inequalities, and that can lead to misuse.
It makes more sense to smooth after taking the discrete derivative; that way the entire information is still there. If you know that the high-frequency noise is not physical, you can filter afterwards: any low-pass filter can be used (a running average, or any built-in smoothing method).
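For instance (a sketch, assuming uniform sampling; the window length is an arbitrary parameter of mine):

```python
import numpy as np

def smoothed_derivative(y, h, window=5):
    dy = (y[2:] - y[:-2]) / (2 * h)        # symmetric difference first
    kernel = np.ones(window) / window      # running average as the low-pass filter
    return np.convolve(dy, kernel, mode="same")
```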
If you have periodic data, or if you have enough zeros on the left and the right (or if you premultiply by a window function), you can use the Fourier transform to take the derivative. Perform an FFT; in Fourier space, differentiation is multiplication by the frequency and the imaginary unit. Then just transform back. This is a bit better in the sense that it doesn't treat the derivative as a purely local operator but takes the "big picture" into account, and thus may be a bit more stable. It is a good method if you also want to filter, because you can do both at the same time: just multiply by the frequency filter before inverting the transform.
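A sketch of that combined step (assuming uniformly spaced, periodic samples; the `cutoff` parameter is my own addition):

```python
import numpy as np

def fft_derivative(y, h, cutoff=None):
    f = np.fft.fftfreq(len(y), d=h)        # frequency of each FFT bin (cycles/unit)
    Y = np.fft.fft(y)
    if cutoff is not None:
        Y[np.abs(f) > cutoff] = 0.0        # optional low-pass filter, same pass
    return np.fft.ifft(2j * np.pi * f * Y).real   # multiply by i*omega, invert
```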