To simplify, let's discuss first what happens near $x_0 = 0$.
The way the local linear approximation works is that we are looking for the line that "best approximates" the function near $0$. This means that we want a function $y=a+bx$ with the property that the error between $y=a+bx$ and $y=f(x)$ goes to $0$ faster than $x$ goes to $0$. That is, we want
$$\lim_{x\to 0}\frac{f(x) - (a+bx)}{x} = 0.$$
A bit of work shows that the only $a$ and $b$ that can work are $a=f(0)$ and $b=f'(0)$:
$$\begin{align*}
\lim_{x\to 0}\frac{f(x)-(a+bx)}{x} &= \lim_{x\to 0}\frac{f(x)-f(0)+f(0)-(a+bx)}{x}\\
&= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-(a+bx)}{x}\\
&= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-a}{x} - \lim_{x\to 0}\frac{bx}{x}.
\end{align*}$$
If the function is differentiable, then the first limit is $f'(0)$; the last limit is $b$; and if we want the limit in the middle to exist, we need $f(0)=a$. So the only way this limit is equal to $0$ is if $b=f'(0)$ and $a=f(0)$.
This tells us that the line of "best approximation" to $f(x)$ near zero is $y = f(0) + f'(0)x$, which is precisely the local linear approximation.
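As a quick sanity check, take $f(x)=e^x$: the recipe says the line should be $y = 1 + x$, and indeed the relative error goes to $0$:
$$\lim_{x\to 0}\frac{e^x - (1+x)}{x} = \lim_{x\to 0}\left(\frac{e^x - 1}{x} - 1\right) = 1 - 1 = 0.$$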
(The fraction $\frac{f(x)-(a+bx)}{x}$ is the "relative error": the numerator measures the error, and by dividing it by $x$ we take into account how large the quantity is. If I tell you that I measured a distance and I was off by 2 miles, you don't know if it was a very good approximation or a very bad one; it would be very good if I were measuring the distance to the moon, and very bad if I were measuring the length of my desk. The denominator "normalizes" the measurement so that it tells us how big the error is relative to how big the thing we are measuring is.)
Now, the advantage of the local linear approximation is that it is very simple; the disadvantage is that it can get bad pretty quickly. For example, the local linear approximation of $y=\cos x$ at $0$ is $y=1$, which becomes a poor approximation very quickly as you move away from $0$.
So we may want to approximate with something else which, while still easy, has a better chance of being a good approximation.
One possibility is to go from linear to quadratic: let's try to approximate $y=f(x)$ with a quadratic function, $y = a+bx+cx^2$; again, in order for the approximation to be very good, we want the error to go to zero faster than $x$ goes to $0$. And since now we have a parabola, we can require that the curvature of the parabola at $0$ be the same as the curvature of $y=f(x)$ at $0$ (this is what messes us up with $y=\cos x$: it has big curvature, but the local linear approximation is flat).
Now, for this parabola to approximate $y=f(x)$ well near $0$, we are going to want their local linear approximations to be the same (if they have different local linear approximations, then we can't expect $y=a+bx+cx^2$ to be "close to" $f(x)$). The local linear approximation to $y=a+bx+cx^2$ at $x=0$ is given by $a+bx$, so we want $a+bx = f(0)+f'(0)x$; that is, $a=f(0)$ and $b=f'(0)$.
To get the same curvature, we want $f''(0) = y''(0)$. Since $y''(0) = 2c$, we want $c = \frac{1}{2}f''(0)$. So our $y$ will be
$$y = f(0) + f'(0)x + \frac{f''(0)}{2}x^2.$$
In fact, this is indeed a very good approximation: it has the same value as $f(x)$ at $x=0$; the relative error goes to zero, since
$$\begin{align*}
\lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x + \frac{1}{2}f''(0)x^2)}{x} &=
\lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x)}{x} - \lim_{x\to 0}\frac{f''(0)x^2}{2x}\\
&= 0 - \lim_{x\to 0}\frac{1}{2}f''(0)x = 0.
\end{align*}$$
But more than that: the relative error between the derivatives goes to $0$:
$$\begin{align*}
\lim_{x\to 0}\frac{f'(x) - (f'(0)+f''(0)x)}{x} &= \lim_{x\to 0}\frac{f'(x)-f'(0)}{x}- f''(0)\\
&= f''(0)-f''(0) = 0.
\end{align*}$$
So not only does $y=f(0) + f'(0)x + \frac{1}{2}f''(0)x^2$ have the same value as $f$ at $0$ and the best possible relative error, but its derivative is also the best possible approximation to the derivative of $f(x)$; so the graph will have a very similar shape near $0$.
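To make this concrete, here is a small numerical sketch (Python, just for illustration) comparing $\cos x$, the example from above, with its linear approximation $1$ and its quadratic approximation $1 - \frac{1}{2}x^2$:

```python
import math

# Compare cos(x) against its linear approximation (y = 1) and its
# quadratic approximation (y = 1 - x^2/2) at a few points near 0.
for x in (0.1, 0.5, 1.0):
    exact = math.cos(x)
    err_linear = abs(exact - 1.0)
    err_quadratic = abs(exact - (1.0 - x**2 / 2))
    print(f"x={x}: linear error={err_linear:.2e}, quadratic error={err_quadratic:.2e}")
```

Already at $x=0.5$ the quadratic error is roughly fifty times smaller than the linear one.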
If we then go on to a degree $3$ approximation, $y=a+bx+cx^2+dx^3$, and we want the relative error to go to $0$, we are going to need $a=f(0)$ and $b=f'(0)$. If we want the relative error in the derivative to go to $0$, we are going to need $c=\frac{1}{2}f''(0)$ as before. What if we want the relative error of the second derivative to go to $0$ as well? The second derivative of $y=a+bx+cx^2+dx^3$ is $2c+6dx = f''(0) + 6dx$. So we have:
$$\begin{align*}
\lim_{x\to 0}\frac{f''(x) - (f''(0)+6dx)}{x} &= \lim_{x\to 0}\frac{f''(x)-f''(0)}{x} - \lim_{x\to 0}6d\\
&= f'''(0) - 6d.
\end{align*}$$
For this to be equal to $0$, we need $6d = f'''(0)$, or $d = \frac{1}{6}f'''(0)$.
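As a quick check with $f(x)=\sin x$: $f(0)=0$, $f'(0)=1$, $f''(0)=0$, and $f'''(0)=-1$, so the degree $3$ approximation is
$$y = x - \frac{1}{6}x^3.$$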
If we then go to a degree $4$ approximation and ask that the relative error, the relative error of the derivatives, the relative error of the second derivatives, and the relative error of the third derivatives all go to $0$, then we find that we need the function $y= f(0) +f'(0)x + \frac{1}{2}f''(0)x^2 + \frac{1}{6}f'''(0)x^3 + ex^4$, where the same computation applied to the third derivative ($y''' = 6d + 24ex$) gives
$f^{(4)}(0) = 24e$, leading to $e = \frac{1}{24}f^{(4)}(0)$.
It is not hard to then see, inductively, that the coefficient we get for $x^n$ if we proceed this way will always be $\frac{f^{(n)}(0)}{n!}$.
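If you want to verify this symbolically, here is a minimal sketch using sympy (an assumption on my part; any computer algebra system would do), comparing the coefficient of $x^n$ in the expansion of $\sin x$ at $0$ with $f^{(n)}(0)/n!$:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)  # any sufficiently differentiable function works here

# The coefficient of x^n in the Taylor expansion at 0 should equal f^(n)(0)/n!.
series = f.series(x, 0, 7).removeO()
for n in range(7):
    from_series = series.coeff(x, n)
    from_derivative = sp.diff(f, x, n).subs(x, 0) / sp.factorial(n)
    print(n, from_series, from_derivative)  # the two columns agree
```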
You can repeat the same idea around any point $x_0$, but if you do it directly it turns out that you cannot easily use the coefficients you found for, say, the linear approximation, in the quadratic approximation; you end up with a system of equations instead of simply repeating the old coefficients. The simple way of "fixing" this is to shift everything to $0$, solve the problem at $0$, and then shift it back.
So if you want to solve the problem at $x_0$, we consider instead $g(x) = f(x+x_0)$, because then approximating $f$ near $x_0$ is the same as approximating $g$ at $0$. Moreover, $g'(x) = f'(x+x_0)(x+x_0)' = f'(x+x_0)$, $g''(x) = f''(x+x_0)$, etc. So, from the work above, we see that the degree $n$ approximation to $g$ near $0$ that has best possible relative error, best possible relative error between derivatives, best possible relative error between 2nd derivatives, etc. is
$$g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + \cdots + \frac{1}{n!}g^{(n)}(0)x^n.$$
But from the above, this is the same as
$$f(x_0) + f'(x_0)x + \frac{1}{2}f''(x_0)x^2 +\cdots + \frac{1}{n!}f^{(n)}(x_0)x^n.$$
This is the local approximation to $f(x+x_0)$ near $x=0$. Now replace $x$ with $x-x_0$: then $f((x-x_0)+x_0) = f(x)$, and since the "old $x$" was near $0$, the new $x$ needs to be near $x_0$. So for $x$ near $x_0$, we have
$$f(x)\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + \cdots + \frac{1}{n!}f^{(n)}(x_0)(x-x_0)^n,$$
which is exactly the formula for the $n$th Taylor polynomial approximation to $f(x)$ near $x_0$.
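For example, for $f(x)=\ln x$ at $x_0=1$ we have $f(1)=0$, $f'(1)=1$, $f''(1)=-1$, and $f'''(1)=2$, so the degree $3$ approximation is
$$\ln x \approx (x-1) - \frac{1}{2}(x-1)^2 + \frac{1}{3}(x-1)^3 \quad\text{for } x \text{ near } 1.$$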
Sam L. raises the fair question of why we would want to assume that our approximations are polynomials. If you go via the "make the relative error go to $0$; make the relative error of the derivative go to $0$" route, etc., you will discover that you are naturally led to polynomials.
Here's another motivation: when we encounter a function which we want to integrate and for which we are unable to find an antiderivative, we end up with two possible approaches:
Attempt to approximate the value of the integral via Riemann sums; in essence, the integral is a limit, so we can approximate the limit by using terms of the sequence whose limit we are trying to compute. This leads to the left, right, midpoint, trapezoid, and Simpson approximations, and other methods of numerical approximation of the integral using the function $f$.
Find a function $\mathcal{F}$ which approximates $f$ but is easy to integrate. (In fact, this is in part what is behind the Simpson rule approximation, which approximates the function $f$ on each subinterval by a quadratic function.) Since polynomials are generally easy to integrate, trying to find polynomials that are good approximations to $f$ seems like a good idea; that is, the class of polynomials is a good proving ground in the search for "good approximations to $f$", because polynomials are easy to integrate.
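As an illustration of the second approach, here is a minimal Python sketch (using $f(x)=e^{-x^2}$, chosen because it has no elementary antiderivative): integrate the Taylor polynomial term by term, and compare against a crude midpoint Riemann sum.

```python
import math

# Approach 2: integrate the Taylor polynomial of e^{-x^2} term by term.
# e^{-x^2} = sum over k of (-1)^k x^{2k} / k!, so on [0, 1] each term
# integrates to (-1)^k / (k! * (2k + 1)).
taylor_estimate = sum((-1)**k / (math.factorial(k) * (2 * k + 1)) for k in range(10))

# Approach 1, for comparison: a midpoint Riemann sum with many subintervals.
n = 100_000
riemann_estimate = sum(math.exp(-((i + 0.5) / n) ** 2) for i in range(n)) / n

print(taylor_estimate, riemann_estimate)  # both are close to 0.746824...
```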
I think it's a combination of a lot of things. There is always simply the possibility that some terms cancel out, or that the higher order derivatives don't exist for the function you are trying to approximate.
Many distributions only consider second order interactions between variables (quadratic), and thus derivatives higher than the second end up evaluating to $0$ anyway. Your example of the Fisher information as the covariance of the asymptotic distribution of the MLE is a good one. Suppose you were looking at the asymptotic distribution of the mean in a multivariate normal with known precision $\Omega$: $X_i \overset{iid}{\sim} N(\mu,\Omega^{-1})$.
If you calculate the Fisher information matrix, you end up getting rid of the exponential through the log, and you're left with quadratic interactions as a function of $\Omega$:
$$\mathcal{I}(\mu)_{rs} = \frac{\partial^2}{\partial\mu_r\,\partial\mu_s}\, \frac{1}{2}\sum_{j,k}\mu_j\mu_k\Omega_{jk} = \Omega_{rs}.$$
So you can see that if you consider higher order derivatives, you end up with terms that vanish anyway. This isn't explicitly a Taylor series, but the idea that most models and distributions consider second order interactions gives intuition about why second order approximations come up so much. The fact that second order interactions are so easy to work with also motivates stopping at a second order approximation, as it's usually sufficient, and the problem can rapidly get more difficult to solve if one expands to a higher order approximation.
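As a concrete sketch of why the higher order terms vanish here: for a bivariate normal with known precision matrix, the log-density is quadratic in $\mu$, so sympy reports a constant Hessian and identically zero third derivatives. The specific $\Omega$ below is an arbitrary positive-definite choice, purely for illustration.

```python
import sympy as sp

mu1, mu2, x1, x2 = sp.symbols('mu1 mu2 x1 x2')
Omega = sp.Matrix([[2, 1], [1, 3]])  # arbitrary positive-definite precision matrix
d = sp.Matrix([x1 - mu1, x2 - mu2])

# The mu-dependent part of the log-density: -(1/2) (x - mu)^T Omega (x - mu).
loglik = -sp.Rational(1, 2) * (d.T * Omega * d)[0, 0]

print(sp.hessian(loglik, (mu1, mu2)))  # constant: -Omega, so I(mu) = Omega
print(sp.diff(loglik, mu1, 3))         # 0: third and higher derivatives vanish
```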
There are also situations in which clever choices of the argument and the point about which you expand your Taylor series make higher order terms drop out. You see this a lot with people expanding their function $f(x)$ about a local maximum or minimum $x^*$, where the first order term $(x - x^*)f'(x^*)$ falls out. It's very possible that higher order derivatives do exist in some regions of the domain (unlike in the previous Fisher information example), but if you specify $x^*$ conveniently you don't have to worry about it. Another way this comes up is in limits; if you're interested in the behavior of $f(\epsilon)$ for small $\epsilon$, then an expansion around $0$ gives:
$f(\epsilon) = f(0) + \epsilon f'(0) + \sum_{n = 2}^\infty \frac{\epsilon^n f^{(n)}(0)}{n!}$
And you can see that if you're examining the limiting behavior, with $\epsilon$ very small, then $\epsilon^n$ rapidly becomes negligible, and those higher order terms can mostly be dismissed.
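A tiny numerical illustration of that dismissal (a sketch with $f=\exp$, chosen only because its derivatives are easy to evaluate): the error of the first order expansion shrinks like $\epsilon^2$.

```python
import math

# For f = exp at 0, f(eps) - (1 + eps) should behave like eps^2 / 2.
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    err = math.exp(eps) - (1 + eps)
    print(f"eps={eps:g}: error={err:.3e}, error/eps^2={err / eps**2:.4f}")  # ratio -> 0.5
```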
As other people have also said, people definitely do consider the accuracy of the approximation as well, whether that be deterministically or probabilistically. You can see that on the Wikipedia page for the Delta Method: https://en.wikipedia.org/wiki/Delta_method. It briefly goes into the order of the approximation, and most advanced textbooks do this in general when using Taylor series.
In general, I think a lot of it also has to do with which aspects of the problem people are interested in. There are obviously a lot of connections between Taylor series and moments, because a Taylor expansion turns what can be a very complicated function into a polynomial, for which moments are easier to calculate. You can see this used, for example, in these two pages:
https://en.wikipedia.org/wiki/First-order_second-moment_method
https://en.wikipedia.org/wiki/Taylor_expansions_for_the_moments_of_functions_of_random_variables
So, coming back to the fact that most investigations of interest focus on second order interactions, or on asymptotic normality (where you only need the first two moments), it makes sense that people stop at the quadratic term.
Best Answer
The Taylor polynomial does not always tell you the value of $f(x)$ everywhere.
There is a radius of convergence. So if you look at the Taylor series of $\ln x$ centered at $a$, it will only converge if $|x-a| < a$, which is kind of obvious: as $x$ gets near zero, something has to break.
You could say that if the function is analytic, and not merely "smooth" (i.e. infinitely differentiable; the classic example $e^{-1/x^2}$, extended by $0$ at $x=0$, is smooth but not analytic at $0$), then all of its derivatives at a single point tell you everything you need to know about the function everywhere inside the radius of convergence.
Anyway, how do you know that the Taylor polynomial of a function will be near that function for all $x$ in the domain?
Taylor showed that for an approximation of any degree, he could estimate the error.
Suppose $T_n(x)$ is the $n^{th}$ degree Taylor polynomial of some function $f(x)$; e.g., $T_3(x) = x - \frac 16 x^3$ is the third degree Taylor approximation of $\sin x$ centered at $0$.
This estimate has some error, and that error can be bounded by a term of higher degree:
$$|T_k(x) - f(x)| \le M_{k+1} |x-a|^{k+1},$$
where
$$M_{k+1} = \frac {1}{(k+1)!} |f^{(k+1)}(\xi)|$$
and $\xi$ is the point of the interval between $a$ and $x$ that maximizes the absolute value of the $(k+1)^{th}$ derivative.
If $x$ is inside the radius of convergence, $|x-a| < r$, then $\lim\limits_{k\to \infty} M_{k+1}(x-a)^{k+1} = 0$.
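For example, with $f(x)=\sin x$ and $a=0$, every derivative is bounded by $1$ in absolute value, so for $T_3$ the bound gives
$$|T_3(x) - \sin x| \le \frac{|x|^4}{4!},$$
which at $x=0.1$ is at most $\frac{0.1^4}{24} \approx 4.2\times 10^{-6}$: even a low degree polynomial is extremely accurate close to the center.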