To simplify, let's discuss first what happens near $x_0 = 0$.
The way the local linear approximation works is that we are looking for the line that "best approximates" the function near $0$. This means that we want a function $y=a+bx$ with the property that the error between $y=a+bx$ and $y=f(x)$ goes to $0$ faster than $x$ approaches $0$. The first That is, we want
$$\lim_{x\to 0}\frac{f(x) - (a+bx)}{x} = 0.$$
A bit of work shows that the only $a$ and $b$ that can work are $a=f(0)$ and $b=f'(0)$:
$$\begin{align*}
\lim_{x\to 0}\frac{f(x)-(a+bx)}{x} &= \lim_{x\to 0}\frac{f(x)-f(0)+f(0)-(a+bx)}{x}\\
&= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-(a+bx)}{x}\\
&= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-a}{x} - \lim_{x\to 0}\frac{bx}{x}.
\end{align*}$$
If the function is differentiable, then the first limit is $f'(0)$; the last limit is $b$; and if we want the limit in the middle to exist, we need $f(0)=a$. So the only way this limit is equal to $0$ is if $b=f'(0)$ and $a=f(0)$.
This tells us that the line of "best approximation" to $f(x)$ near zero is $y = f(0) + f'(0)x$, which is precisely the local linear approximation.
(The fraction $\frac{f(x)-(a+bx)}{x}$ is the "relative error", because the numerator measures the error, but by dividing it by $x$ we are trying to take into account how large the quantity is; if I tell you that I measured a distance and I was off by 2 miles, you don't know if it was a very good approximation or a very bad one; it would be very good if I were measuring the distance to the moon; it would be very bad if I was measuring the length of my desk; the denominator tries to "normalize" the measurement so that it tells us how big the error is relative to how big the thing we are measuring is)
Now, the advantage of the local linear approximation is that it is very simple; the disadvantage is that it can get bad pretty quickly; for example, the local linear approximation of $y=\cos x$ at $0$ is $y=1$, which gets "bad" very quickly.
So we may want to approximate with something else which, while still easy, has a better chance of being a good approximation.
One possibility is to go from linear to quadratic: let's try to approximate $y=f(x)$ with a quadratic function, $y = a+bx+cx^2$; again, in order for the approximation to be very good, we want the error to go to zero faster than $x$ goes to $0$. And since now we have a parabola, we can require that the curvature of the parabola at $0$ be the same as the curvature of $y=f(x)$ at $0$ (this is what messes us up with $y=\cos x$: it has big curvature, but the local linear approximation is flat).
Now, for this parabola to approximate $y=f(x)$ well near $0$, we are going to want their local linear approximations to be the same (if they have different local linear approximations, then we can't expect $y=a+bx+cx^2$ to be "close to" $f(x)$). The local linear approximation to $y=a+bx+cx^2$ at $x=0$ is give by $a+bx$, so we want $a+bx = f'(0)x+f(0)$; so $a=f(0)$, and $b=f'(0)$.
To get the same curvature, we want $f''(0) = y''(0)$. Since $y''(0) = 2c$, we want $c = \frac{1}{2}f''(0)$. So our $y$ will be
$$y = f(0) + f'(0)x + \frac{f''(0)}{2}x^2.$$
In fact, this is indeed a very good approximation: it has the same value as $f(x)$ at $x=0$; the relative error goes to zero, since
$$\begin{align*}
\lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x + \frac{1}{2}f''(0)x^2)}{x} &=
\lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x)}{x} - \lim_{x\to 0}\frac{f''(0)x^2}{2x}\\
&= 0 - \lim_{x\to 0}\frac{1}{2}f''(0)x = 0.
\end{align*}$$
But more than that: the relative error between the derivatives goes to $0$:
$$\begin{align*}
\lim_{x\to 0}\frac{f'(x) - (f'(0)-f''(0)x)}{x} &= \lim_{x\to 0}\frac{f'(x)-f'(0)}{x}- f''(0)\\
&= f''(0)-f''(0) = 0.
\end{align*}$$
So not only does $y=f(0) + f'(0)x + \frac{1}{2}f''(0)x^2$ have the same value, and have the best possible relative error, the derivative is the best possible approximation to the derivative of $f(x)$; so the graph will have very similar shape near $0$.
If we then go on to a degree $3$ approximation, $y=a+bx+cx^2+dx^3$; if we want the relative error to go $0$, we are going to need $a=f(0)$, $b=f'(0)$. If we want the relative error in the derivative to go to $0$, we are going to need $c=\frac{1}{2}f''(0)$ as before. What if we want the relative error of the second derivative to go to $0$ as well? The second derivative of $y=a+bx+cx^2+dx^3$ is $2c+6dx = f''(0) + 6dx$. So we have:
$$\begin{align*}
\lim_{x\to 0}\frac{f''(x) - (f''(0)+6dx)}{x} &= \lim_{x\to 0}\frac{f''(x)-f''(0)}{x} - \lim_{x\to 0}6d\\
&= f'''(0) - 6d.
\end{align*}$$
For this to be equal to $0$, we need $6d = f'''(0)$, or $d = \frac{1}{6}f'''(0)$.
If we then go to a degree $4$ approximation and ask that the relative error, the relative error of the derivatives, the relative error of the second derivatives, and the relative errors of the third derivatives all go to $0$, then we find that we need the function $y= f(0) +f'(0)x + \frac{1}{2}f''(0)x^2 + \frac{1}{6}f'''(0)x^3 + ex^4$, where
$f^{(4)}(0) = 12e$, leading to $e = \frac{1}{12}f^{(4)}(0)$.
It is not hard to then see, inductively, that the coefficient we get for $x^n$ if we proceed this way will always be $\frac{f^{(n)}(0)}{n!}$.
You can repeat the same idea around any point $x_0$, but if you do it directly it turns out that you cannot easily use the coefficients you found for, say, the linear approximation, in the quadratic approximation; you end up with a system of equations instead of simply repeating the old coefficients. The simple way of "fixing" this is to shift everything to $0$, solve the problem at $0$, and then shift it back.
So if you want to solve the problem at $x_0$, we consider instead $g(x) = f(x+x_0)$, because then approximating $f$ near $x_0$ is the same as approximating $g$ at $0$. Moreover, $g'(x) = f'(x+x_0)(x+x_0)' = f'(x+x_0)$, $g''(x) = f''(x+x_0)$, etc. So, from the work above, we see that the degree $n$ approximation to $g$ near $0$ that has best possible relative error, best possible relative error between derivatives, best possible relative error between 2nd derivatives, etc. is
$$g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + \cdots + \frac{1}{n!}g^{(n)}(0)x^n.$$
But form the above, this is the same as
$$f(x_0) + f'(x_0)x + \frac{1}{2}f'(x_0)x^2 +\cdots + \frac{1}{n!}f^{(n)}(x_0)x^n.$$
This is the local approximation to $f(x+x_0)$ near $x=0$. Replacing $x$ with $x-x_0$, we get $f(x-x_0+x_0) = f(x)$, the "old $x$" was near $0$, so the new $x$ needs to be near $x_0$, and we get that for $x$ near $x_0$, we have
$$f(x)\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f'(x_0)(x-x_0)^2 + \cdots + \frac{1}{n!}f^{(n)}(x_0)(x-x_0)^n$$
exactly the formula for the $n$th Taylor polynomial approximation to $f(x)$ near $x_0$.
Sam L. raises the fair question of why we would want to assume that our approximations are polynomials. If you are going via the "make the relative error go to $0$; make the relative error of the relative error go to $0$", etc., you will discover that you are naturally led to polynomials.
Here's another motivation: when we encounter a function which we want to integrate and for which we are unable to find an antiderivative, we end up with two possible approaches:
Attempt to approximate the value of the integral via Riemann sums; in essence, the integral is a limit, so we can approximate the limit by using terms of the sequence whose limit we are trying to compute. This leads to left, right, midpoint, trapezoid, Simpson approximations, and other methods of numerical approximations to the integral using the function $f$.
Find a function $\mathcal{F}$ which approximates $f$ but is easy to integrate. (In fact, this is in part what is behind the Simpson rule approximation, in that it approximates the function $f$ in each subinterval by a quadratic function). Since polynomials are generally easy to integrate, trying to find polynomials that are good approximations to $f$ seems like a good idea; that is, the class of polynomials are a good proving ground to try to find "good approximations to $f$", because they are easy to integrate.
Best Answer
The proof is not correct. If a function if $C^{\infty}$ at $x_0$, that doesn't mean that its Taylor series converges fo $f$ in any neighborhood of $x_0$. For example, take
$$f(x)=\begin{cases}e^{-1/x^2}, & \text{if } \ x \neq 0; \\ 0,& \text{if } \ x=0\end{cases}.$$
It can be proven that $f^{k}(0)=0$ for every $k \in \Bbb{N}$. Therefore, the Taylor series of $f$ centered at $0$ converges in all real line, but of course, converges to the constan $0$, and not to $f$.
If the Taylor series of a $C^{\infty}$ function $f$ about $x_0$ converges to $f$ in some neighborhood fo $x_0$ the function $f$ is said to be analytic at $x_0$. Thus all analytic funtcions are $C^{\infty}$, but the converse is not true.
About the books: Rudin's Principles of Mathematical Analysis is the standard reference. There is also Royden's Real Analysis and Tao's Analysis I. If you know a little bit of Portuguese (or even Spanish would do) I strongly recommend Elon Lages' Curso de AnĂ¡lise, vol. I.