To simplify, let's discuss first what happens near $x_0 = 0$.
The way the local linear approximation works is that we are looking for the line that "best approximates" the function near $0$. This means that we want a function $y=a+bx$ with the property that the error between $y=a+bx$ and $y=f(x)$ goes to $0$ faster than $x$ does as $x$ approaches $0$. That is, we want
$$\lim_{x\to 0}\frac{f(x) - (a+bx)}{x} = 0.$$
A bit of work shows that the only $a$ and $b$ that can work are $a=f(0)$ and $b=f'(0)$:
$$\begin{align*}
\lim_{x\to 0}\frac{f(x)-(a+bx)}{x} &= \lim_{x\to 0}\frac{f(x)-f(0)+f(0)-(a+bx)}{x}\\
&= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-(a+bx)}{x}\\
&= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-a}{x} - \lim_{x\to 0}\frac{bx}{x}.
\end{align*}$$
If the function is differentiable, then the first limit is $f'(0)$; the last limit is $b$; and if we want the limit in the middle to exist, we need $f(0)=a$. So the only way this limit is equal to $0$ is if $b=f'(0)$ and $a=f(0)$.
This tells us that the line of "best approximation" to $f(x)$ near zero is $y = f(0) + f'(0)x$, which is precisely the local linear approximation.
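As a quick numerical sanity check (not part of the argument above; the sample function $f(x)=e^x$, with $f(0)=f'(0)=1$, is my choice), the relative error shrinks with $x$ for the tangent line but not for a line with the wrong slope:

```python
import numpy as np

f = np.exp  # sample choice: f(0) = 1, f'(0) = 1
for x in [0.1, 0.01, 0.001, 0.0001]:
    tangent = (f(x) - (1 + x)) / x      # a = f(0), b = f'(0): relative error -> 0
    wrong   = (f(x) - (1 + 2 * x)) / x  # wrong slope b = 2: relative error -> -1
    print(f"x = {x:>7}: tangent {tangent: .5f},  wrong slope {wrong: .5f}")
```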
(The fraction $\frac{f(x)-(a+bx)}{x}$ is the "relative error": the numerator measures the error, but by dividing it by $x$ we try to take into account how large the quantities involved are. If I tell you that I measured a distance and was off by 2 miles, you don't know whether the approximation was very good or very bad: it would be very good if I were measuring the distance to the moon, and very bad if I were measuring the length of my desk. The denominator tries to "normalize" the measurement, so that it tells us how big the error is relative to how big the thing we are measuring is.)
Now, the advantage of the local linear approximation is that it is very simple; the disadvantage is that it can get bad pretty quickly; for example, the local linear approximation of $y=\cos x$ at $0$ is $y=1$, which gets "bad" very quickly.
So we may want to approximate with something else which, while still easy, has a better chance of being a good approximation.
One possibility is to go from linear to quadratic: let's try to approximate $y=f(x)$ with a quadratic function, $y = a+bx+cx^2$; again, in order for the approximation to be very good, we want the error to go to zero faster than $x$ goes to $0$. And since now we have a parabola, we can require that the curvature of the parabola at $0$ be the same as the curvature of $y=f(x)$ at $0$ (this is what messes us up with $y=\cos x$: it has big curvature, but the local linear approximation is flat).
Now, for this parabola to approximate $y=f(x)$ well near $0$, we are going to want their local linear approximations to be the same (if they have different local linear approximations, then we can't expect $y=a+bx+cx^2$ to be "close to" $f(x)$). The local linear approximation to $y=a+bx+cx^2$ at $x=0$ is given by $a+bx$, so we want $a+bx = f(0)+f'(0)x$; that is, $a=f(0)$ and $b=f'(0)$.
To get the same curvature, we want $f''(0) = y''(0)$. Since $y''(0) = 2c$, we want $c = \frac{1}{2}f''(0)$. So our $y$ will be
$$y = f(0) + f'(0)x + \frac{f''(0)}{2}x^2.$$
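To see how much this buys us for the cosine example above, here is a small numerical comparison (the $x$ values are an arbitrary sample) between the linear approximation $y=1$ and the quadratic $y = 1 - \frac{x^2}{2}$, using $f''(0) = -\cos(0) = -1$:

```python
import numpy as np

for x in [0.5, 0.25, 0.1]:
    linear    = abs(np.cos(x) - 1.0)             # local linear approximation y = 1
    quadratic = abs(np.cos(x) - (1 - x**2 / 2))  # quadratic, using f''(0) = -1
    print(f"x = {x}: linear error {linear:.2e},  quadratic error {quadratic:.2e}")
```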
In fact, this is indeed a very good approximation: it has the same value as $f(x)$ at $x=0$, and the relative error goes to zero, since
$$\begin{align*}
\lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x + \frac{1}{2}f''(0)x^2)}{x} &=
\lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x)}{x} - \lim_{x\to 0}\frac{f''(0)x^2}{2x}\\
&= 0 - \lim_{x\to 0}\frac{1}{2}f''(0)x = 0.
\end{align*}$$
But more than that: the relative error between the derivatives goes to $0$:
$$\begin{align*}
\lim_{x\to 0}\frac{f'(x) - (f'(0)+f''(0)x)}{x} &= \lim_{x\to 0}\frac{f'(x)-f'(0)}{x}- f''(0)\\
&= f''(0)-f''(0) = 0.
\end{align*}$$
So not only does $y=f(0) + f'(0)x + \frac{1}{2}f''(0)x^2$ have the same value and the best possible relative error; its derivative is also the best possible approximation to the derivative of $f(x)$, so the graph will have a very similar shape near $0$.
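Here is a quick numerical illustration of that last point, again with $f(x)=\cos x$ (so the quadratic approximation is $q(x) = 1 - \frac{x^2}{2}$ and $q'(x) = -x$; the $x$ values are an arbitrary sample):

```python
import numpy as np

# relative error between f'(x) = -sin(x) and q'(x) = -x
for x in [0.1, 0.01, 0.001]:
    rel = (-np.sin(x) - (-x)) / x   # behaves like x**2/6, so it goes to 0
    print(f"x = {x}: relative error of derivatives {rel: .2e}")
```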
If we then go on to a degree $3$ approximation, $y=a+bx+cx^2+dx^3$, then: if we want the relative error to go to $0$, we are going to need $a=f(0)$ and $b=f'(0)$; if we want the relative error in the derivative to go to $0$, we are going to need $c=\frac{1}{2}f''(0)$ as before. What if we want the relative error of the second derivative to go to $0$ as well? The second derivative of $y=a+bx+cx^2+dx^3$ is $2c+6dx = f''(0) + 6dx$. So we have:
$$\begin{align*}
\lim_{x\to 0}\frac{f''(x) - (f''(0)+6dx)}{x} &= \lim_{x\to 0}\frac{f''(x)-f''(0)}{x} - \lim_{x\to 0}6d\\
&= f'''(0) - 6d.
\end{align*}$$
For this to be equal to $0$, we need $6d = f'''(0)$, or $d = \frac{1}{6}f'''(0)$.
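For instance, with $f(x) = \sin x$ (a sample choice of mine): $f'''(0) = -\cos(0) = -1$, so $d = -\frac{1}{6}$, which recovers the familiar cubic approximation $\sin x \approx x - \frac{x^3}{6}$.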
If we then go to a degree $4$ approximation and ask that the relative error, the relative error of the derivatives, the relative error of the second derivatives, and the relative errors of the third derivatives all go to $0$, then we find that we need the function $y= f(0) +f'(0)x + \frac{1}{2}f''(0)x^2 + \frac{1}{6}f'''(0)x^3 + ex^4$, where
$f^{(4)}(0) = 24e$ (since the third derivative of $ex^4$ is $24ex$), leading to $e = \frac{1}{24}f^{(4)}(0)$; note that $24 = 4!$.
It is not hard to then see, inductively, that the coefficient we get for $x^n$ if we proceed this way will always be $\frac{f^{(n)}(0)}{n!}$.
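If you want to check this symbolically, here is a small sketch (the sample function $e^x\cos x$ is an arbitrary choice) comparing the coefficients $\frac{f^{(n)}(0)}{n!}$ against a computer-algebra Taylor series:

```python
import sympy as sp
from math import factorial

x = sp.symbols('x')
f = sp.exp(x) * sp.cos(x)  # arbitrary smooth sample function

# coefficients predicted by the inductive argument: f^(n)(0) / n!
coeffs = [f.diff(x, n).subs(x, 0) / factorial(n) for n in range(6)]
print(coeffs)                  # [1, 1, 0, -1/3, -1/6, -1/30]
print(sp.series(f, x, 0, 6))   # the series has exactly these coefficients
```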
You can repeat the same idea around any point $x_0$, but if you do it directly it turns out that you cannot easily use the coefficients you found for, say, the linear approximation, in the quadratic approximation; you end up with a system of equations instead of simply repeating the old coefficients. The simple way of "fixing" this is to shift everything to $0$, solve the problem at $0$, and then shift it back.
So, if we want to solve the problem at $x_0$, we consider instead $g(x) = f(x+x_0)$, because then approximating $f$ near $x_0$ is the same as approximating $g$ at $0$. Moreover, $g'(x) = f'(x+x_0)(x+x_0)' = f'(x+x_0)$, $g''(x) = f''(x+x_0)$, etc. So, from the work above, we see that the degree $n$ approximation to $g$ near $0$ that has best possible relative error, best possible relative error between derivatives, best possible relative error between 2nd derivatives, etc. is
$$g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + \cdots + \frac{1}{n!}g^{(n)}(0)x^n.$$
But from the above, this is the same as
$$f(x_0) + f'(x_0)x + \frac{1}{2}f''(x_0)x^2 +\cdots + \frac{1}{n!}f^{(n)}(x_0)x^n.$$
This is the local approximation to $f(x+x_0)$ near $x=0$. Replacing $x$ with $x-x_0$ turns $f(x+x_0)$ into $f(x-x_0+x_0) = f(x)$; the "old $x$" was near $0$, so the new $x$ needs to be near $x_0$. Thus, for $x$ near $x_0$, we have
$$f(x)\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + \cdots + \frac{1}{n!}f^{(n)}(x_0)(x-x_0)^n,$$
which is exactly the formula for the $n$th Taylor polynomial approximation to $f(x)$ near $x_0$.
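As a concrete numerical check of the shifted formula (the function $\sqrt{x}$ and the center $x_0 = 4$ are sample choices), the degree-$2$ Taylor polynomial uses $f(4)=2$, $f'(4)=\frac{1}{4}$, and $f''(4)=-\frac{1}{32}$:

```python
import numpy as np

x0 = 4.0
def T2(x):
    # f(x0) + f'(x0)(x - x0) + f''(x0)/2 * (x - x0)**2 for f = sqrt, x0 = 4
    return 2 + (x - x0) / 4 - (x - x0) ** 2 / 64

for x in [4.5, 4.1, 4.01]:
    print(f"sqrt({x}) = {np.sqrt(x):.8f},  T2({x}) = {T2(x):.8f}")
```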
Sam L. raises the fair question of why we would want to assume that our approximations are polynomials. If you go via the route of "make the relative error go to $0$; make the relative error of the derivative go to $0$; etc.", you will discover that you are naturally led to polynomials.
Here's another motivation: when we encounter a function which we want to integrate and for which we are unable to find an antiderivative, we end up with two possible approaches:
1. Attempt to approximate the value of the integral via Riemann sums; in essence, the integral is a limit, so we can approximate the limit by using terms of the sequence whose limit we are trying to compute. This leads to the left, right, midpoint, trapezoid, and Simpson approximations, among other methods of numerically approximating the integral using the function $f$ itself.
2. Find a function $\mathcal{F}$ which approximates $f$ but is easy to integrate. (In fact, this is in part what is behind the Simpson rule approximation, in that it approximates the function $f$ in each subinterval by a quadratic function.) Since polynomials are generally easy to integrate, trying to find polynomials that are good approximations to $f$ seems like a good idea; that is, the class of polynomials is a good proving ground for finding "good approximations to $f$", because they are easy to integrate. A small sketch of this approach follows.
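For instance (a minimal sketch; the integrand $e^{-x^2}$ and the degree-$6$ cutoff are my sample choices), one can integrate the Taylor polynomial term by term and compare against a numerical reference:

```python
import numpy as np
from scipy.integrate import quad

# exp(-x**2) has no elementary antiderivative, but its Taylor polynomial
# 1 - x**2 + x**4/2 - x**6/6 integrates term by term on [0, 1]:
poly_integral = 1 - 1/3 + 1/10 - 1/42
reference, _ = quad(lambda t: np.exp(-t**2), 0, 1)
print(poly_integral)   # ~0.742857
print(reference)       # ~0.746824
```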
Apart from the obvious reason, that knowledge is valuable in itself, there are indeed practical engineering applications of multivariate calculus of order $n \geq 2$.
One example is the study of stability, which clearly benefits from Taylor expansions up to second order.
There are examples from computational methods too: if the engineer is using a Finite Element code and at some point along an analysis gets a "non-positive-definite Hessian" error, he would certainly benefit from knowing about second-order expansions to understand what is going on.
Another fitting example, in my view, is given by Laplace's method (alternatively, the saddle-point method) for approximating certain integrals, very common in Statistical Mechanics and not rare at all in Engineering. The method requires a second-order expansion by its very nature.
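One classic instance (my example, not part of the original discussion): applying Laplace's method to $n! = \int_0^\infty e^{n\ln x - x}\,\mathrm{d}x$ and expanding the exponent to second order about its maximum at $x=n$ yields Stirling's approximation:

```python
from math import e, factorial, pi, sqrt

# second-order (Laplace) approximation of n! = integral of exp(n*ln(x) - x)
for n in [5, 10, 20]:
    stirling = sqrt(2 * pi * n) * (n / e) ** n
    print(f"n = {n:2}: exact {factorial(n)},  Laplace/Stirling {stirling:.1f}")
```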
Talking about thermodynamics, how would one prove the principle of minimum energy to an engineer not accustomed to second-order expansions? With multivariate Taylor expansions it takes just a moment: expanding the entropy $S$, as a function of an internal parameter $X$ and the energy $U$, to second order, and using the equilibrium condition $\frac{\partial S}{\partial X} = 0$, we get
$$ \mathrm{d}S \approx \frac{1}{2}\frac{\partial^2 S}{\partial X^2}\,\mathrm{d}X^2 + \frac{\partial S}{\partial U}\,\mathrm{d}U, $$
and the minimum nature of $U$ is on display.
Moreover, referring to elasticity as in your post, the elastic moduli are related to the strain energy function $W$ via $$ \mathcal{C} = \frac{\partial^2 W }{\partial E^2} $$ where $E$ stands for a deformation tensor, so a second-order expansion of the strain energy is needed.
In $1$D the "first-order only" engineer could avoid the problem by looking at stress-strain curves alone. Indeed, in $1$D,
$$ \sigma = \frac{\mathrm{d} W}{\mathrm{d} \epsilon},$$
where $\epsilon$ stands for the strain. Since for small enough strains $\sigma = E \epsilon$, in this case only first-order considerations are needed to estimate the elastic modulus (in fact, the experiment itself performs the differentiation).
But in a more general case, dealing for example with a biaxial test (and this is not an academic example for modern engineers busy with polymers), it can often be easier to start from the energy $W$, perform a multivariate expansion up to second order, and get the elastic moduli matrix out of it.
The difference stems from practical reasons: in a $1D$ test one stretches in one direction, and what happens in the other directions is easy to characterise.
In a biaxial test this is far more complex: it is difficult to impose one strain value along a given direction, while the strain along the perpendicular direction is varied. In other words, it is difficult to "perform partial derivatives" experimentally.
All this can be especially useful, and relevant, when a component is loaded by a constant stress state on top of which small oscillations are superimposed. It is then useful to re-define the constant stress state as the "unloaded" state and calculate "fictitious" elastic moduli.
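As a toy sketch of the energy-based workflow described above (the quadratic, Saint-Venant-style energy below is hypothetical, chosen only so the answer comes out in closed form), the moduli matrix is the Hessian of $W$:

```python
import sympy as sp

# a hypothetical quadratic strain-energy density in two principal strains;
# lam and mu play the role of Lamé-type constants
E1, E2, lam, mu = sp.symbols('E1 E2 lam mu')
W = lam / 2 * (E1 + E2) ** 2 + mu * (E1 ** 2 + E2 ** 2)

# elastic moduli matrix = Hessian of W, evaluated at the chosen "unloaded" state
C = sp.hessian(W, (E1, E2)).subs({E1: 0, E2: 0})
sp.pprint(C)   # [[lam + 2*mu, lam], [lam, lam + 2*mu]]
```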
For a good geometric understanding of the difference between $\Delta r$ and $dr$, look at how a Riemann sum becomes a Riemann integral: the width of each column starts as a $\Delta r$, but shrinks toward a $dr$ as the point spacing becomes finer.
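A minimal numeric illustration (with $f(r) = r^2$ on $[0,1]$ as a sample choice): as $N$ grows, the column width $\Delta r = 1/N$ shrinks and the left Riemann sum approaches the integral, $\frac{1}{3}$.

```python
import numpy as np

for N in [10, 100, 1000, 10000]:
    r = np.arange(N) / N               # left endpoints, spacing Δr = 1/N
    print(f"N = {N:5}: left sum = {np.sum(r**2) / N:.6f}")  # -> 1/3
```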