As some people on this site might be aware I don't always take downvotes well. So here's my attempt to provide more context to my answer for whoever decided to downvote.
Note that I will confine my discussion to functions $f: D\subseteq \Bbb R \to \Bbb R$ and to ideas that should be simple enough for anyone who's taken a course in scalar calculus to understand. Let me know if I haven't succeeded in some way.
First, it'll be convenient for us to define a new notation. It's called "little oh" notation.
Definition: A function $f$ is called little oh of $g$ as $x\to a$, denoted $f\in o(g)$ as $x\to a$, if
$$\lim_{x\to a}\frac {f(x)}{g(x)}=0$$
Intuitively this means that $f(x)$ becomes small faster than $g(x)$ does as $x\to a$ (or, in examples like the third below where $x\to\infty$, that $f$ grows more slowly than $g$).
Here are some examples:
- $x\in o(1)$ as $x\to 0$
- $x^2 \in o(x)$ as $x\to 0$
- $x\in o(x^2)$ as $x\to \infty$
- $x-\sin(x)\in o(x)$ as $x\to 0$
- $x-\sin(x)\in o(x^2)$ as $x\to 0$
- $x-\sin(x)\not\in o(x^3)$ as $x\to 0$
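These ratios are easy to check numerically. A minimal sketch (the sample points are arbitrary choices); since $x-\sin(x) = x^3/6 - x^5/120 + \cdots$, the first two ratios shrink toward $0$ while the third tends to $1/6$:

```python
import math

# Ratios (x - sin x) / x^k as x -> 0, for k = 1, 2, 3.
# The first two tend to 0, so x - sin(x) is o(x) and o(x^2);
# the third tends to 1/6, so x - sin(x) is NOT o(x^3).
for x in [0.1, 0.01, 0.001]:
    num = x - math.sin(x)
    print(x, num / x, num / x**2, num / x**3)
```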
Now what is an affine approximation? (Note: I prefer to call it affine rather than linear -- if you've taken linear algebra then you'll know why.) It is simply a function $T(x) = A + Bx$ that approximates the function in question.
Intuitively it should be clear which affine function should best approximate the function $f$ very near $a$. It should be $$L(x) = f(a) + f'(a)(x-a).$$ Why? Well consider that any affine function really only carries two pieces of information: slope and some point on the line. The function $L$ as I've defined it has the properties $L(a)=f(a)$ and $L'(a)=f'(a)$. Thus $L$ is the unique line which passes through the point $(a,f(a))$ and has the slope $f'(a)$.
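As a quick numerical sanity check of this claim (my choices of $f=\exp$ and $a=0$ are arbitrary), the error of $L$ shrinks faster than the distance to $a$ itself:

```python
import math

def L(x, f, fprime, a):
    """Best affine approximation of f at a: L(x) = f(a) + f'(a)(x - a)."""
    return f(a) + fprime(a) * (x - a)

# Arbitrary sample: f = exp (so f' = exp), approximated at a = 0.
a = 0.0
for h in [0.1, 0.01, 0.001]:
    err = math.exp(a + h) - L(a + h, math.exp, math.exp, a)
    print(h, err, err / h)  # err/h shrinks toward 0: the error is o(x - a)
```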
But we can be a little more rigorous. Below I give a lemma and a theorem that tell us that $L(x) = f(a) + f'(a)(x-a)$ is the best affine approximation of the function $f$ at $a$.
Lemma: If a differentiable function $f$ can be written, for all $x$ in some neighborhood of $a$, as $$f(x) = A + B\cdot(x-a) + R(x-a)$$ where $A, B$ are constants and $R\in o(x-a)$, then $A=f(a)$ and $B=f'(a)$.
Proof: First notice that because $f$, $A$, and $B\cdot(x-a)$ are continuous at $x=a$, $R$ must be too. Then setting $x=a$ we immediately see that $f(a)=A$.
Then, rearranging the equation we get (for all $x\ne a$)
$$\frac{f(x)-f(a)}{x-a} = \frac{f(x)-A}{x-a} = \frac{B\cdot (x-a)+R(x-a)}{x-a} = B + \frac{R(x-a)}{x-a}$$
Then taking the limit as $x\to a$ we see that $B=f'(a)$. $\ \ \ \square$
Theorem: A function $f$ is differentiable at $a$ iff, for all $x$ in some neighborhood of $a$, $f(x)$ can be written as
$$f(x) = f(a) + B\cdot(x-a) + R(x-a)$$ where $B \in \Bbb R$ and $R\in o(x-a)$.
Proof: "$\implies$": If $f$ is differentiable then $f'(a) = \lim_{x\to a} \frac{f(x)-f(a)}{x-a}$ exists. This can alternatively be written $$f'(a) = \frac{f(x)-f(a)}{x-a} + r(x-a)$$ where the "remainder function" $r$ has the property $\lim_{x \to a} r(x-a)=0$. Rearranging this equation we get $$f(x) = f(a) + f'(a)(x-a) -r(x-a)(x-a).$$ Let $R(x-a):= -r(x-a)(x-a)$. Then clearly $R\in o(x-a)$ (confirm this for yourself). So $$f(x) = f(a) + f'(a)(x-a) + R(x-a)$$ as required.
"$\impliedby$": Simple rearrangement of this equation yields
$$B + \frac{R(x-a)}{x-a}= \frac{f(x)-f(a)}{x-a}.$$ The limit as $x\to a$ of the LHS exists and thus the limit also exists for the RHS. This implies $f$ is differentiable by the standard definition of differentiability. $\ \ \ \square$
Taken together, the above lemma and theorem tell us not only that $L(x) = f(a) + f'(a)(x-a)$ is the only affine function whose remainder tends to $0$ as $x\to a$ faster than $x-a$ itself (this is the sense in which the approximation is best), but also that we can define the concept of differentiability by the existence of this best affine approximation.
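To see this uniqueness concretely, compare the remainder of $L$ with that of another line through $(a, f(a))$ with a different slope. A minimal sketch, with $f=\sin$, $a=0$, and the "wrong" slope $1.1$ all chosen arbitrarily:

```python
import math

# Arbitrary sample: f = sin at a = 0, where the best slope is cos(0) = 1;
# 1.1 is an arbitrary competing slope.
f, fprime, a = math.sin, math.cos, 0.0

for h in [0.1, 0.01, 0.001]:
    x = a + h
    best  = f(x) - (f(a) + fprime(a) * h)  # remainder of L
    other = f(x) - (f(a) + 1.1 * h)        # remainder of a different line
    print(h, best / h, other / h)
# best/h -> 0, so L's remainder is o(x - a);
# other/h -> -0.1, a nonzero constant, so the other line's remainder is not.
```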
To keep things simple, let's stick to $3$ dimensions, so everything looks like familiar "ordinary vector calculus".
If $\sigma(t) = (x(t),y(t),z(t))$ is a curve, then its tangent vector is $\sigma'(t) = (x'(t),y'(t),z'(t))$. Let $p=\sigma(0)$, and $v = \sigma'(0)$.
By the chain rule from "ordinary vector calculus", the right-hand side of the equation you give above is:
$$ \frac{d}{dt} f(\sigma(t)) = \frac{d}{dt}f(x(t),y(t),z(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} $$
and then of course evaluate at $t=0$. The left-most term in my equation above, $\left. \frac{d}{dt}f(\sigma(t)) \right|_{t=0}$, is by assumption the vector $v$ acting on the function $f$. On the other hand, the right-most part of my equation above looks like a dot product:
$$ \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} = \nabla f \cdot \sigma' $$
This is also the "ordinary vector calculus" definition of the directional derivative in the direction $v$:
$$ D_v f(p) = \nabla f(p) \cdot v $$
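The agreement between the two sides can be checked numerically. A minimal sketch, where the scalar field $f$ and the curve $\sigma$ are arbitrary sample choices:

```python
import math

# Arbitrary sample scalar field and curve.
def f(x, y, z):
    return x**2 * y + math.sin(z)

def sigma(t):
    return (1.0 + t, 2.0 * t, t)  # p = sigma(0) = (1, 0, 0), v = sigma'(0) = (1, 2, 1)

# Left side: d/dt f(sigma(t)) at t = 0, via a central difference.
h = 1e-6
lhs = (f(*sigma(h)) - f(*sigma(-h))) / (2 * h)

# Right side: grad f = (2xy, x^2, cos z), which at p = (1, 0, 0) is (0, 1, 1),
# dotted with v = (1, 2, 1).
grad_p = (0.0, 1.0, 1.0)
v = (1.0, 2.0, 1.0)
rhs = sum(g * vi for g, vi in zip(grad_p, v))

print(lhs, rhs)  # both approximately 3
```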
Best Answer
I will try to answer this question myself. (I don't know if this is the "right" answer, but I will throw it out here since it's better than nothing.)
What I want is to derive the concepts of derivative and differential using only the concepts of limit and linear approximation. As I mentioned in my [Step 3], if I just want to approximate the value of $f(x)$ by the linear equation: $$f(x)=f(a)+A(x-a)+E=f(a)+A\Delta x+E\ \ \ \ \ \ (1)$$ then there are infinitely many choices of $A$ for me to pick.
So the question becomes: what kind of $A$ do I actually want? Or, what kind of $A$ is "nice" enough?
Now there are two goals of linear approximation (which I want $A$ to satisfy):

1. The error $E$ of the approximation should be "small".
2. There should be an operation I can perform to improve the accuracy of the approximation.
[Consider the first goal above]: "small" with respect to what? There are three terms on the right side of equation (1); since $f(a)$ is a constant, the only two terms that affect the accuracy of my approximation are $A\Delta x$ and $E$. That is to say, I want $E$ to be "small" with respect to $A\Delta x$: the value of the fraction $$\frac{E}{A\Delta x}\ \text{should be very small}.$$
[Consider the second goal above]: I realize that $\frac{E}{A\Delta x}$ cannot always be very small. What I am looking for is an operation that makes it smaller (or larger), so I can tell a person or a computer what to do (or not to do) to improve the accuracy.
Now, since the value of $E$ depends on $\Delta x$ for different choices of $x$, the only two options I have are to let $\Delta x \to 0$ or $\Delta x \to \infty$. Any other option would likely involve letting $\Delta x$ be some complicated function of $x$, which not only defeats the purpose of linear approximation (for example, if $\Delta x$ were a parabolic function of $x$, why bother doing a linear approximation in the first place? I should just do a parabolic approximation!) but would also likely involve more than one operation during the approximation (which is not good if there are too many operations a person or a computer needs to carry out).
So now I need to evaluate the two operations above. Since a linear approximation is meant to describe $f$ near the point $a$, letting $\Delta x \to \infty$ moves me away from $a$, while letting $\Delta x \to 0$ keeps the approximation local; so $\Delta x \to 0$ is the natural choice.

With the above considerations, I can now develop the requirement on $A$:
Assume $\lim\limits_{\Delta x \to 0}\frac{E}{A\Delta x}$ exists. What I want is for $\frac{E}{A\Delta x}$ to become smaller as $\Delta x \to 0$, so at best I should expect this limit to be zero (and the "nicest" $A$ should achieve this value). That is (assuming $A\ne 0$): $$\lim\limits_{\Delta x \to 0}\frac{E}{A\Delta x}=\frac{1}{A}\lim\limits_{\Delta x \to 0}\frac{E}{\Delta x}=0,$$ i.e. $$\lim\limits_{\Delta x \to 0}\frac{E}{\Delta x}=0.$$ Now I can say $E$ is a higher-order infinitesimal of $\Delta x$, and I can express $E$ as $$E=\epsilon\Delta x,\ \text{where } \lim\limits_{\Delta x \to 0}\epsilon=0.$$ Substituting $E=\epsilon\Delta x$ back into equation (1) above, I have $$f(x)=f(a)+A\Delta x+\epsilon\Delta x,$$ $$A+\epsilon=\frac{f(x)-f(a)}{\Delta x}.$$ And it is not hard to see that $$\lim\limits_{\Delta x \to 0}(A+\epsilon)=\lim\limits_{\Delta x \to 0}\frac{f(x)-f(a)}{\Delta x},$$ $$\lim\limits_{\Delta x \to 0}A + \lim\limits_{\Delta x \to 0}\epsilon=\lim\limits_{\Delta x \to 0}\frac{f(x)-f(a)}{\Delta x},$$ and since $A$ does not depend on $\Delta x$, $$A=\lim\limits_{\Delta x \to 0}\frac{f(x)-f(a)}{\Delta x}.$$ Now I can define $A$ to be the derivative, and $A\Delta x$ and $\Delta x$ to be the differentials. I can also claim that for each point $(a,f(a))$ in the interval such an $A$ is unique, by the uniqueness of the limit $\lim\limits_{\Delta x \to 0}\frac{f(x)-f(a)}{\Delta x}$.
(I can also claim that this unique $A$ is the "nicest" $A$, because it is the only one that satisfies the minimal requirement above.)
Thus I have successfully brought in the concepts of derivative and differential using only the concepts of limit and linear approximation. And the function is said to be differentiable at $a$ when there exists an $A$ such that $$\lim\limits_{\Delta x \to 0}\frac{E}{\Delta x}=0.$$
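This criterion can be checked numerically. A minimal sketch, where $f=\exp$ and $a=1$ are arbitrary sample choices (so the derivative is $A=e$):

```python
import math

# Arbitrary sample: f = exp at a = 1, so A = f'(a) = e.
f, a = math.exp, 1.0
A = math.exp(a)

for dx in [0.1, 0.01, 0.001]:
    E = f(a + dx) - f(a) - A * dx  # remainder in f(x) = f(a) + A*dx + E
    print(dx, E / dx)              # epsilon = E/dx shrinks toward 0
```

With any other constant in place of $A$, the ratio $E/\Delta x$ would instead tend to a nonzero constant.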