For simplicity, let us consider the case $n=2$, so that $f:\mathbb{R}^2\rightarrow \mathbb{R}^2$. Then $\phi:\mathbb{R}^2\rightarrow \mathbb{R}^2$ is given by
\begin{align}
\phi(x_1, x_2) =&\
\begin{pmatrix}
\phi_1(x_1, x_2)\\
\phi_2(x_1, x_2)
\end{pmatrix}\\
=&\
\begin{pmatrix}
x_1\\
x_2
\end{pmatrix}
+
\begin{pmatrix}
f_{1, x_1} (a_1, a_2) & f_{1, x_2}(a_1, a_2)\\
f_{2, x_1} (a_1, a_2) & f_{2, x_2}(a_1, a_2)
\end{pmatrix}^{-1}
\left(
\begin{pmatrix}
y_1\\
y_2
\end{pmatrix}
-
\begin{pmatrix}
f_1(x_1, x_2)\\
f_2(x_1, x_2)
\end{pmatrix}
\right)\\
=&\ \begin{pmatrix}
x_1\\
x_2
\end{pmatrix}- \frac{1}{\det Df(a)}\begin{pmatrix}
f_{2, x_2} (a_1, a_2)f_1(x_1, x_2) -f_{1, x_2}(a_1, a_2)f_2(x_1, x_2)\\
-f_{2, x_1} (a_1, a_2)f_1(x_1, x_2)+ f_{1, x_1}(a_1, a_2)f_2(x_1, x_2)
\end{pmatrix}
+\text{ constant vector},
\end{align}
where the constant vector is $Df(a)^{-1}y$, which does not depend on $x$. Differentiating each entry, we see that
\begin{align}
\nabla\phi(x_1, x_2) =&\
\begin{pmatrix}
\phi_{1, x_1} & \phi_{1, x_2}\\
\phi_{2, x_1} & \phi_{2, x_2}
\end{pmatrix}\\
=&\
\begin{pmatrix}
1 & 0\\
0 & 1
\end{pmatrix}
-\frac{1}{\det Df(a)}
\begin{pmatrix}
f_{2, x_2} (a_1, a_2)f_{1, x_1}(x_1, x_2) -f_{1, x_2}(a_1, a_2)f_{2, x_1}(x_1, x_2) & f_{2, x_2} (a_1, a_2)f_{1, x_2}(x_1, x_2) -f_{1, x_2}(a_1, a_2)f_{2, x_2}(x_1, x_2) \\
-f_{2, x_1} (a_1, a_2)f_{1, x_1}(x_1, x_2)+ f_{1, x_1}(a_1, a_2)f_{2, x_1}(x_1, x_2) & -f_{2, x_1} (a_1, a_2)f_{1, x_2}(x_1, x_2)+ f_{1, x_1}(a_1, a_2)f_{2, x_2}(x_1, x_2)
\end{pmatrix}\\
=&\
\begin{pmatrix}
1 & 0\\
0 & 1
\end{pmatrix}
-\frac{1}{\det Df(a)}\begin{pmatrix}
f_{2, x_2} (a_1, a_2)& -f_{1, x_2}(a_1, a_2)\\
-f_{2, x_1} (a_1, a_2)& f_{1, x_1}(a_1, a_2)
\end{pmatrix}
\begin{pmatrix}
f_{1, x_1} (x_1, x_2)& f_{1, x_2}(x_1, x_2)\\
f_{2, x_1} (x_1, x_2)& f_{2, x_2}(x_1, x_2)
\end{pmatrix}\\
=&\ I-Df(a_1, a_2)^{-1} Df(x_1, x_2).
\end{align}
Higher Dimension:
However, when $n$ is large, expanding every component and then taking partial derivatives is messy, so we compute the derivative more directly.
Write $A = Df(a)$. Since $A^{-1}$ is a fixed linear map, its derivative at every point is $A^{-1}$ itself, and $y$ is constant in $x$. The chain rule then gives
\begin{align}
D_x \phi(x) =&\ D_x x+ D_x [A^{-1}(y-f(x))]\\
=&\ I+ A^{-1}\circ D_x[y-f(x)]\\
=&\ I+ A^{-1}\circ(-Df(x)) = I-Df(a)^{-1} Df(x).
\end{align}
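As a sanity check, the formula $D\phi(x) = I - Df(a)^{-1}Df(x)$ can be compared against a finite-difference Jacobian. The sketch below is only an illustration: the map `f`, the points `a`, `x`, `y`, and the helper names `make_phi` and `numerical_jacobian` are all assumptions made up for this example, not part of the original argument.

```python
import numpy as np

def f(x):
    # Hypothetical smooth map R^2 -> R^2, chosen only for illustration.
    x1, x2 = x
    return np.array([x1 + 0.5 * np.sin(x2), x2 + 0.5 * x1 ** 2])

def Df(x):
    # Jacobian of the map above, computed by hand.
    x1, x2 = x
    return np.array([[1.0, 0.5 * np.cos(x2)],
                     [x1, 1.0]])

def make_phi(a, y):
    # phi(x) = x + A^{-1} (y - f(x)) with A = Df(a).
    A_inv = np.linalg.inv(Df(a))
    return lambda x: x + A_inv @ (y - f(x))

def numerical_jacobian(g, x, h=1e-6):
    # Central finite differences, one column per coordinate direction.
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (g(x + e) - g(x - e)) / (2 * h)
    return J

# Arbitrary sample points (assumptions for the demo).
a = np.array([0.3, -0.2])
y = np.array([0.1, 0.4])
x = np.array([0.35, -0.1])

phi = make_phi(a, y)
closed_form = np.eye(2) - np.linalg.inv(Df(a)) @ Df(x)
max_error = np.max(np.abs(numerical_jacobian(phi, x) - closed_form))
print(max_error)
```

The printed discrepancy should be at the level of finite-difference error. Evaluating at $x = a$ gives $D\phi(a) = I - A^{-1}A = 0$, which is the reason $\phi$ is a contraction near $a$.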
Best Answer
What Rudin really means is this: define $$u(t)=\cases{ \frac{f(t)-f(x)}{t-x}-f'(x) & if $t \ne x$, \\ 0 & if $t = x$. }$$ for $t$ near $x$. You can see that $u(t) \to 0$ as $t \to x$ by the definition of the derivative of $f$ at $x$. Clearly, $$f(t)-f(x)=(t-x)[f'(x)+u(t)]$$ as well.
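As a concrete illustration (an example chosen here, not taken from Rudin), let $f(t)=t^2$, so $f'(x)=2x$. For $t\ne x$, $$u(t)=\frac{t^2-x^2}{t-x}-2x=(t+x)-2x=t-x,$$ and $u(x)=0$ by definition, so $u(t)=t-x$ for all $t$ and $u(t)\to 0$ as $t\to x$. The identity then reads $$f(t)-f(x)=(t-x)[2x+(t-x)]=(t-x)(t+x)=t^2-x^2,$$ as expected.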