The fundamental intuition is that it doesn't matter which manifold you do your calculations in, you get the same result either way.
This was already clear in the case of coordinate charts; calculus on a manifold is often defined in terms of what you get using coordinates to map the problem over to Euclidean space. The point is that the idea extends to more general manifolds than just Euclidean space.
Given a vector on $M_1$ and a scalar field on $M_2$, there are two ways you might combine them to get a directional derivative: either pull the problem back to $M_1$ or push it forward to $M_2$. The identity you cite is the one that asserts you get the same answer both ways.
Anyways, I think there is an extremely compelling algebraic rationale for this.
Suppose you are doing calculus in one variable $x$, then later you decide you need a second independent variable $y$. This changes absolutely nothing about the calculations you've done — if $y$ doesn't appear in any of your calculations, everything is as if it didn't exist at all.
I.e. $\mathrm{d} \sin(x) = \cos(x) \mathrm{d} x$ is always true; it doesn't matter whether or not you have any other variables and whether or not $x$ is dependent with any of them.
Consider the case where $\phi$ is the projection onto the first component map $\mathbb{R}^2 \to \mathbb{R}$, using standard coordinates on both.
The pullback $\phi^*$ on scalar fields is precisely the "add in the variable $y$" operation. The push forward $\phi_*$ on vectors expresses the fact only the $x$-direction matters. It's clear that if we have $v \in T\mathbb{R}^2$ and $f \in \mathcal{C}^1(\mathbb{R})$, then we expect
$$ (\phi_* v)(f) = v(\phi^* f) $$
because both formulas are expressing exactly the same operation.
Assume $\text{dim}\ M=\text{dim}\ N=n.$ In what follows, we use the Einstein convention for all sums.
The point is that if $(U_p, \phi_p)$ is a chart about $p$ in $M$ and $(V_{f(p)}, \psi_{f(p)})$ is a chart about $f(p)$ in $N$ then $(f_*)_p: T_pM \to T_{f(p)}N$ is a linear transformation, so it has a matrix representation in the coordinates defined by $\phi$ and $\psi$. If we can show that this matrix is the Jacobian of $\hat f:=\psi_{f(p)} \circ f \circ \phi_p^{-1}$, which is $\left(\frac{\partial \hat f^j}{\partial r^i}\right)_{ij},$ then $\hat f$ will be a local diffeomorphism, which is what we want.
But, $\textit{by definition},\ \frac{\partial }{\partial x^i}=(\phi_*)^{-1}\frac{\partial }{\partial r^i}$, where $(r^i)$ are the usual Euclidean coordinates. Similarly, $\frac{\partial }{\partial y^i}=(\psi_*)^{-1}\frac{\partial }{\partial s^i}$ where we use $(s^i)$ to represent the Euclidean coordinates in the range of $\hat f$ just to make the calculations easier to follow. For the same reason, we drop the subscripts $p$ and $f(p).$ Finally, we note that by the chain rule in $\mathbb R^n,\ \hat f_*\frac{\partial }{\partial r^i}=\frac{\partial \hat f^j}{\partial r^i}\frac{\partial}{\partial s^j},$ where the $(\hat f^j)$ are the components of $\hat f$. Then, we calculate
$f_*\frac{\partial }{\partial x^i}=f_*\circ (\phi_*)^{-1}\frac{\partial }{\partial r^i}=(f\circ \phi^{-1})_*\frac{\partial }{\partial r^i}=$
$(\psi^{-1}\circ \hat f)_*\frac{\partial }{\partial r^i}=(\psi_*)^{-1}\circ \hat f_*\frac{\partial }{\partial r^i}=$
$(\psi^{-1})_*\frac{\partial \hat f^j}{\partial r^i}\frac{\partial}{\partial s^j}=\frac{\partial \hat f^j}{\partial r^i}(\psi^{-1})_*\frac{\partial}{\partial s^j}=\frac{\partial \hat f^j}{\partial r^i}\frac{\partial}{\partial y^j}$.
It follows that the matrix of $f_*$ is the Jacobian of $\hat f,$ as desired.
Best Answer
We are dealing with manifolds, and we should convert all mappings and differentials between manifolds to those between open sets in Euclidean space. You've given that $\mathrm d\varphi_p$ is an isomorphism, then by choosing charts [or parametrization as M. do Carmo used] $\mathrm d\varphi_p$ is totally characterized by $\mathrm d(y^{-1} \circ \varphi \circ x)_{x^{-1}(p)}$ by definition, which is the differential of a mapping between $\mathbb R^n$ and $\mathbb R^n$. So equivalently you are given that $\mathrm d(y^{-1}\circ \varphi \circ x)_{x^{-1}(p)}\colon \mathbb R^n \to \mathbb R^n$ is an isomorphism. Now apply IFT to $(y^{-1}\circ \varphi \circ x)$.
Thus we have a local diffeomorphism $\alpha \colon U \to \alpha (U)$. From now on, no chain rules needed: $\alpha $ and $\alpha^{-1}$ are invertible and differentiable. Use this we construct $$ \psi : =y \circ \alpha \circ x^{-1} \colon x(U) \to y^{-1}(\alpha (U)). $$ This is differentiable by definition: $y^{-1} \circ \psi \circ x = \alpha $ as we shown above. Also it is invertible since $x \circ \alpha^{-1} \circ y^{-1} =: \psi^{-1}$ is a composition of bijections. This inverse $\psi^{-1}$ is differentiable since $x^{-1} \circ \psi^{-1} \circ y = \alpha^{-1}$ is.