This is a good question, and an easy misconception to fall into!
The differential maps $T_p \mathbb{R} \to T_{f(p)} \mathbb{R}$ for each fixed point $p$. But the tangent space of $\mathbb{R}$ at any point is just $\mathbb{R}$, and this makes it easy to get confused about which is which. It's also confusing that differentials are functions that return functions! Remember
$$df_{-} : \mathbb{R} \to \big ( T \mathbb{R} \to T \mathbb{R} \big )$$
That is, for every point $p \in \mathbb{R}$, we have a (distinct!) function
$df_p : T_p \mathbb{R} \to T_{f(p)} \mathbb{R}$. The outer function $p \mapsto df_p$ is allowed to be any smooth function (in particular, it can be highly nonlinear!). It is only the inner function $v \mapsto df_p(v)$ that must be linear.
In particular, say $f(p) = 3p^3$. Then you're exactly right, $df = 9p^2 dx$. But what does this mean?
It means for any individual point $p$ we get a map $df_p : T_p \mathbb{R} \to T_{3p^3} \mathbb{R}$. And what is that map?
$$df_p(v) = (9p^2) v$$
This is just multiplication by a scalar (which is linear)! What's confusing is that the choice of scalar depends (nonlinearly) on $p$.
So, as an example:
- $df_1(v) = 9v$
- $df_2(v) = 36v$
- etc.
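
If it helps to see the two kinds of dependence side by side, here is a minimal Python sketch. The helper name `df` is made up for illustration, and tangent vectors are identified with plain numbers:

```python
def df(p):
    """Differential of f(x) = 3x^3 at the point p, returned as a function of v."""
    scale = 9 * p**2  # f'(p): just a number once p is fixed
    return lambda v: scale * v

df_2 = df(2)

# Inner function: linear in v, so df_2(a*v + b*w) == a*df_2(v) + b*df_2(w).
a, b, v, w = 2, -3, 5, 7
linear_in_v = df_2(a * v + b * w) == a * df_2(v) + b * df_2(w)

# Outer function: NOT linear in p, so df(1 + 2) differs from df(1) + df(2).
linear_in_p = df(1 + 2)(1) == df(1)(1) + df(2)(1)

print(df(1)(1), df_2(1))         # 9 36
print(linear_in_v, linear_in_p)  # True False
```

The check in `p` fails because $9 \cdot 3^2 = 81$ while $9 \cdot 1^2 + 9 \cdot 2^2 = 45$.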
In general, if you have a smooth function $f : \mathbb{R} \to \mathbb{R}$, then
$df_p$ is the linear function which scales by $f'(p)$ (which is just a number).
In even more generality, if you have a smooth function $f : \mathbb{R}^n \to \mathbb{R}^m$, then you may remember we have a Jacobian matrix $J$ whose entries are functions.
So $df = J$ is a matrix of functions, but when we fix a point $p$, the matrix
$df_p = \left . J \right |_p$ has regular old numerical entries. And this $\left . J \right |_p$ is a linear map $T_p \mathbb{R}^n \to T_{f(p)} \mathbb{R}^m$ (of course, this happens to be the same thing as a linear map $\mathbb{R}^n \to \mathbb{R}^m$, but that identification isn't available for arbitrary manifolds, so it's useful to keep the distinction between $\mathbb{R}^n$ and $T_p \mathbb{R}^n$ in your mind, even though they happen to be the same in this simple case).
Edit:
Let's take a highly nonlinear function like $\sin(x) : \mathbb{R} \to \mathbb{R}$. Afterwards let's take a nonlinear function from $\mathbb{R}^2 \to \mathbb{R}$ so that we can see a matrix as well.
Then $d\sin(x)_p = \cos(p)dx$. So for any fixed point $p$, say $p = \pi$, we get a linear map
$$d\sin(x)_\pi = v \mapsto \cos(\pi) v$$
that is, since $\cos(\pi) = -1$,
$$d\sin(x)_\pi = v \mapsto -v$$
which is linear.
Indeed, for any point $p$ you'll get a linear map which comes from scaling $v$ by $\cos(p)$ (which, for fixed $p$, is just a number).
So $$d\sin(x)_1 \approx v \mapsto 0.54 v$$
(which is linear).
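
One way to convince yourself that $v \mapsto \cos(p) v$ really is "the" linear map attached to $\sin$ at $p$ is to compare it against a difference quotient. This is only a rough numerical sketch: the function names and the step size `h` are my own choices, and tangent vectors are again treated as plain floats.

```python
import math

def dsin(p, v):
    """Differential of sin at p applied to v: scale v by cos(p)."""
    return math.cos(p) * v

def difference_quotient(p, v, h=1e-6):
    """Central-difference estimate of how sin changes at p in the direction v."""
    return (math.sin(p + h * v) - math.sin(p - h * v)) / (2 * h)

for p in (math.pi, 1.0):
    for v in (1.0, -2.5):
        # The two printed values should agree to several decimal places.
        print(p, v, dsin(p, v), difference_quotient(p, v))
```

For each fixed `p` the first value is exactly `cos(p) * v`, and the second is an approximation that gets better as `h` shrinks (up to floating-point roundoff).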
What about in higher dimensions? Let's look at
$$f(x,y) = x^2y$$
Then $df_{p} = df_{(x,y)}$ is the Jacobian:
$$
df_{(x,y)}
= \left [ \frac{\partial}{\partial x} f \quad \frac{\partial}{\partial y} f \right ]
= \left [ 2xy \quad x^2 \right ]
$$
Notice the entries of this matrix are nonlinear in the choice of point
$p = (x,y)$. However, once we fix a point, say $p = (x,y) = (2,3)$:
$$
df_{(2,3)} = [12 \quad 4]
$$
which is a linear map from $T_{(2,3)}\mathbb{R}^2 \to T_{f(2,3)}\mathbb{R}$.
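
Here is the same computation as a small Python sketch, with tangent vectors in $T_{(2,3)}\mathbb{R}^2$ written as pairs. The helper names `jacobian_f` and `apply_row` are invented for this example:

```python
def jacobian_f(x, y):
    """Jacobian of f(x, y) = x^2 * y as a 1x2 row [df/dx, df/dy]."""
    return [2 * x * y, x**2]

def apply_row(row, v):
    """Apply a 1xn row matrix to a vector v (a dot product)."""
    return sum(r * c for r, c in zip(row, v))

J = jacobian_f(2, 3)  # entries are plain numbers once the point is fixed
e1, e2 = (1, 0), (0, 1)
u = (2, 5)            # u = 2*e1 + 5*e2

print(J)                                  # [12, 4]
print(apply_row(J, e1), apply_row(J, e2)) # 12 4
# Linearity in the tangent vector: J(2*e1 + 5*e2) == 2*J(e1) + 5*J(e2)
print(apply_row(J, u) == 2 * apply_row(J, e1) + 5 * apply_row(J, e2))  # True
```

Changing the point passed to `jacobian_f` changes the row nonlinearly, but for any one row, `apply_row` is linear in the vector.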
I hope this helps ^_^
Best Answer
Keep in mind that this is an equality only if we take into account some identifications between objects involved in the question. Since the differential is a map between tangent spaces, $g_{*f(p)}(f_{*p}(X))$ is an element of $T_{g(f(p))}\mathbb{R}$, while $X(g \circ f)$ is an element of $\mathbb{R}$, since $X$ is a tangent vector of $M$ at $p$.
What happens is that we have a natural identification between $T_{t_0}\mathbb{R}$ and $\mathbb{R}$ for any real number $t_0$ given by the map $$ t \mapsto t \cdot \frac{d}{dt}\Bigg\rvert_{t_0}. $$
What we need to show then, to understand what the book means by the equality mentioned, is that $$ g_{*f(p)}(f_{*p}(X)) = X(g \circ f) \cdot \frac{d}{dt}\Bigg\rvert_{g(f(p))}. $$
Since $T_{g(f(p))}\mathbb{R}$ is a 1-dimensional vector space generated by $d/dt\rvert_{g(f(p))}$, we know that $$ g_{*f(p)}(f_{*p}(X)) = \alpha \cdot \frac{d}{dt}\Bigg\rvert_{g(f(p))} $$ for some constant $\alpha \in \mathbb{R}$. To find out the value of the constant, we evaluate the two sides on a certain "test function" so as to simplify the equations. In this case, we can take this test function to be the identity function $1_{\mathbb{R}}$.
To evaluate the left side, we just use the definition of the differential along with the Chain Rule $$ g_{*f(p)}(f_{*p}(X))(1_\mathbb{R}) = (g \circ f)_{*p}(X)(1_\mathbb{R}) = X(1_\mathbb{R} \circ g \circ f) = X(g \circ f). $$
To evaluate the right side, note that $\frac{d}{dt}\big\rvert_{g(f(p))}$ applied to $1_\mathbb{R}$ is the ordinary derivative of $t \mapsto t$, which is $1$, so the right side evaluates to $\alpha$. Comparing the two sides, we obtain $\alpha = X(g \circ f)$, and so we have the equality $$ g_{*f(p)}(f_{*p}(X)) = X(g \circ f) \cdot \frac{d}{dt}\Bigg\rvert_{g(f(p))}, $$ so $g_{*f(p)}(f_{*p}(X)) = X(g \circ f)$ if we consider the identification between $T_{g(f(p))}\mathbb{R}$ and $\mathbb{R}$.
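
As a sanity check on the coefficient $\alpha$, here is a rough numerical sketch in one concrete case. Everything specific in it is an assumption made for illustration: $M = \mathbb{R}^2$, $f(x,y) = x^2 y$, $g(t) = t^2$, $p = (2,3)$, and $X$ taken to be the directional derivative along $(1,-1)$, approximated by a central difference.

```python
def f(x, y):
    return x**2 * y

def g(t):
    return t**2

def X(func, p, direction, eps=1e-5):
    """Tangent vector at p acting on a function of (x, y): a central-difference
    approximation of the directional derivative along `direction`."""
    (x, y), (a, b) = p, direction
    forward = func(x + eps * a, y + eps * b)
    backward = func(x - eps * a, y - eps * b)
    return (forward - backward) / (2 * eps)

p, direction = (2.0, 3.0), (1.0, -1.0)

# alpha = X(g o f): let X act directly on the composite function.
alpha = X(lambda x, y: g(f(x, y)), p, direction)

# Coefficient of g_*(f_*(X)): push X forward through f, then through g.
# f_*(X) has coefficient X(f); g_* multiplies it by g'(f(p)) = 2 * f(p).
coefficient = 2 * f(*p) * X(f, p, direction)

print(alpha, coefficient)  # both should be close to 192
```

By hand, $X(f) = 12 - 4 = 8$ and $g'(12) = 24$, so the pushed-forward coefficient is $192$, which matches differentiating $(g \circ f)(x,y) = x^4 y^2$ along $(1,-1)$ at $(2,3)$: $288 - 96 = 192$.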
As a final note, it is important to take a look at the book you are reading to see how the author defines all these objects. There are certainly many approaches to defining tangent spaces, and different approaches have different identifications and actual equalities.