Let $(x^i)$ be local coordinates on our base manifold $M$ and let $(x^i, \xi_j)$ be the induced coordinates on the cotangent bundle $T^* M$. Let $\pi : T^*M \to M$ be the projection $(x^i, \xi_j) \mapsto (x^i)$. It induces a $C^\infty (M)$-linear map on $1$-forms, which I will write as $\pi^* : \Omega^1 (M) \to \Omega^1 (T^* M)$. In coordinates, this sends a $1$-form $\phi = \phi_i \, \mathrm{d} x^i$ (summation convention) to $(\phi_i \circ \pi) \, \mathrm{d} x^i$. As usual this induces an $\mathbb{R}$-linear map on the fibres, namely $\pi^*_{(x, \xi)} : T^*_x M \to T^*_{(x, \xi)} (T^* M)$, sending the covector $p$ to the covector $(p, 0)$. (We must be careful to distinguish between covectors and $1$-forms here, to avoid confusion.)
The tautological $1$-form on $T^* M$ is defined to be $\pi^*_{(x, \xi)} \xi$ at each point $(x, \xi)$ in $T^* M$. Why does this formula even make sense? Well, $\xi$ is by definition an element of $T_x^* M$, so the expression typechecks. Thus the point $(x, \xi)$ is mapped to the covector $(\xi, 0)$ in $T^*_{(x, \xi)} (T^* M)$, and so the tautological $1$-form in coordinates is given by
$$\xi_i \, \mathrm{d} x^i$$
as claimed. (The coefficient of $\mathrm{d} \xi_j$ is $0$, of course.)
(Perhaps the reason no one likes writing this out in full is that the tautological nature of the construction makes it quite confusing, unless one keeps track of the types of all the expressions involved.)
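One way to keep the types straight is to check the defining property of the tautological form $\theta = \xi_i \, \mathrm{d} x^i$: a $1$-form $\phi$ on $M$ is also a section $x \mapsto (x, \phi(x))$ of $T^*M$, and pulling $\theta$ back along that section recovers $\phi$ itself. Here is a minimal symbolic sketch with SymPy, taking $M = \mathbb{R}^2$ and a made-up choice of $\phi$; the pullback just substitutes $\xi_i = \phi_i(x)$, since $x^i \circ \phi = x^i$ and $\theta$ has no $\mathrm{d}\xi_j$ terms.

```python
import sympy as sp

x1, x2, xi1, xi2 = sp.symbols('x1 x2 xi1 xi2')

# Tautological 1-form theta = xi_1 dx^1 + xi_2 dx^2 on T*M,
# stored as its coefficient list against (dx^1, dx^2).
theta = [xi1, xi2]

# A 1-form phi on M = R^2 (hypothetical coefficients), viewed as the
# section x |-> (x, phi_1(x), phi_2(x)) of T*M.
phi = [x1 * x2, sp.sin(x1)]

# Pulling theta back along this section: x^i pulls back to x^i and
# xi_i pulls back to phi_i(x), so the coefficient of dx^i in
# phi^* theta is theta's coefficient with xi_i := phi_i(x).
pullback = [c.subs({xi1: phi[0], xi2: phi[1]}) for c in theta]

# The tautological property: phi^* theta = phi.
print(pullback == phi)  # True
```

This is exactly the sense in which $\theta$ is "tautological": it does nothing except remember the fibre coordinate.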
The best way to think about the derivative is:
\begin{equation*}
\tag{$\spadesuit$}f(x + \Delta x) \approx f(x) + f'(x) \Delta x.
\end{equation*}
The approximation is good when $\Delta x$ is small. This equation expresses the fact that $f$ is "locally linear" at $x$.
How can we make sense of ($\spadesuit$) when $f:\mathbb R^n \to \mathbb R^m$?
\begin{equation*}
f(\underbrace{x}_{n \times 1} + \underbrace{\Delta x}_{n \times 1}) \approx \underbrace{f(x)}_{m \times 1} + \underbrace{f'(x)}_{?} \underbrace{\Delta x}_{n \times 1}.
\end{equation*}
It appears that $f'(x)$ should be something that, when multiplied by an $n \times 1$ column vector, returns an $m \times 1$ column vector. In other words, $f'(x)$ should be an $m \times n$ matrix.
If we prefer to think in terms of linear transformations rather than matrices, we can write
\begin{equation*}
f(x + \Delta x) \approx f(x) + Df(x) \Delta x.
\end{equation*}
Here $Df(x)$ is a linear transformation that takes $\Delta x$ as input, and returns $f'(x) \Delta x$ as output. This equation is what it means to be "locally linear" in the multivariable case.
Taking this as our starting point, it's not too hard to show that
\begin{equation*}
f'(x) =
\begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n}
\end{bmatrix}.
\end{equation*}
(The functions $f_i$ are the component functions of $f$.)
If
\begin{equation*}
X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix},
\end{equation*}
then
\begin{equation*}
f'(x) X =
\begin{bmatrix}
\sum_{j=1}^n \frac{\partial f_1(x)}{\partial x_j} X_j \\
\vdots \\
\sum_{j=1}^n \frac{\partial f_m(x)}{\partial x_j} X_j
\end{bmatrix},
\end{equation*}
as you can see just by doing the matrix-vector multiplication.
This is the equation given in your question.
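As a sanity check on the formula for $f'(x)X$, note that it should agree with the directional derivative $\frac{d}{dt}\big|_{t=0} f(x + tX)$. Here is a symbolic sketch with SymPy; the component functions of $f : \mathbb{R}^2 \to \mathbb{R}^2$ are made up for illustration.

```python
import sympy as sp

x1, x2, t, X1, X2 = sp.symbols('x1 x2 t X1 X2')

# Hypothetical component functions of f : R^2 -> R^2.
f = sp.Matrix([x1**2 * x2, sp.exp(x1) + x2])

# Jacobian matrix f'(x): entry (i, j) is d f_i / d x_j.
J = f.jacobian([x1, x2])

X = sp.Matrix([X1, X2])

# f'(x) X by matrix-vector multiplication: the i-th entry is
# sum_j (d f_i / d x_j) X_j.
JX = J * X

# The same vector as a directional derivative: d/dt f(x + t X) at t = 0.
directional = f.subs([(x1, x1 + t * X1), (x2, x2 + t * X2)],
                     simultaneous=True).diff(t).subs(t, 0)

print(sp.simplify(JX - directional))  # zero vector
```

The two computations agree identically, which is the content of the chain rule in one variable $t$.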
A diffeomorphism is a smooth bijection with a smooth inverse. So if $f: M \longrightarrow N$ is a diffeomorphism, the inverse map $f^{-1}: N \longrightarrow M$ is smooth, and in particular $d(f^{-1})_y$ exists at every $y \in N$. Since $f^{-1} \circ f = \mathrm{Id}_M$ and $f \circ f^{-1} = \mathrm{Id}_N$, the chain rule gives $$\mathrm{Id}_{T_x M} = d(f^{-1} \circ f)_x = d(f^{-1})_{f(x)} \circ df_x$$ and $$\mathrm{Id}_{T_{f(x)} N} = d(f \circ f^{-1})_{f(x)} = df_x \circ d(f^{-1})_{f(x)}$$ for every $x \in M$. Hence $df_x$ and $d(f^{-1})_{f(x)}$ are mutually inverse linear maps; in particular $df_x$ is invertible and $(df_x)^{-1} = d(f^{-1})_{f(x)}$.
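The identity $(df_x)^{-1} = d(f^{-1})_{f(x)}$ can be seen concretely in coordinates, where differentials become Jacobian matrices. Here is a sketch with NumPy using a made-up diffeomorphism of $\mathbb{R}^2$ (a polynomial shear, chosen because its inverse is explicit).

```python
import numpy as np

# A concrete diffeomorphism of R^2 and its explicit inverse.
def f(p):
    u, v = p
    return np.array([u, v + u**3])

def f_inv(q):
    a, b = q
    return np.array([a, b - a**3])

# Their Jacobians (the differentials in these global coordinates).
def Df(p):
    u, v = p
    return np.array([[1.0, 0.0], [3 * u**2, 1.0]])

def Df_inv(q):
    a, b = q
    return np.array([[1.0, 0.0], [-3 * a**2, 1.0]])

x = np.array([0.5, -2.0])

# Chain rule: d(f^{-1})_{f(x)} composed with df_x is the identity.
composite = Df_inv(f(x)) @ Df(x)
print(np.allclose(composite, np.eye(2)))  # True

# Equivalently, (df_x)^{-1} = d(f^{-1})_{f(x)}.
print(np.allclose(np.linalg.inv(Df(x)), Df_inv(f(x))))  # True
```

The same check works at any base point, reflecting that the argument above is pointwise in $x$.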