Note that, since your domain is convex, you can join any two points $x, y$ by the line segment $c(t) = x + t(y-x)$.
Then
$$f(y)-f(x)=\int_0^1 df(x+t(y-x))(y-x)\, dt$$
Now take the inner product of both sides with $y-x$, and use the facts that $df$ is positive definite and that the integral is linear, to conclude that the result is nonzero whenever $y\neq x$.
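Explicitly, pairing both sides with $y-x$ gives
$$\langle f(y)-f(x),\,y-x\rangle=\int_0^1 \big\langle df(x+t(y-x))(y-x),\,y-x\big\rangle\,dt>0\quad\text{for }y\neq x,$$
because positive definiteness of $df$ makes the integrand strictly positive for every $t$. In particular $f(y)\neq f(x)$, so $f$ is injective.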
It's just the matrix multiplication of the two Jacobians. In fact, this can be seen as the reason why matrix multiplication is defined the way it is. If you view an $m\times n$ matrix as a linear function from $\mathbb R^n$ to $\mathbb R^m$, then the Jacobian of that linear function is the matrix itself. Matrix multiplication is just the composition of the two linear functions, so the Jacobian of the composed function is the matrix product of the two Jacobians.
Locally (that is, in an infinitesimal neighborhood), every differentiable function acts on displacements of its inputs like a linear function, so you can use matrices to describe the local behavior of any multivariable differentiable function.
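To make this concrete, here is a minimal numerical sketch (the particular $f$, $g$, base point, and step size are arbitrary choices for illustration): the Jacobian of $g\circ f$, estimated by finite differences, agrees with the product of the two Jacobians.

```python
import numpy as np

def f(x):                      # f: R^2 -> R^3 (made-up example)
    return np.array([x[0]**2, np.sin(x[1]), x[0]*x[1]])

def g(y):                      # g: R^3 -> R^2 (made-up example)
    return np.array([y[0] + y[1]*y[2], np.exp(y[0])])

def jacobian(func, x, eps=1e-6):
    """Estimate the Jacobian of func at x by central differences."""
    fx = func(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (func(x + step) - func(x - step)) / (2 * eps)
    return J

x = np.array([0.7, -1.2])
lhs = jacobian(lambda t: g(f(t)), x)       # Jacobian of the composition
rhs = jacobian(g, f(x)) @ jacobian(f, x)   # product of the two Jacobians
print(np.allclose(lhs, rhs, atol=1e-5))    # True
```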
Let's write the components of the function $f$ as $f^\mu$, where $\mu$ is an index that goes from $1$ to $m$. Similarly, you can write $g$ as $g^\mu$. In this way, you can write a vector as a symbol with indices but treat the components as numbers (as in, $f^1$ is the first component of $f(x)$, which lies in $\mathbb R$).
Then you can write the partial derivative $\frac{\partial f^\mu}{\partial x^\nu}$ as $f^\mu{}_{,\nu}$; mind the comma. It is easy to see that $f^\mu{}_{,\nu}$ is exactly the Jacobian of $f$: for example, $f^1{}_{,2} \equiv \partial f^1/\partial x^2$ is the $(1,2)$ entry of the Jacobian. This notation is called Ricci calculus.
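For instance, for $f\colon\mathbb R^2\to\mathbb R^2$ the symbols $f^\mu{}_{,\nu}$ collect into the familiar matrix
$$\left(f^\mu{}_{,\nu}\right)=\begin{pmatrix} f^1{}_{,1} & f^1{}_{,2}\\ f^2{}_{,1} & f^2{}_{,2}\end{pmatrix}=\begin{pmatrix} \partial f^1/\partial x^1 & \partial f^1/\partial x^2\\ \partial f^2/\partial x^1 & \partial f^2/\partial x^2\end{pmatrix}.$$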
The chain rule says:
$$
\left(\frac{\partial}{\partial x^\nu}g(f(x))\right)^\mu=(g\circ f)^\mu{}_{,\nu}=\sum_\gamma g^\mu{}_{,\gamma}f^\gamma{}_{,\nu}\equiv g^\mu{}_{,\gamma}f^\gamma{}_{,\nu}
$$
The last expression simplifies the summation using the Einstein summation convention, introduced by Einstein for manipulating tensors and also a part of Ricci calculus: a repeated index (here $\gamma$) is implicitly summed over.
You can see that $\mu$ goes from $1$ to $o$, $\gamma$ goes from $1$ to $m$, and $\nu$ goes from $1$ to $l$ (so $f\colon\mathbb R^l\to\mathbb R^m$ and $g\colon\mathbb R^m\to\mathbb R^o$). This immediately gives the matrix multiplication of the two Jacobians.
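If you want to check the index bookkeeping mechanically, here is a small sketch (NumPy assumed; the dimensions and the two Jacobian matrices are arbitrary placeholders): contracting the repeated index $\gamma$ with `np.einsum` reproduces the matrix product.

```python
import numpy as np

l, m, o = 4, 3, 2
Jf = np.random.randn(m, l)   # f^gamma_{,nu}: Jacobian of f at x
Jg = np.random.randn(o, m)   # g^mu_{,gamma}: Jacobian of g at f(x)

# Sum over the repeated index gamma ('g'), as the Einstein convention dictates.
einsum_result = np.einsum('mg,gn->mn', Jg, Jf)
print(np.allclose(einsum_result, Jg @ Jf))   # True
```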
Best Answer
One idea is as follows. If $f$ and $g$ are differentiable, then by definition there exists a best linear approximation:
$$ g(x+h)=g(x)+dg[x](h)+o(\|h\|) $$
where $dg[x](h)$ denotes the action of the linear map $dg[x]$ on the vector $h$.
The notation $o(\|h\|)$ means that $$\lim\limits_{h\to 0} \frac{o(\|h\|)}{\|h\|}=\mathbf{0}_{\mathbb{R}^m}.$$ The interpretation is that $o(\|h\|)$ tends to the zero vector faster than $C\|h\|$ for any constant $C>0$; hence, for small $h$, our linear approximation $dg[x].h$ is not "perturbed" by this remainder.
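A quick numerical sketch of this definition (the particular $g$, its Jacobian, and the base point are made-up examples): the ratio $\|g(x+h)-g(x)-dg[x].h\|/\|h\|$ shrinks as $h\to 0$.

```python
import numpy as np

def g(x):                          # g: R^2 -> R^2 (made-up example)
    return np.array([np.sin(x[0]) * x[1], x[0]**3 + x[1]])

def dg(x):                         # exact Jacobian of g, acting on h via @
    return np.array([[np.cos(x[0]) * x[1], np.sin(x[0])],
                     [3 * x[0]**2,         1.0]])

x = np.array([0.5, 2.0])
direction = np.array([1.0, -1.0]) / np.sqrt(2)
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * direction
    remainder = g(x + h) - g(x) - dg(x) @ h
    print(t, np.linalg.norm(remainder) / np.linalg.norm(h))  # ratio -> 0
```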
We can write the same thing for $f$: $$ f(y+k)=f(y)+df[y].k+o(\|k\|) $$
Now observe that:
\begin{align} f(g(x+h)) &=f(\overbrace{g(x)}^y+\overbrace{dg[x].h+o(\|h\|)}^k) \\ &=f(g(x))+df[g(x)](dg[x].h+o(\|h\|))+o(\|dg[x].h+o(\|h\|)\|) \end{align}
Now from the definition of $o(\cdot)$, it is not too hard to see that $$ o(\|dg[x].h+o(\|h\|)\|)=o(\|h\|), $$ since $\|dg[x].h+o(\|h\|)\|\le(\|dg[x]\|_{op}+1)\|h\|$ for small $h$, and (by linearity) \begin{align} df[g(x)](dg[x].h+o(\|h\|)) &=df[g(x)](dg[x](h))+df[g(x)](o(\|h\|)) \\ &=df[g(x)](dg[x](h))+o(\|h\|). \end{align}
We finally get $$ f(g(x+h))=f(g(x))+df[g(x)](dg[x](h))+o(\|h\|), $$ and by identification (uniqueness of the best linear approximation) $$d(f\circ g)[x]=df[g(x)]\,dg[x],$$ which is the expected result.