Don't worry: multidimensional analysis is a shock to the system the first time you see it.
1) The norm is the distance to the origin $d(x,0)$, assumed to be Euclidean, so $$\lVert x \rVert = \sqrt{\sum_{i=1}^n x_i^2}$$
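As a quick sanity check in $\mathbb{R}^3$: $$\lVert (1,2,2) \rVert = \sqrt{1^2 + 2^2 + 2^2} = \sqrt{9} = 3.$$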
2) Okay, let's get a grip on this. We have $$f:\mathbb{R}^n \to \mathbb{R}^m,$$ so we have component functions $f_1(x_1,\ldots,x_n), \ldots, f_m(x_1,\ldots,x_n)$. These are really $m$ unrelated functions $f_i:\mathbb{R}^n \to \mathbb{R}$.
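To make this concrete, here is a small made-up example with $n = m = 2$: take $$f(x_1, x_2) = \big(x_1^2 x_2,\ \sin x_1 + x_2\big),$$ so the component functions are $f_1(x_1,x_2) = x_1^2 x_2$ and $f_2(x_1,x_2) = \sin x_1 + x_2$, each a perfectly ordinary map $\mathbb{R}^2 \to \mathbb{R}$.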
Great; now, what is a derivative for a function like $f_i$? It's the gradient $\nabla f_i$. Hence the whole lot of derivative information is encoded in an $m\times n$ matrix $A=J_f(u)$ (the Jacobian at the point $u$) with components $$A_{ij} = \partial_j f_i$$
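Continuing the toy example above: $\nabla f_1 = (2x_1 x_2,\ x_1^2)$ and $\nabla f_2 = (\cos x_1,\ 1)$, and the Jacobian just stacks these gradients as rows: $$J_f(x_1,x_2) = \begin{pmatrix} 2x_1 x_2 & x_1^2 \\ \cos x_1 & 1 \end{pmatrix}.$$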
Right! But now we want a formal definition of differentiability. How do we do it for a single $f_i=g$? We say $\nabla g$ is the derivative if it tells us what the *directional* derivatives are. How does it do this?
We want $g(x+dx) \approx g(x) + dx \cdot \nabla g$: small changes get dotted with the derivative. So formally, we want a vector $v$ such that $$\lim_{h\to 0} \frac{|g(x+h) - g(x) - h \cdot v|}{\lVert h \rVert} = 0,$$ so that the error vanishes faster than $\lVert h \rVert$ itself.
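To see the definition in action on a simple made-up example, take $g(x_1,x_2) = x_1 x_2$ with $v = \nabla g = (x_2, x_1)$. Then $$g(x+h) - g(x) - h \cdot v = (x_1+h_1)(x_2+h_2) - x_1 x_2 - (h_1 x_2 + h_2 x_1) = h_1 h_2,$$ and since $|h_1 h_2| \le \tfrac12 \lVert h \rVert^2$, the quotient satisfies $|h_1 h_2| / \lVert h \rVert \le \tfrac12 \lVert h \rVert \to 0$, exactly as required.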
But now each of the $f_i$ gives a different vector $v_i$. Stacking these together as the rows of $A$, and replacing the absolute value by a norm so that the errors of all the $f_i$ are bounded at once, gives the result you have!
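Spelled out, the resulting definition (presumably the one in your notes) reads: $f$ is differentiable at $x$ with derivative $A$ if $$\lim_{h\to 0} \frac{\lVert f(x+h) - f(x) - Ah \rVert}{\lVert h \rVert} = 0,$$ where the norm in the numerator dominates the error of each individual $f_i$, since $|u_i| \le \lVert u \rVert$ for every component of a vector $u$.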
Edit: To summarize, $h$ is a small change in the coordinates, $Ah$ collects the directional derivatives of all the separate $f_i$'s in this direction, and $A$ is the 'gradient' containing all derivative information.
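In matrix form, with the gradients as the rows of $A$: $$Ah = \begin{pmatrix} \nabla f_1 \cdot h \\ \vdots \\ \nabla f_m \cdot h \end{pmatrix},$$ so multiplying $A$ by a direction $h$ really does produce all $m$ directional derivatives at once.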
Best Answer
Note that, since your domain is convex, you can join any two points $x, y$ by the line segment $c(t) = x + t(y-x)$. Then $$f(y)-f(x)=\int_0^1 df(x+t(y-x))(y-x)\, dt.$$ Now take the inner product of this with $y-x$, and use the fact that $df$ is positive and that the integral is linear to conclude that the result is nonzero if $y\neq x$.
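For completeness, here is the computation that last sentence sketches, assuming "positive" means $\langle df(p)h, h\rangle > 0$ for all $h \neq 0$: $$\langle f(y)-f(x),\ y-x\rangle = \int_0^1 \big\langle df(x+t(y-x))(y-x),\ y-x \big\rangle\, dt > 0,$$ since for $y \neq x$ the integrand is a strictly positive function of $t$. In particular $f(y) \neq f(x)$, so $f$ is injective.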