If $\nabla f({\bf x}_0) \not= {\bf 0}$, then the Jacobian of $f$ (i.e. $\nabla f$) has maximal rank at ${\bf x}_0$ and, by continuity of the gradient, at all nearby points as well. This means the implicit function theorem can be applied, so that near ${\bf x}_0$ the level set $\{ {\bf x} \in \mathbb{R}^{n} \,|\, f({\bf x})={\bf c} \}$ is a submanifold of $\mathbb{R}^n$ of dimension $n-1$. In other words, about each such point in the level set there is a diffeomorphism between a neighborhood of that point (within the level set) and an open set in $\mathbb{R}^{n-1}$.
At this point, we know the level set has a well-defined tangent space at ${\bf x}_0$. There are $n-1$ curves in the level set passing through ${\bf x}_0$ whose tangent vectors are linearly independent, so these tangent vectors span the tangent space. Then we can apply the standard argument to each of these curves. Using the chain rule, we have $f({\bf r}(t))={\bf c}$ $\Rightarrow$ $\nabla f({\bf r}(t)) {\bf \cdot} {\bf r}'(t) = 0$. So the gradient is orthogonal to each tangent and thus is orthogonal to the level set.
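To see the chain-rule step in a concrete case, here is a small worked example. The function and curve are chosen only for illustration and are not part of the original question: take $f(x,y,z)=x^2+y^2+z^2$, the level ${\bf c}=1$, and the curve ${\bf r}(t)=(\cos t,\sin t,0)$, which lies on the unit sphere. Then

$$\nabla f({\bf r}(t)) = (2\cos t,\, 2\sin t,\, 0), \qquad {\bf r}'(t) = (-\sin t,\, \cos t,\, 0),$$

$$\nabla f({\bf r}(t)) {\bf \cdot} {\bf r}'(t) = -2\cos t \sin t + 2 \sin t \cos t = 0.$$

So the gradient is orthogonal to the tangent of this particular curve at every point, as the general argument predicts.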
So you are correct. The implicit function theorem is being used to guarantee that the curves we need actually exist.
Edit: A few more details.
Take a point on the level surface, say ${\bf x}_0 = (x_1,\dots,x_{n-1},y_0)=({\bf z}_0,y_0)$. Suppose that $\nabla f({\bf x}_0) \not=0$. For convenience, suppose that the last component of the gradient is non-zero.
Then there exist a region $D$ in $\mathbb{R}^{n-1}$ of points "close to" ${\bf z}_0$ and a function $g(t_1,\dots,t_{n-1})$ from $D$ to $\mathbb{R}$ such that $g({\bf z}_0)=y_0$ and $f(t_1,\dots,t_{n-1},g(t_1,\dots,t_{n-1}))={\bf c}$ for all $(t_1,\dots,t_{n-1})$ in $D$. [This is the implicit function theorem in action. It allowed us to "solve" for the last variable in terms of the others.] Now we can define ${\bf r}_i(t)=(x_1,\dots,x_{i-1},t,x_{i+1},\dots,x_{n-1},g(x_1,\dots,x_{i-1},t,x_{i+1},\dots,x_{n-1}))$ for $i=1,\dots,n-1$. We have ${\bf r}_i(x_i)={\bf x}_0$ and $f({\bf r}_i(t))={\bf c}$. This gives us $n-1$ curves on our level set. Their tangent vectors at ${\bf x}_0$ are linearly independent, because ${\bf r}_i'(x_i)$ has a $1$ in the $i$-th slot and $0$ in each of the other first $n-1$ slots.
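The construction above can be checked numerically. The following is only a sketch under assumptions that are not in the original answer: $f(x,y,z)=x^2+y^2+z^2$ with ${\bf c}=1$, the point ${\bf z}_0=(0.3,0.4)$, and $g$ written out by hand for the upper hemisphere (in general the implicit function theorem only guarantees that $g$ exists). The helper names `r1`, `r2`, `tangent` and `dot` are made up for this illustration.

```python
import math

# Illustrative choice (an assumption, not from the answer): f = x^2 + y^2 + z^2, c = 1.
def f(x, y, z):
    return x * x + y * y + z * z

def grad_f(x, y, z):
    return (2 * x, 2 * y, 2 * z)

def g(t1, t2):
    # Explicit solution of f(t1, t2, g) = 1 on the upper hemisphere.
    return math.sqrt(1.0 - t1 * t1 - t2 * t2)

z0 = (0.3, 0.4)                  # the point z_0 in D
x0 = (z0[0], z0[1], g(*z0))      # the point x_0 on the level set

def r1(t):
    # Vary the first coordinate, keep the second fixed.
    return (t, z0[1], g(t, z0[1]))

def r2(t):
    # Vary the second coordinate, keep the first fixed.
    return (z0[0], t, g(z0[0], t))

def tangent(curve, t, h=1e-6):
    # Central-difference approximation of the derivative of the curve at t.
    p, q = curve(t + h), curve(t - h)
    return tuple((a - b) / (2 * h) for a, b in zip(p, q))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

tangents = [tangent(r1, z0[0]), tangent(r2, z0[1])]
dots = [dot(grad_f(*x0), v) for v in tangents]
on_level_set = [f(*r1(0.35)), f(*r2(0.45))]

print(dots)          # both entries should be close to 0
print(on_level_set)  # both entries should be close to 1
```

Both curves stay on the level set, and the gradient at ${\bf x}_0$ is (up to finite-difference error) orthogonal to each of their tangent vectors.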
These methods are just applications of two different geometric ideas to help you find a normal vector to a surface. I'm sure that you or I could do some variable pushing and prove that they are compatible, but I don't know how enlightening that would be. I think the most important thing is just to understand the geometry behind each of these ideas.
When you have a parametrized surface $r(u,v) = \left< x(u,v), y(u,v), z(u,v) \right>$ and a point $(u_0,v_0)$, you can consider two cross sections of that surface. The functions $$r(u_0,v) = \left< x(u_0,v), y(u_0,v), z(u_0,v) \right>$$
$$r(u,v_0) = \left< x(u,v_0), y(u,v_0), z(u,v_0) \right>$$
define curves in three dimensions which are contained in the surface $r(u,v)$. Convince yourself that a tangent vector to any curve contained in a surface is also tangent to the surface itself. Therefore the vectors
$$ \frac{\partial}{\partial v} r(u_0,v) \big|_{v=v_0} $$
and
$$ \frac{\partial}{\partial u} r(u,v_0) \big|_{u=u_0} $$
are both tangent to the surface at $r(u_0,v_0)$. Convince yourself that if these two vectors were parallel, then $r$ would look like a curve at this point rather than a surface, so they should not be parallel. In linear algebra terms, these vectors span the space of tangent vectors. Their cross product will yield a vector which is normal to both of them, and therefore normal to the tangent plane and hence to the surface. This is the definition you stated.
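Here is a minimal numerical sketch of the cross-product recipe. The surface is an assumed example, not one from the question: the unit sphere parametrized by $r(u,v)=\left<\sin u\cos v, \sin u\sin v, \cos u\right>$, which is also the level surface $x^2+y^2+z^2=1$. The partial derivatives `r_u` and `r_v` are written out by hand, and the helper names are made up for this illustration. On this one example the cross product comes out parallel to the gradient of $x^2+y^2+z^2$; that is a check of a single case, not a proof that the two methods always agree.

```python
import math

def r(u, v):
    # Assumed example surface: the unit sphere.
    return (math.sin(u) * math.cos(v), math.sin(u) * math.sin(v), math.cos(u))

def r_u(u, v):
    # Partial derivative of r with respect to u, computed by hand.
    return (math.cos(u) * math.cos(v), math.cos(u) * math.sin(v), -math.sin(u))

def r_v(u, v):
    # Partial derivative of r with respect to v, computed by hand.
    return (-math.sin(u) * math.sin(v), math.sin(u) * math.cos(v), 0.0)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

u0, v0 = 1.0, 0.5
normal = cross(r_u(u0, v0), r_v(u0, v0))

# Gradient of x^2 + y^2 + z^2 at the same point on the surface.
gradient = tuple(2 * coord for coord in r(u0, v0))

# Two vectors are parallel exactly when their cross product vanishes.
parallel_residual = cross(normal, gradient)

print(dot(normal, r_u(u0, v0)), dot(normal, r_v(u0, v0)))  # both close to 0
print(parallel_residual)                                   # all entries close to 0
```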
The other definition uses the fact that the gradient of a function at a point is perpendicular to the level surface at that point. To understand this, it is helpful to think of the lower dimensional analogy. The gradient of a function $f(x,y)$ (whose graph is a surface) will be perpendicular to the level curve at any point where it is nonzero. This is geometrically obvious if the graph of $f(x,y)$ is a plane: the level curve is a straight line along which the height stays constant, and the gradient points in the direction of greatest slope of the plane, which is perpendicular to that line. The same logic works, in fact, for $f(x,y)$ whose graph is not a plane, because the differentiability of $f$ tells us that it behaves like a plane near any given point.
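The plane case can be written out in one line. As an illustration (the coefficients $a$, $b$, $d$ are generic placeholders, with $(a,b)\neq(0,0)$), take $f(x,y)=ax+by+d$. The level curve $f(x,y)=c$ is the line $ax+by=c-d$, which has direction vector $\left<-b,a\right>$, while the gradient is constant:

$$\nabla f = \left< a, b \right>, \qquad \left< a, b \right> \cdot \left< -b, a \right> = -ab + ab = 0.$$

So for a plane the gradient is perpendicular to every level line.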
The gradient points in the direction in which the function value increases fastest. From the level curves, it looks like the value of the function increases as you move away from the origin. Hence the gradient points away from the origin.
Elaboration:
Say you are on a mountain; then all rings at constant height are level curves. The gradient is the direction in which, if you move, the value of the function (the height) increases fastest; if you take the negative gradient direction, you start descending. The question is why this happens.
The answer lies in how slope is defined. Take the one-variable case, say $\frac{\partial f}{\partial x}$. If this is positive, then moving in the direction of increasing $x$ increases the value of the function. If the slope is negative, the function value increases as we move in the negative $x$ direction. In both cases, moving in the direction given by the sign of the slope increases the function. A similar analogy applies to the multivariable case.
Please refer to a good text (maybe Thomas): first learn how the slope is defined in an arbitrary direction, and then see how it applies in the gradient case (using the dot product).
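As a rough sketch of that dot-product idea: the slope of $f$ in the direction of a unit vector ${\bf u}$ is $\nabla f \cdot {\bf u}$, which is largest when ${\bf u}$ points along the gradient. The function $f(x,y)=x^2+y^2$, the sample point, and the helper names below are assumptions chosen for illustration (the function increases as you move away from the origin, like the one described above).

```python
import math

def grad_f(x, y):
    # Gradient of the assumed example f(x, y) = x^2 + y^2.
    return (2 * x, 2 * y)

def directional_derivative(point, angle):
    # Slope of f at `point` in the unit direction (cos(angle), sin(angle)),
    # computed as the dot product of the gradient with that direction.
    gx, gy = grad_f(*point)
    return gx * math.cos(angle) + gy * math.sin(angle)

point = (1.0, 2.0)

# Try one direction per degree and keep the one with the steepest ascent.
angles = [2 * math.pi * k / 360 for k in range(360)]
best_angle = max(angles, key=lambda a: directional_derivative(point, a))

gx, gy = grad_f(*point)
gradient_angle = math.atan2(gy, gx)
gradient_length = math.hypot(gx, gy)

print(best_angle, gradient_angle)  # agree to within the one-degree grid
print(directional_derivative(point, gradient_angle), gradient_length)
print(directional_derivative(point, gradient_angle + math.pi))  # steepest descent
```

The steepest slope found by brute force lines up with the gradient direction, its value equals the length of the gradient, and the opposite direction gives the steepest descent.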