In general, the derivative of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ at a point $p \in \mathbb{R}^n$, if it exists, is the unique linear transformation $Df(p) \in L(\mathbb{R}^n,\mathbb{R}^m)$ such that
$$
\lim_{h \to 0} \frac{\|f(p+h)-f(p)-Df(p)h\|}{\|h\|} = 0;
$$
the matrix of $Df(p)$ with respect to the standard orthonormal bases of $\mathbb{R}^n$ and $\mathbb{R}^m$, called the Jacobian matrix of $f$ at $p$, therefore lies in $M_{m \times n}(\mathbb{R})$.
Now, suppose that $m=1$, so that $f : \mathbb{R}^n \to \mathbb{R}$. Then if $f$ is differentiable at $p$, $Df(p) \in L(\mathbb{R}^n,\mathbb{R}) = (\mathbb{R}^n)^\ast$ is a functional, and hence the Jacobian matrix, as you point out, lies in $M_{1 \times n}(\mathbb{R})$, i.e., is a row vector. However, by the Riesz representation theorem, $\mathbb{R}^n \cong (\mathbb{R}^n)^\ast$ via the map that sends a vector $x \in \mathbb{R}^n$ to the functional $y \mapsto \left\langle y,x \right\rangle$. Hence, if $f$ is differentiable at $p$, then the gradient of $f$ at $p$ is the unique (column!) vector $\nabla f(p) \in \mathbb{R}^n$ such that
$$
\forall h \in \mathbb{R}^n, \quad Df(p)h = \left\langle \nabla f(p),h\right\rangle;
$$
in particular, if you unpack definitions, you'll find that the Jacobian matrix of $f$ at $p$ is precisely $\nabla f(p)^T$.
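As a quick sanity check, here is a small worked example; the function $f(x,y) = x^2 y$ and the point $p = (1,2)$ are illustrative choices of mine, not anything from the question. The partials are $\partial f/\partial x = 2xy$ and $\partial f/\partial y = x^2$, so
$$
Df(p) = \begin{pmatrix} 4 & 1 \end{pmatrix}, \qquad \nabla f(p) = \begin{pmatrix} 4 \\ 1 \end{pmatrix}, \qquad Df(p)h = 4h_1 + h_2 = \left\langle \nabla f(p),h\right\rangle
$$
for every $h = (h_1,h_2)^T \in \mathbb{R}^2$: the same two numbers, arranged once as a row (the Jacobian matrix) and once as a column (the gradient).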
$\require{amsmath} $
This can require a longish answer, depending on where we're starting from - or how much you want! I hope you are familiar with matrices, vectors - in a slightly abstract form - and linear transformations.
Depending on the context, there is at least one 'a priori' definition of the gradient $\nabla f$ of $f\colon \mathbb R^n \rightarrow \mathbb R$, other than the vector of partial derivatives $\partial f/ \partial x_i$, although any definition should match up with that one in this context. Suppose we have a nice path
$$ \gamma \colon \mathbb R \rightarrow \mathbb R^n.$$
Then, $f\circ \gamma \colon \mathbb R \rightarrow \mathbb R$ is a calculus one style function, and if $\gamma ( 0 ) = p \in \mathbb R^n$,
$$ (f\circ \gamma )' ( 0 ) = \sum_{i=1}^n {\partial f \over \partial x_i} (p) \ \gamma_i'(0). $$
In one interpretation - the one you have in mind, I think, and the usual one when one thinks 'gradient' - the sum on the right is a dot product between the gradient vector $\nabla f|_p$ and the vector $\gamma'(0)$:
$$ (f\circ \gamma )'\, ( 0 ) = \nabla f|_p \cdot \gamma'(0).\tag{*}$$
Another interpretation is to write the sum as a matrix multiplication:
$$\left({\partial f \over \partial x_1} (p), \cdots, {\partial f \over \partial x_n} (p) \right) \left(\matrix{ \gamma_1'(0) \\ \vdots \\\gamma_n'(0) }\right), $$
i.e., to think of the row matrix of partials of $f$ - let's denote it $f'(p)$ - applied to the (column) vector $\gamma'(0)$:
$$ (f\circ \gamma )' \,( 0 ) = f'(p) \ \gamma'(0). \tag{**}$$
On the one hand, with this formulation, we are no longer thinking of the collection of partials as a vector (the gradient $\nabla f|_p$), but as a linear transformation, $f'(p)$, applied to a tangent vector $\gamma'(0)$ at $p$, whose image is another tangent vector, $(f\circ \gamma )' \,( 0 )$, at $(f\circ \gamma )\,(0) = f(p)$.
On the other, this formulation suggests "chain rule" - does it not?
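For anyone who wants to see $(*)$ and $(**)$ agree numerically, here is a minimal sketch in plain Python. The function `f`, the path `gamma`, and the point $p = (1,2)$ are hypothetical examples of mine, and the left-hand side is only approximated by a central difference, so the two numbers match up to rounding rather than exactly.

```python
import math

def f(x, y):
    # Hypothetical example function f : R^2 -> R
    return x**2 * y + math.sin(y)

def partials_f(x, y):
    # Partial derivatives of f, worked out by hand
    return (2 * x * y, x**2 + math.cos(y))

def gamma(t):
    # Hypothetical path with gamma(0) = p = (1, 2)
    return (1 + 3 * t, 2 - t + t**2)

# gamma'(0), worked out by hand
gamma_prime_0 = (3.0, -1.0)

def derivative_at_0(func, eps=1e-6):
    # Central-difference approximation of func'(0)
    return (func(eps) - func(-eps)) / (2 * eps)

p = gamma(0.0)

# Left-hand side: the calculus-one derivative of t -> f(gamma(t)) at t = 0
lhs = derivative_at_0(lambda t: f(*gamma(t)))

# Right-hand side: the sum of partials times gamma_i'(0), which can be read
# as a dot product (*) or as a row matrix times a column vector (**)
rhs = sum(d * v for d, v in zip(partials_f(*p), gamma_prime_0))

print(lhs, rhs)
```

The arithmetic in `rhs` is identical under both readings; what changes is only whether the tuple of partials is thought of as a vector or as a $1 \times n$ matrix.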
With that in mind, if $ f \colon \mathbb R^n \rightarrow \mathbb R^m$, the above suggests that the matrix, let's call it $f'(p)$,
$$ \pmatrix { {\partial f_1\over \partial x_1}(p) &\cdots &{\partial f_1\over \partial x_n}(p)\cr
\vdots & & \vdots \cr
{\partial f_m\over \partial x_1}(p) &\cdots &{\partial f_m\over \partial x_n}(p)
} $$
is a linear transformation, taking tangent vectors at $p$ to tangent vectors at $q=f(p)$. Namely, $t \mapsto f\circ \gamma (t)$ is a curve in $\mathbb R^m$, with tangent vector at $q$,
$$( f\circ\gamma )'\,(0) = f'(p)\ \gamma'(0),$$
where the multiplication on the right is matrix multiplication.
To keep to the calculus one formulation one should also consider a function $g \colon \mathbb R^m \rightarrow \mathbb R$: then $t \mapsto (g\circ f \circ \gamma) (t) $ is an ordinary real-valued function, and has a 'calculus one' style derivative at $t=0$, but one calculates it, chain rule style in higher dimensions, by:
$$ (g\circ f \circ \gamma)'\, (0) = g'(q)\ f'(p)\ \gamma'(0),$$
where the multiplication on the RHS is matrix multiplication.
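Here is a small sketch of that triple product in plain Python, with $n = 2$ and $m = 3$. The maps `f`, `g`, `gamma` and the helper `matmul` are hypothetical examples of mine, the Jacobian matrices are worked out by hand, and the direct derivative is only a central-difference approximation.

```python
def gamma(t):
    # Hypothetical path in R^2 with gamma(0) = p = (1, 2)
    return (1 + t, 2 - 3 * t)

def f(x, y):
    # Hypothetical map f : R^2 -> R^3
    return (x * y, x + y, x**2)

def g(u, v, w):
    # Hypothetical function g : R^3 -> R
    return u + v * w

def matmul(A, B):
    # Plain matrix product of two lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

p = gamma(0.0)
q = f(*p)
x, y = p
u, v, w = q

# Matrices of partial derivatives, worked out by hand
g_prime_q = [[1.0, w, v]]                          # 1 x 3 row matrix g'(q)
f_prime_p = [[y, x], [1.0, 1.0], [2 * x, 0.0]]     # 3 x 2 matrix f'(p)
gamma_prime_0 = [[1.0], [-3.0]]                    # 2 x 1 column gamma'(0)

# Chain rule: g'(q) f'(p) gamma'(0) is a 1 x 1 matrix
chain = matmul(matmul(g_prime_q, f_prime_p), gamma_prime_0)[0][0]

# Direct check: central difference of t -> g(f(gamma(t))) at t = 0
eps = 1e-6
direct = (g(*f(*gamma(eps))) - g(*f(*gamma(-eps)))) / (2 * eps)

print(chain, direct)
```

Note that the shapes force the order of multiplication: $(1 \times 3)(3 \times 2)(2 \times 1)$ is the only way the three factors compose.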
To return more explicitly to your question - I believe that it is somewhat unusual to use gradient notation in the context of the Jacobian matrix: 'usually' gradient means there is a 'dot product' type of context...
To illustrate: in $\mathbb R^3$, as you know, if $f\colon \mathbb R^3 \to \mathbb R $ is a nice map, and $f(p) = b$, then $\nabla f|_p$ is normal to the tangent space at $p$ of the level surface $f(x) =b$ (its dot product with every tangent vector there is $0$), and, by the by, it measures the direction and magnitude of maximal change - a vector, i.e., the gradient, of course.
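A concrete instance, with an example function of my own choosing: take $f(x,y,z) = x^2+y^2+z^2$ and $b = 1$, so the level surface is the unit sphere and $\nabla f|_p = 2p$. If $\gamma$ is a nice path on the sphere with $\gamma(0) = p$, then $f\circ\gamma$ is constant, and equation $(*)$ gives
$$ 0 = (f\circ \gamma )'\,(0) = \nabla f|_p \cdot \gamma'(0) = 2\, p \cdot \gamma'(0), $$
i.e., every tangent vector to the sphere at $p$ is perpendicular to $p$, as expected.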
[Also (more generally?), differential geometry often comes with dot-product structures on tangent spaces - e.g., one can say that a tangent plane of a sphere inherits a dot product from the one in the ambient $\mathbb R^3$ - and one can talk of the gradient of a function $f$ at $p$ as the tangent vector $({\rm grad} f)(p)$ on some geometric space (e.g., the surface of a sphere $S$) which satisfies $({\rm grad} f)(p) \cdot v = f'(p) v$, for all tangent vectors $v$ to $S$ at $p$, and where the $\cdot$ is the dot product, and where I am understanding $f'(p)$ as I have used it in this answer - as a derivative/linear map.]
This answer uses the $f'$ notation to emphasize the chain rule... But, depending on context, the (various?) $f'(p)$ of this answer is (are?) also often denoted $ df_p$, $\partial f/ \partial x|_p$, $Df_p$, etc. For instance, in this question $\gamma' (0)$ showed up as a tangent vector - but drinking my own Kool Aid, one could equally well think of it as a linear map from the tangent space of $\mathbb R$ at $t = 0$...
This happens in calculus one: when we write $f'(0)$, do we mean the slope of the tangent line - i.e., as a matrix, a linear map - or a number (rate of change)?
In any event, the identification of gradient with matrix is the identification of the right hand sides of equations $(*)$ and $(**)$.
As you see, there is a Tower of Babel of identifications going on, and one needs to be clear about which is in use; one could 'choose' non-conflicting notation, but there is a lot of history here, and one shouldn't be arthritic about it...
Hope this helps.
Yes, the distinction between row vectors and column vectors is important. On an arbitrary smooth manifold $M$, the derivative of a function $f : M \to \mathbb{R}$ at a point $p$ is a linear transformation $df_p : T_p(M) \to \mathbb{R}$; in other words, it's a cotangent vector. In general the tangent space $T_p(M)$ does not come equipped with an inner product (this is an extra structure: see Riemannian manifold), so in general we cannot identify tangent vectors and cotangent vectors.
So on a general manifold one must distinguish between vector fields (families of tangent vectors) and differential $1$-forms (families of cotangent vectors). While $df$ is a differential form and exists for all $M$, $\nabla f$ can't be sensibly defined unless $M$ has a Riemannian metric, and then it's a vector field (and the identification between differential forms and vector fields now depends on the metric).
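To see the dependence on the metric in coordinates, here is a small sketch; the matrix $G$ and the function $f$ are illustrative choices of mine. Take $M = \mathbb{R}^2$ with the inner product $\langle u, v \rangle_G = u^T G v$ on every tangent space, and $f(x,y) = x + y$. The gradient $\nabla_G f$ is characterized by $\langle \nabla_G f, v \rangle_G = df(v)$ for all $v$, which in matrix terms reads $G \, \nabla_G f = (df)^T$, so
$$
G = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}, \qquad df = \begin{pmatrix} 1 & 1 \end{pmatrix}, \qquad \nabla_G f = G^{-1} (df)^T = \begin{pmatrix} 1 \\ 1/4 \end{pmatrix},
$$
whereas the standard inner product ($G = I$) gives $\nabla f = (1,1)^T$. The row vector $df$ is the same in both cases; only the column vector representing it changes with the metric.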
If one thinks of tangent vectors as column vectors, then $\nabla f$ ought to be a column vector, but the linear functional $\langle -, \nabla f \rangle$ ought to be a row vector. A major problem with working entirely in bases is that distinctions like these are frequently glossed over, and then when they become important students are very confused.
Some remarks about non-canonicity. The tangent space $T_p(V)$ to a vector space at any point can be canonically identified with $V$, so for vector spaces we don't run into quite the same problems. If $V$ is an inner product space, then in the same way it automatically inherits the structure of a Riemannian manifold by the above identification. Finally, when people write $V = \mathbb{R}^n$ they frequently intend $\mathbb{R}^n$ to have the standard inner product with respect to the standard basis, and this equips $V$ with the structure of a Riemannian manifold.