Suppose the vector-valued function $\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m$ has a (total) derivative at $\mathbf{x_0}\in \mathbb{R}^n$, denoted by $\mathrm{d}_\mathbf{x_0}\mathbf{f}$. It is a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$. It yields the (total) differential of $\mathbf{f}$ at $\mathbf{x_0}$, a function from $\mathbb{R}^n$ to $\mathbb{R}^m$: applied to a vector $\mathbf{x}$ near $\mathbf{x_0}$, it gives $\mathrm{d}_\mathbf{x_0}\mathbf{f}\left(\mathbf{x}-\mathbf{x_0}\right)$. With respect to the standard basis sets $\left\{\mathbf{\hat{a}}_i\right\}_{i=1}^{n}$ and $\left\{\mathbf{\hat{b}}_i\right\}_{i=1}^{m}$ of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, the total derivative $\mathrm{d}_\mathbf{x_0}\mathbf{f}$ corresponds to the $m \times n$ matrix called the Jacobian matrix
$$\left(\mathrm{d}_\mathbf{x_0}\mathbf{f}\right)=\left(\begin{matrix}\frac{\partial f_1}{\partial x_1}&\cdots&\frac{\partial f_1}{\partial x_n}\\\vdots&\ddots&\vdots\\\frac{\partial f_m}{\partial x_1}&\cdots&\frac{\partial f_m}{\partial x_n}\end{matrix}\right),\qquad\mathbf{x}=x_i \mathbf{\hat{a}}_i,\quad\mathbf{f}\left(\mathbf{x}\right)=f_i\left(\mathbf{x}\right)\mathbf{\hat{b}}_i$$
(Summation over repeated indices is implied throughout.)
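As a quick numerical sketch of this (not part of the argument; the function `f` and the helper `jacobian_fd` are made up for illustration, using a hypothetical $\mathbf{f}:\mathbb{R}^2\rightarrow\mathbb{R}^3$ and central finite differences):

```python
import numpy as np

# Hypothetical example: f : R^2 -> R^3, f(x, y) = (x*y, sin(x), x + y**2).
def f(v):
    x, y = v
    return np.array([x * y, np.sin(x), x + y**2])

def jacobian_fd(f, x0, h=1e-6):
    """Approximate the m x n Jacobian at x0 by central differences."""
    n = len(x0)
    m = len(f(x0))
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x0 + e) - f(x0 - e)) / (2 * h)
    return J

x0 = np.array([1.0, 2.0])
print(jacobian_fd(f, x0))
# Analytic Jacobian at (1, 2): rows are the partials of each f_i, i.e.
# [[y, x], [cos(x), 0], [1, 2y]] = [[2, 1], [0.5403, 0], [1, 4]].
```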
On the other hand, the gradient of $\mathbf{f}$, denoted by $\nabla\mathbf{f}$, is a linear transformation from $\mathbb{R}^m$ back to $\mathbb{R}^n$, defined, with respect to the same standard basis sets, so that its corresponding matrix is the $n \times m$ matrix
$$\left(\nabla\mathbf{f}\right)=\left(\begin{matrix}\frac{\partial f_1}{\partial x_1}&\cdots&\frac{\partial f_m}{\partial x_1}\\\vdots&\ddots&\vdots\\\frac{\partial f_1}{\partial x_n}&\cdots&\frac{\partial f_m}{\partial x_n}\end{matrix}\right)$$
Note, at least with respect to standard basis sets, that the gradient is the transpose of the total derivative.
The variation of the function $\mathbf{f}$ at $\mathbf{x_0}$ in the direction of a unit vector $\mathbf{u} \in \mathbb{R}^n$, i.e. the directional derivative of $\mathbf{f}$, denoted by $\mathrm{D}_\mathbf{u}\mathbf{f}\left(\mathbf{x_0}\right)$, is a vector in $\mathbb{R}^m$ obtained by applying the total derivative to $\mathbf{u}$,
$$\mathrm{D}_\mathbf{u}\mathbf{f}\left(\mathbf{x_0}\right)=\mathrm{d}_\mathbf{x_0}\mathbf{f}\,\mathbf{u}$$
In the special case $m=1$, i.e. a scalar-valued function, the right-hand side of the equation above is the product of a $1\times n$ matrix and an $n\times1$ matrix, which is a scalar. In this particular case, the matrix product also equals the dot product of the gradient of the function with the unit vector. That is why, for scalar functions, textbooks always link the gradient to the directional derivative through the dot product as the rule of calculation. However, the dot-product rule does not generalize directly to vector-valued functions.
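To see this coincidence numerically, here is a minimal sketch (hypothetical scalar function `g`, finite-difference gradient; both names are made up): the $1\times n$ row matrix applied to $\mathbf{u}$ and the dot product $\nabla g\cdot\mathbf{u}$ give the same number.

```python
import numpy as np

# Scalar case m = 1: the 1 x n Jacobian row, applied to u, is the same
# number as the dot product of the gradient (its transpose) with u.
def g(v):                      # hypothetical scalar function g : R^3 -> R
    x, y, z = v
    return x**2 * y + np.sin(z)

def grad_fd(g, x0, h=1e-6):
    return np.array([(g(x0 + h*e) - g(x0 - h*e)) / (2*h)
                     for e in np.eye(len(x0))])

x0 = np.array([1.0, 2.0, 0.5])
u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)   # unit direction

row = grad_fd(g, x0).reshape(1, -1)          # 1 x n Jacobian row
print(row @ u)                               # matrix product -> scalar
print(np.dot(grad_fd(g, x0), u))             # gradient dot u: same value
```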
About your appendix: if we use the tensor product of the basis vectors of two vector spaces as a basis to express a linear transformation between these two vector spaces, we must be careful about the dimensions. In fact, tensor analysis already has a rigorous and general definition. But here let us redefine things only for the specific purpose described in this question.
Let $\mathcal{V}_n$ denote an $n$-dimensional vector space over $\mathbb{R}$. A linear transformation $\mathbf{A}:\mathcal{V}_n\rightarrow\mathcal{W}_m$ has an $m\times n$ matrix representation $A_{ji}$ with respect to basis sets $\left\{\mathbf{\hat{e}}_i\right\}\subset\mathcal{V}_n$ and $\left\{\mathbf{\hat{f}}_i\right\}\subset\mathcal{W}_m$, obtained by letting $\mathbf{A}$ act on the former to give $n$ vectors $\mathbf{u}_i=A_{ji}\mathbf{\hat{f}}_j$. Acting on a vector $\mathbf{c}\in\mathcal{V}_n$, we get $\mathbf{d}\in\mathcal{W}_m$, $\mathbf{d}=\mathbf{Ac}$. With respect to these bases, the transformation is the matrix calculation
$$\left(\begin{matrix}d_1\\\vdots\\d_m\end{matrix}\right)=\left(\begin{matrix}A_{11}&\cdots&A_{1n}\\\vdots&&\vdots\\A_{m1}&\cdots&A_{mn}\end{matrix}\right)\left(\begin{matrix}c_1\\\vdots\\c_n\end{matrix}\right)$$
or $d_j=A_{ji}c_i$.
On the other hand, if, under a certain definition of the tensor product of two vectors, the tensor product of $\mathbf{v}\in\mathcal{V}_n$ and $\mathbf{w}\in\mathcal{W}_m$ is expressed with respect to the same basis sets as $\mathbf{v}\otimes\mathbf{w}=v_i w_j \mathbf{\hat{e}}_i\otimes\mathbf{\hat{f}}_j$, then the resulting tensor corresponds to the $n\times m$ matrix representation $v_iw_j$. To construct from $\mathbf{v}$ and $\mathbf{w}$ a tensor that can act on vectors of $\mathcal{V}_n$, we have to use $\mathbf{w}\otimes\mathbf{v}$.
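In numpy terms this is just a shape check (the vectors are made up): `np.outer` realizes the component array $v_iw_j$, and the two orders of the product give transposed shapes.

```python
import numpy as np

# With components v_i w_j, the tensor v (x) w corresponds to an n x m
# matrix (np.outer(v, w)), while w (x) v gives the m x n matrix that
# can act on vectors of V_n.
v = np.array([1.0, 2.0, 3.0])      # in V_n, n = 3
w = np.array([4.0, 5.0])           # in W_m, m = 2

print(np.outer(v, w).shape)        # (3, 2): n x m, components v_i w_j
print(np.outer(w, v).shape)        # (2, 3): m x n, can multiply a c in R^n
```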
Therefore, to express the linear transformation $\mathbf{A}$ in terms of the two basis sets, it should take the form $\mathbf{A}=A^\prime_{ij}\mathbf{\hat{f}}_i\otimes\mathbf{\hat{e}}_j$. To see the relation between the two $m\times n$ matrices $A_{ji}$ and $A^\prime_{ij}$, we apply $\mathbf{A}$ again to the vector $\mathbf{c}$, this time using the expression with $A^\prime_{ij}$, and require the result to be $\mathbf{d}$. We get in this case $\mathbf{d}=A^\prime_{ij}c_k\left(\mathbf{\hat{f}}_i\otimes\mathbf{\hat{e}}_j\right)\mathbf{\hat{e}}_k$.
To proceed we need an additional postulate in the present discussion, namely the rule
$$\left(\mathbf{w}\otimes\mathbf{v}\right)\mathbf{c}=\mathbf{w}\left(\mathbf{v}\cdot\mathbf{c}\right)$$
Then $\mathbf{d}=A^\prime_{ij}c_k\left(\mathbf{\hat{f}}_i\otimes\mathbf{\hat{e}}_j\right)\mathbf{\hat{e}}_k=A^\prime_{ij}c_k\mathbf{\hat{f}}_i\left(\mathbf{\hat{e}}_j\cdot\mathbf{\hat{e}}_k\right)$. We again have to stop here unless we are in the special case that $\left\{\mathbf{\hat{e}}_i\right\}$ is an orthonormal basis set. In that case $\mathbf{\hat{e}}_j\cdot\mathbf{\hat{e}}_k=\delta_{jk}$, so $\mathbf{d}=A^\prime_{ij}c_j\mathbf{\hat{f}}_i$, or $d_i=A^\prime_{ij}c_j$. Comparing with $d_i=A_{ij}c_i$ above and taking care with the subscripts, we see that $A^\prime_{ij}=A_{ij},\ i=1,\cdots,m,\ j=1,\cdots,n$.
We can now conclude that the linear transformation $\mathbf{A}:\mathcal{V}_n\rightarrow\mathcal{W}_m$ can be expressed with respect to orthonormal basis sets $\left\{\mathbf{\hat{e}}_i\right\}\subset\mathcal{V}_n$ and $\left\{\mathbf{\hat{f}}_i\right\}\subset\mathcal{W}_m$ as $\mathbf{A}=A_{ij}\mathbf{\hat{f}}_i\otimes\mathbf{\hat{e}}_j$ under the rule $\left(\mathbf{w}\otimes\mathbf{v}\right)\mathbf{c}=\mathbf{w}\left(\mathbf{v}\cdot\mathbf{c}\right)$.
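A small numpy check of both the rule and this conclusion, under the assumption of standard (hence orthonormal) bases and random made-up data:

```python
import numpy as np

# Check the rule (w (x) v) c = w (v . c) and the representation
# A = sum_ij A_ij f_i (x) e_j, with the standard bases as e_i, f_i.
rng = np.random.default_rng(0)
m, n = 2, 3
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

# The rule: np.outer(w, v) @ c equals w * (v @ c).
w, v = rng.standard_normal(m), rng.standard_normal(n)
assert np.allclose(np.outer(w, v) @ c, w * (v @ c))

# The representation: summing A[i, j] * f_i (x) e_j rebuilds A.
f_hat, e_hat = np.eye(m), np.eye(n)
A_rebuilt = sum(A[i, j] * np.outer(f_hat[i], e_hat[j])
                for i in range(m) for j in range(n))
assert np.allclose(A_rebuilt @ c, A @ c)
print("both identities hold")
```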
So the (total) derivative of the function $\mathbf{f}$, $\mathrm{d}_\mathbf{x_0}\mathbf{f}$, i.e. the "correct" linear transformation we use to act on a unit vector to get a directional derivative, should be expressed as $\mathrm{d}_\mathbf{x_0}\mathbf{f}=\frac{\partial f_i}{\partial x_j}\mathbf{\hat{b}}_i\otimes\mathbf{\hat{a}}_j$ (since the standard bases are orthonormal).
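As a sketch of this formula in action (the same hypothetical `f` as in the earlier snippet, with its analytic partials hard-coded): building $\mathrm{d}_\mathbf{x_0}\mathbf{f}$ as a sum of outer products and applying it to a unit vector reproduces the limit definition of the directional derivative.

```python
import numpy as np

# Build d_{x0}f as sum_ij (df_i/dx_j) b_i (x) a_j and apply it to u.
def f(v):                                  # hypothetical f : R^2 -> R^3
    x, y = v
    return np.array([x * y, np.sin(x), x + y**2])

x0 = np.array([1.0, 2.0])
# Analytic partials df_i/dx_j at x0 = (1, 2):
partials = np.array([[2.0, 1.0],           # d(xy)/dx = y,  d(xy)/dy = x
                     [np.cos(1.0), 0.0],   # d(sin x)/dx,   d(sin x)/dy
                     [1.0, 4.0]])          # d(x + y^2)/dx, d(x + y^2)/dy

b_hat, a_hat = np.eye(3), np.eye(2)        # standard (orthonormal) bases
D = sum(partials[i, j] * np.outer(b_hat[i], a_hat[j])
        for i in range(3) for j in range(2))

u = np.array([0.6, 0.8])                   # unit direction
print(D @ u)                               # directional derivative in R^3
t = 1e-6                                   # check against the limit definition
print((f(x0 + t*u) - f(x0 - t*u)) / (2*t))
```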
I don’t know how well you already know this stuff, so I’m going to just justify everything I can. There’s a paragraph about the gradient at the bottom. Also, the logic extends, so I’ll just talk about functions with 2 inputs (surfaces) instead of three; it might be more helpful. You also might want some visuals, and there are several things online; I believe Khan Academy has some helpful visuals, even though I’m not always a fan of their regular videos.
Consider the multivariable chain rule. Recall that it works intuitively because the change in z arises from the combined changes in x and in y. You can also try some practical thought experiments to help make it clear, all with the basic idea that for $z(x(t),y(t))$, the change in the height z as you go through time (or, more generally, some other parameter) is the amount it changes due to x plus the amount it changes due to y, since they are independent. You could readily substitute "rate" for "amount" in that previous sentence, which yields the multivariable chain rule, as desired.
Now, this $f(x(t),y(t))$ can be interpreted as a path above a curve in the xy plane, which goes along the surface $f$. When the curve $(x,y)$ is a straight line and you move along it with constant speed, you have the derivative as you move in that direction (not just the x or y directions, which give the partial derivatives). If you move at a different speed, say twice the speed, the height changes twice as much in the same amount of time, so the directional derivative also doubles. Directional derivatives, however, are taken while moving at constant unit speed (so they’re like a regular partial derivative but moving in a different direction); that’s why the vector is normalized. Moving along an unnormalized vector is just moving at a different speed, and since I don’t feel like normalizing this one: suppose $\frac{dy}{dt}=2$ and $\frac{dx}{dt}=3$. As you move along the vector $(3,2)$ in a unit of time, the rate of change of z is obtained by substituting these into the multivariable chain rule, which is algebraically equivalent to the gradient dotted with the vector you move along in a period.
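Here’s that thought experiment as a quick numerical check (a hypothetical surface $z=x^2+xy$ that I made up; the velocities are the ones above):

```python
import numpy as np

# Move along (dx/dt, dy/dt) = (3, 2) and compare dz/dt from the chain
# rule / gradient dot product with a direct finite difference along the path.
def z(x, y):
    return x**2 + x * y

x0, y0 = 1.0, 1.0
vel = np.array([3.0, 2.0])                 # (dx/dt, dy/dt), not normalized

grad = np.array([2*x0 + y0, x0])           # (dz/dx, dz/dy) at (1, 1) = (3, 1)
print(grad @ vel)                          # chain rule: 3*3 + 1*2 = 11

t = 1e-6                                   # direct check along the path
path = lambda t: z(x0 + 3*t, y0 + 2*t)
print((path(t) - path(-t)) / (2*t))        # ~ 11
```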
Additionally, the dot product says how much length of one vector there is per unit length of the other along its direction, because it is equivalent to projecting one vector onto the other and multiplying their lengths. Note: it doesn’t matter which is projected onto which, because the dot product is symmetric. Anyway, locally the gradient is the direction of steepest ascent, and dotting a vector with it is just asking how much this unit motion lines up with the direction of main (positive) ascent, the quantity you’re measuring. This comparison to the gradient is logically the same as asking how much of the increase (gradient) is in a particular direction: hence, directional derivative.
*The gradient is how a function ascends, i.e. a certain amount in one direction plus a certain amount in the other(s), or $\hat{\imath}\frac{\partial f}{\partial x}+\hat{\jmath}\frac{\partial f}{\partial y}$. Viewed in this light, it only makes sense that “how it’s ascending” ascends in the direction of steepest ascent. It’s just as the gradient of a single-variable function is a vector telling you loosely “how it’s changing”, and if you can connect this to the idea that the vector (which is just a scalar in this case) points in the direction of increase according to how it’s increasing, you can see the analogy for higher-dimensional gradients. Of course, you could also prove this analytically in several ways, one of which is working backwards and optimizing the angle of the directional derivative (which, weirdly, isn’t circular reasoning and doesn’t require you to already know about the gradient).
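If you want to see the “steepest ascent” claim numerically rather than analytically, here’s a tiny scan over unit directions (the gradient vector is made up for illustration):

```python
import numpy as np

# Among unit directions u(theta), the directional derivative grad . u
# is largest when u points along the gradient itself.
grad = np.array([3.0, 1.0])                # hypothetical gradient at a point

thetas = np.linspace(0, 2*np.pi, 3601)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # unit vectors
rates = dirs @ grad                        # directional derivative per angle

best = thetas[np.argmax(rates)]
print(best, np.arctan2(grad[1], grad[0]))  # both ~ 0.3217 rad
```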
I haven’t thought through this too much, but since every analytic function has a tangent plane, you could figure it out by doing it on a plane.
And again, choosing some real-world or more abstract variables and doing a thought experiment with this is fun. And as Mohammad said, like a normal slope, it’s a scalar.
Best Answer
$\newcommand{\dd}{\partial}$Let's say for definiteness that $f$ is a real-valued function of three variables, defined in some neighborhood of a point $x_{0}$. (The story holds with obvious modifications for any number of variables.)
The "primitive" concept is "differentiability": We say $f$ is differentiable at $x_{0}$ if there exists a linear function $L$ of three variables satisfying $$ \lim_{h \to 0} \frac{|f(x_{0} + h) - f(x_{0}) - Lh|}{|h|} = 0. $$ One straightforwardly shows $L$ is unique, and introduces the notation $Df(x_{0}) := L$.
The matrix of $Df(x_{0})$ with respect to the standard basis has the partial derivatives of $f$ as components, more-or-less by definition of a partial derivative: $$ Df(x_{0}) = [\begin{array}{@{}ccc@{}} D_{1}f(x_{0}) & D_{2}f(x_{0}) & D_{3}f(x_{0}) \\ \end{array}] = \Bigl[\begin{array}{@{}ccc@{}} \frac{\dd f}{\dd x}(x_{0}) & \frac{\dd f}{\dd y}(x_{0}) & \frac{\dd f}{\dd z}(x_{0}) \\ \end{array}\Bigr]. $$
Because of how linear functions work (which amounts to the chain rule here), if $v = (v_{1}, v_{2}, v_{3})$ is a vector, then \begin{align*} \frac{d}{dt}\bigg|_{t=0} f(x_{0} + tv) = Df(x_{0})v &= v_{1} D_{1}f(x_{0}) + v_{2} D_{2}f(x_{0}) + v_{3} D_{3}f(x_{0}) \\ &= v_{1} \frac{\dd f}{\dd x}(x_{0}) + v_{2} \frac{\dd f}{\dd y}(x_{0}) + v_{3} \frac{\dd f}{\dd z}(x_{0}). \end{align*}
This expression may be interpreted as a Euclidean dot product. We're led to give a name to the resulting vector $\nabla f(x_{0})$, a.k.a. the transpose of $Df(x_{0})$. That's "where the gradient comes from".
As for properties of the gradient, they follow from the preceding equation and properties of the dot product. If $\theta$ is the angle between $\nabla f(x_{0})$ and $v$, the directional derivative of $f$ at $x_{0}$ along $v$ is $$ D_{v}f(x_{0}) = \frac{d}{dt}\bigg|_{t=0} f(x_{0} + tv) = \nabla f(x_{0}) \cdot v = |\nabla f(x_{0})|\, |v| \cos\theta. $$ For $|v|$ fixed, $D_{v}f(x_{0})$ is maximized when $\theta = 0$ ($v$ is positively proportional to the gradient), minimized when $\theta = \pi$ ($v$ is negatively proportional to the gradient), and $0$ when $\theta = \pi/2$ ($v$ is orthogonal to the gradient).
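Here is the displayed identity checked numerically (the same hypothetical $f$ as above): the limit definition, the dot product, and the $\cos\theta$ form all agree.

```python
import numpy as np

# Hypothetical f(x, y, z) = x*y + z**2 at x0 = (1, 2, 3) along v = (1, -1, 2).
def f(p):
    x, y, z = p
    return x*y + z**2

x0 = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, -1.0, 2.0])
grad = np.array([2.0, 1.0, 6.0])                 # (y, x, 2z) at x0

t = 1e-6                                         # d/dt|_{t=0} f(x0 + t v)
print((f(x0 + t*v) - f(x0 - t*v)) / (2*t))       # ~ 13
print(grad @ v)                                  # dot product: 13

cos_theta = grad @ v / (np.linalg.norm(grad) * np.linalg.norm(v))
print(np.linalg.norm(grad) * np.linalg.norm(v) * cos_theta)  # 13 again
```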