I don’t know how much of this you already know, so I’ll justify everything I can. There’s a paragraph about the gradient at the bottom. The logic extends to more variables, so I’ll talk only about functions with two inputs (surfaces) instead of three, which may be easier to picture. Visuals will probably help: there are several online, and I believe Khan Academy has some good ones, even though I’m not always a fan of their regular videos.
Consider the multivariable chain rule. It works intuitively because the change in $z$ arises from the combined changes in $x$ and in $y$. Some practical thought experiments help make this clear, all built on the same basic idea: for $z(x(t),y(t))$, the amount the height $z$ changes as you go through time (or, more generally, some other parameter) is the amount it changes due to $x$ plus the amount it changes due to $y$, since $x$ and $y$ are independent inputs whose small contributions add. Replace “amount” with “rate” in that sentence and you have the multivariable chain rule: $\frac{dz}{dt}=\frac{\partial z}{\partial x}\frac{dx}{dt}+\frac{\partial z}{\partial y}\frac{dy}{dt}$.
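If you want to convince yourself numerically, here is a minimal sketch. The surface $z=x^2y$ and the path $x=\cos t$, $y=\sin t$ are my own made-up example, not anything special; the point is only that "rate due to $x$ plus rate due to $y$" matches a direct finite-difference estimate of $dz/dt$.

```python
import math

# Hypothetical example surface and path (chosen only for illustration).
def z(x, y):
    return x**2 * y

def dz_dx(x, y):
    return 2 * x * y

def dz_dy(x, y):
    return x**2

def x_of(t):
    return math.cos(t)

def y_of(t):
    return math.sin(t)

def dx_dt(t):
    return -math.sin(t)

def dy_dt(t):
    return math.cos(t)

def chain_rule_rate(t):
    """Rate of change of z: the part due to x plus the part due to y."""
    x, y = x_of(t), y_of(t)
    return dz_dx(x, y) * dx_dt(t) + dz_dy(x, y) * dy_dt(t)

def numeric_rate(t, dt=1e-6):
    """Central finite-difference estimate of dz/dt along the path."""
    ahead = z(x_of(t + dt), y_of(t + dt))
    behind = z(x_of(t - dt), y_of(t - dt))
    return (ahead - behind) / (2 * dt)

t0 = 0.7
print(chain_rule_rate(t0), numeric_rate(t0))  # the two values should agree closely
```

Any other smooth surface and path should work the same way; only the hand-written partial derivatives need to change.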
Now, $f(x(t),y(t))$ can be interpreted as a path on the surface $f$, lying above a curve $(x(t),y(t))$ in the $xy$-plane. When that curve is a straight line and you move along it at constant speed, the rate of change of height is the derivative in that direction (not just in the $x$ or $y$ direction, which would give the partial derivatives). If you move at a different speed, say twice as fast, the height changes twice as much in the same amount of time, so this rate also doubles. The directional derivative proper is taken at unit speed (so it is like a regular partial derivative, just in a different direction), which is why the direction vector is normalized. Not normalizing simply means moving at a different speed, and since I don’t feel like normalizing, here is an unnormalized example: suppose $\frac{dx}{dt}=3$ and $\frac{dy}{dt}=2$. As you move along the vector $(3,2)$ in one unit of time, the rate of change of $z$ is what you get by substituting these into the multivariable chain rule, $3\frac{\partial z}{\partial x}+2\frac{\partial z}{\partial y}$, which is algebraically equivalent to the gradient dotted with the vector you move along in that unit of time.
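Here is that example as a small sketch, with $\frac{dx}{dt}=3$ and $\frac{dy}{dt}=2$. The function $f(x,y)=xy+y^2$ and the point $(1,2)$ are assumptions I picked so the arithmetic is easy; nothing depends on them.

```python
import math

def grad_f(x, y):
    # Partial derivatives of the hypothetical f(x, y) = x*y + y**2.
    return (y, x + 2 * y)

def rate_along(velocity, point):
    """Gradient dotted with the velocity: the chain rule with dx/dt, dy/dt plugged in."""
    gx, gy = grad_f(*point)
    vx, vy = velocity
    return gx * vx + gy * vy

point = (1.0, 2.0)        # gradient here is (2, 5)
velocity = (3.0, 2.0)     # (dx/dt, dy/dt)

rate = rate_along(velocity, point)          # 2*3 + 5*2 = 16
doubled = rate_along((6.0, 4.0), point)     # twice the speed, twice the rate

# Normalizing the velocity gives the directional derivative proper (unit speed).
speed = math.hypot(*velocity)
unit = (velocity[0] / speed, velocity[1] / speed)
directional = rate_along(unit, point)       # equals rate / speed

print(rate, doubled, directional)
```

So the unnormalized number is just the directional derivative multiplied by the speed, here $\sqrt{13}$.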
Additionally, the dot product measures how much of one vector lies along the direction of the other: it equals the (signed) length of the projection of one vector onto the other, multiplied by the length of the vector projected onto. It doesn’t matter which is projected onto which, because the dot product is symmetric. Locally, the gradient points in the direction of steepest ascent, and dotting a unit vector with it asks how well that unit motion lines up with the direction of main (positive) ascent, scaled by how steep that ascent is. This comparison to the gradient is logically the same as asking how much of the increase (the gradient) lies in a particular direction; hence, directional derivative.
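A quick sketch of the projection picture, with two vectors I made up for the purpose. It projects each vector onto the other, multiplies by the length of the one projected onto, and both routes land on the ordinary dot product.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def length(a):
    return math.hypot(a[0], a[1])

def scalar_projection(a, onto):
    """Signed length of the shadow that `a` casts along the direction of `onto`."""
    return dot(a, onto) / length(onto)

# Hypothetical example vectors.
a = (3.0, 4.0)
b = (2.0, 0.0)

via_b = scalar_projection(a, b) * length(b)   # project a onto b, scale by |b|
via_a = scalar_projection(b, a) * length(a)   # project b onto a, scale by |a|

print(dot(a, b), via_b, via_a)  # all three agree up to rounding
```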
*The gradient is how a function ascends, i.e., a certain amount in one direction plus a certain amount in the other(s): $\hat\imath\frac{\partial f}{\partial x}+\hat\jmath\frac{\partial f}{\partial y}$. Viewed in this light, it only makes sense that “how it’s ascending” points in the direction of steepest ascent. Likewise, the derivative of a single-variable function can be viewed as a vector (which is just a scalar in this case) telling you loosely “how it’s changing”: it points toward increasing $f$, with a size given by how fast $f$ increases. If you can see that, you can see the analogy for higher-dimensional gradients. Of course, you could also prove this analytically in several ways, one of which is working backwards and optimizing the angle of the directional derivative (which, weirdly, isn’t circular reasoning and doesn’t require you to already know about the gradient).
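The "optimize the angle" idea can be tried by brute force. This is only a rough sketch under my own assumptions: the function $f(x,y)=x^2+3xy$, the point $(1,1)$, and a grid of 3600 angles are all arbitrary choices. The slope in each direction is estimated by finite differences, so the gradient is never used in the search; it is computed afterwards just for comparison.

```python
import math

def f(x, y):
    # Hypothetical example function.
    return x**2 + 3 * x * y

def slope_in_direction(theta, point, h=1e-6):
    """Finite-difference slope of f when moving at unit speed at angle theta."""
    x, y = point
    ux, uy = math.cos(theta), math.sin(theta)
    return (f(x + h * ux, y + h * uy) - f(x - h * ux, y - h * uy)) / (2 * h)

point = (1.0, 1.0)
step = 2 * math.pi / 3600
angles = [k * step for k in range(3600)]

# Scan all directions and keep the steepest one.
best_angle = max(angles, key=lambda th: slope_in_direction(th, point))
best_slope = slope_in_direction(best_angle, point)

# For comparison only: the hand-computed partial derivatives at the point.
fx = 2 * point[0] + 3 * point[1]
fy = 3 * point[0]
gradient_angle = math.atan2(fy, fx) % (2 * math.pi)
gradient_length = math.hypot(fx, fy)

print(best_angle, gradient_angle)
print(best_slope, gradient_length)
```

Up to the resolution of the angle grid, the steepest direction found by the scan is the direction of $(f_x,f_y)$, and the steepest slope is its length.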
I haven’t thought this through too much, but since every differentiable function has a tangent plane, you could figure it out by doing it on a plane.
And again, choosing some real-world or more abstract variables for this and doing a thought experiment is fun. And as Mohammad said, the directional derivative, like an ordinary slope, is a scalar.
Given a function $f$ of one or more variables, if you pick an input $\mathbf x_0$ for the function $f$ (where we write $\mathbf x_0$ in bold face to indicate that it can be a vector of several variables), you can then define another function that is the change in $f$ as the input of $f$ changes away from $\mathbf x_0.$ That is, you can define a function that takes $\mathbf h,$ the amount by which we change the input, and produces an output value by the rule
$$ \mathbf h \to f(\mathbf x_0 + \mathbf h) - f(\mathbf x_0).$$
The function $f$ is differentiable at $\mathbf x_0$ if you can use a linear function of $\mathbf h$ to approximate this "difference function" arbitrarily well in a neighborhood of $\mathbf x_0,$ that is, if you only look at small enough changes $\mathbf h$: the error of the approximation must shrink faster than the size of $\mathbf h.$
The derivative of $f$ is the "multiplier" of that linear function.
If $f$ is a single-variable function, you can plot the graph $y = f(x)$ in two dimensions, and if you can put a line tangent to that graph at the point
$(x=x_0, y = f(x_0)),$ it gives you a linear approximation of how much $f(x)$ varies as $x$ varies around $x_0,$ and the derivative of $f$ at that point, $f'(x_0),$ is the slope of the line.
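To see what "good linear approximation" means in the single-variable case, here is a small sketch. The choice $f(x)=\sin x$ at $x_0=1$ is just an assumed example. It compares the actual change in $f$ with the change predicted by the tangent line and divides the error by the step size.

```python
import math

def f(x):
    # Hypothetical example function.
    return math.sin(x)

def f_prime(x):
    return math.cos(x)

x0 = 1.0

def error_per_step(h):
    """Tangent-line error divided by the size of the step h."""
    actual_change = f(x0 + h) - f(x0)
    linear_change = f_prime(x0) * h      # slope times distance traveled
    return abs(actual_change - linear_change) / abs(h)

steps = [10.0**-k for k in range(1, 5)]
errors = [error_per_step(h) for h in steps]

for h, e in zip(steps, errors):
    print(h, e)   # the error per unit step keeps shrinking as h shrinks
```

The error going to zero even after dividing by $h$ is what singles out the tangent line among all lines through the point.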
For a function of two variables you can plot the three-dimensional graph $z = f(x,y),$ and if you can put a plane tangent to that graph at the point
$(x=x_0, y = y_0, z = f(x_0,y_0)),$ you again have a linear approximation of how much the value of $f$ changes as its input changes.
But the "slope" of this plane cannot fully be described by a single number.
One way to describe the slope of the plane is by specifying the direction in which the plane is tilted and the slope $m$ of the plane in that direction. If you travel a distance $h$ in that direction the plane rises $mh.$ It falls an equal amount in the opposite direction, but if you travel perpendicular to that direction on the plane you don't rise or fall at all.
But since the plane rises or falls in a linear fashion depending on which direction you go and how far, you can measure its slope in the $x$ and $y$ directions and use those two numbers to find how much you rise or fall by traveling anywhere on the plane.
The slope in the $x$ direction is $\frac{\partial f}{\partial x}$ and it tells you that if you travel from $(x_0,y_0)$ to $(x_0+h_x,y_0),$ the plane rises by
$\frac{\partial f}{\partial x}h_x.$
To the extent that the plane is a good approximation of $f$ in the neighborhood of $(x_0,y_0),$
we can say $f(x_0,y_0) + \frac{\partial f}{\partial x}h_x$
is a good approximation of $f(x_0+h_x,y_0).$
Likewise, the slope in the $y$ direction is $\frac{\partial f}{\partial y}$; if you travel from $(x_0,y_0)$ to $(x_0,y_0+h_y),$ the plane rises by
$\frac{\partial f}{\partial y}h_y$; and
$f(x_0,y_0) + \frac{\partial f}{\partial y}h_y$
is an approximation of $f(x_0,y_0+h_y).$
What happens if you both travel $h_x$ in the $x$ direction and $h_y$ in the $y$ direction?
Since the plane is the plot of a linear function, the change in height is just the sum of what you would get by going only in the $x$ direction and what you get by going only in the $y$ direction, that is, you reach the height
$$f(x_0,y_0) + \frac{\partial f}{\partial x}h_x + \frac{\partial f}{\partial y}h_y,$$
which is an approximation of $f(x_0+h_x,y_0+h_y).$
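The additivity can be sketched numerically. The function $f(x,y)=e^x\sin y$, the point $(0.5, 1)$ and the step sizes are assumptions made up for this illustration. The rise on the tangent plane is computed separately for the $x$ step and the $y$ step, added, and compared with the true height of $f$.

```python
import math

def f(x, y):
    # Hypothetical example function.
    return math.exp(x) * math.sin(y)

def f_x(x, y):
    return math.exp(x) * math.sin(y)

def f_y(x, y):
    return math.exp(x) * math.cos(y)

x0, y0 = 0.5, 1.0
h_x, h_y = 0.01, -0.02

rise_x = f_x(x0, y0) * h_x    # rise from traveling only in the x direction
rise_y = f_y(x0, y0) * h_y    # rise from traveling only in the y direction

plane_height = f(x0, y0) + rise_x + rise_y   # height reached on the tangent plane
true_height = f(x0 + h_x, y0 + h_y)          # height of the actual surface

print(plane_height, true_height, abs(true_height - plane_height))
```

The two heights differ only by a term that is second order in the step sizes, which is why the plane is a good stand-in for $f$ near $(x_0,y_0)$.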
Your directional vector $\vec v = a \hat\imath + b \hat\jmath$
says you travel a distance $h_x = a$ in the $x$ direction and $h_y = b$ in the $y$ direction, so the formula above gives an increase equal to
$$\frac{\partial f}{\partial x}a + \frac{\partial f}{\partial y}b.$$
But when we say that $\vec v$ is a directional vector, we usually have in mind a unit vector, that is, $a^2 + b^2 = 1.$
Your intuition that a "nudge" of $\partial x$ in the $x$ direction and a "nudge"
$\partial y$ in the $y$ direction add up to a "nudge" of
$\sqrt{\partial x^2 +\partial y^2}$ is an accurate description of the magnitude of the combined "nudge," but it doesn't say anything about the direction of the "nudge." As the direction of the "nudge" gets closer to the $x$ direction,
the $x$-direction slope $\frac{\partial f}{\partial x}$ becomes more important
and the $y$-direction slope $\frac{\partial f}{\partial y}$ becomes less important
to the change in height, and vice versa as the direction of the "nudge" gets closer to the $y$ direction.
The linear function $a\frac{\partial f}{\partial x} + b\frac{\partial f}{\partial y}$
takes those relative influences into account.
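As a rough illustration, take a "nudge" of fixed length and point it in different directions. The slopes $\frac{\partial f}{\partial x}=4$ and $\frac{\partial f}{\partial y}=1$ and the nudge length $0.01$ are hypothetical numbers chosen for this sketch.

```python
import math

# Hypothetical slopes of the tangent plane at the point of interest.
f_x, f_y = 4.0, 1.0
nudge = 0.01   # every nudge has the same magnitude

def rise(angle_degrees):
    """Change in height for a nudge of fixed length at the given angle from the x axis."""
    a = math.cos(math.radians(angle_degrees))   # share of the nudge in the x direction
    b = math.sin(math.radians(angle_degrees))   # share of the nudge in the y direction
    return nudge * (a * f_x + b * f_y)

rises = {deg: rise(deg) for deg in (0, 30, 60, 90)}

for deg, r in rises.items():
    print(deg, r)
```

At $0^\circ$ only the $x$-slope matters, at $90^\circ$ only the $y$-slope matters, and in between the two are blended by $a$ and $b$, even though the magnitude of the nudge never changes.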
Best Answer
Let’s back up a bit. As Hans Ludmark points out in his comment above, the basic definition of the directional derivative in the direction specified by the unit vector $\mathbf u=(u_1,u_2)$ at a point $P=(a,b)$ is via a limit similar to the one from elementary calculus: $${\partial f\over\partial\mathbf u}(a,b)=\lim_{h\to0}{f(a+hu_1,b+hu_2)-f(a,b)\over h}.$$ As you’ve observed, this amounts to taking a vertical slice through the surface and then computing the ordinary derivative of that slice, as illustrated below.
This derivative is, of course, the slope of the tangent line (blue) to the slice at that point. Observe that this line is also the intersection of the tangent plane at that point (grayish blue) with the cutting plane (violet), so we can interpret the directional derivative as the steepness of the tangent plane in a given direction. As you rotate the cutting plane around $P$, the slope of this line changes, reaching a maximum when the two planes are perpendicular, as we’ll see below. (You can also see that this is the case by visualizing cutting a cylinder parallel to the $z$-axis by a plane and imagining what happens to the high point as you move that plane around.)
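The limit definition can also be watched converging. This is a minimal sketch with an assumed function $f(x,y)=x^2y+y^3$, point $(1,2)$ and unit vector $\mathbf u=(3/5,4/5)$, none of which come from the question. It evaluates the difference quotient for smaller and smaller $h$.

```python
def f(x, y):
    # Hypothetical example function.
    return x**2 * y + y**3

def difference_quotient(h, point, u):
    """(f(a + h*u1, b + h*u2) - f(a, b)) / h, the quantity inside the limit."""
    a, b = point
    return (f(a + h * u[0], b + h * u[1]) - f(a, b)) / h

point = (1.0, 2.0)
u = (3 / 5, 4 / 5)   # a unit vector: (3/5)^2 + (4/5)^2 = 1

steps = [10.0**-k for k in range(1, 6)]
quotients = [difference_quotient(h, point, u) for h in steps]

# For comparison: the partials at (1, 2) are f_x = 2xy = 4 and f_y = x^2 + 3y^2 = 13.
from_partials = 4 * u[0] + 13 * u[1]

for h, q in zip(steps, quotients):
    print(h, q)
print(from_partials)
```

The quotients settle toward the same number you get by combining the two partial derivatives with the weights $u_1$ and $u_2$.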
Let’s say that the tangent plane is given by the equation $\lambda x+\mu y-z=d$ with normal $\mathbf n_t=(\lambda,\mu,-1)$. A normal to the cutting plane is $\mathbf n_c=(-u_2,u_1,0)$, which is just $\mathbf u$ rotated ninety degrees. In $\mathbb R^3$ we can find the direction of the line of intersection via a cross product: $$\mathbf n_t\times\mathbf n_c=(u_1,u_2,\lambda u_1+\mu u_2)$$ and the slope of this line is thus $${\lambda u_1+\mu u_2\over\sqrt{u_1^2+u_2^2}}=\lambda u_1+\mu u_2=(\lambda,\mu)\cdot\mathbf u=\|(\lambda,\mu)\|\cos\phi,$$ where $\phi$ is the angle between the projection of $\mathbf n_t$ onto the $x$-$y$ plane and $\mathbf u$. The slope is therefore maximal when $\phi=0$, i.e., when $\mathbf u$ and the projection of $\mathbf n_t$ point in the same direction, but this happens when the two planes are perpendicular. The maximum value of this slope is $\|(\lambda,\mu)\|$.
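The cross product above is easy to get wrong by hand, so here is a small check. The values $\lambda=2$, $\mu=-1$ and the angle $0.3$ for $\mathbf u$ are arbitrary assumptions used only for the test.

```python
import math

def cross(p, q):
    """Cross product of two 3-vectors given as tuples."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

# Hypothetical plane coefficients and direction.
lam, mu = 2.0, -1.0
theta = 0.3
u1, u2 = math.cos(theta), math.sin(theta)   # unit vector u

n_t = (lam, mu, -1.0)    # normal to the tangent plane
n_c = (-u2, u1, 0.0)     # normal to the vertical cutting plane

direction = cross(n_t, n_c)          # direction of the line of intersection
expected = (u1, u2, lam * u1 + mu * u2)

# Slope of the line: rise in z over horizontal distance traveled.
slope = direction[2] / math.hypot(direction[0], direction[1])

print(direction, expected)
print(slope, lam * u1 + mu * u2)
```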
This is where the gradient of $f$ comes in. If we write the equation of the surface as $F(x,y,z)=f(x,y)-z=0$, then $\nabla F=(f_x,f_y,-1)$ is normal to the surface, so an equation of the tangent plane at $(a,b,f(a,b))$ is $$xf_x(a,b)+yf_y(a,b)-z=af_x(a,b)+bf_y(a,b)-f(a,b).$$ This is exactly in the form analyzed above, with $\lambda=f_x(a,b)$ and $\mu=f_y(a,b)$, so $${\partial f\over\partial\mathbf u}(a,b)=\nabla f(a,b)\cdot\mathbf u$$ with the maximal rate of change given by $\|\nabla f(a,b)\|$.
This seems awfully coincidental, but it’s not. Going back to the plane equation $\lambda x+\mu y-z=d$ above, the coefficients $\lambda$ and $\mu$ are respectively the “$x$-slope” and “$y$-slope” of the plane, i.e., the slopes of its lines of intersection with vertical planes parallel to the $x$- and $y$-axes. These slopes are encoded in the normal $(\lambda,\mu,-1)$. For the tangent plane, these slopes are the directional derivatives in the directions of the coordinate axes, also known as the partial derivatives of $f$.