Why is the derivative of a scalar with respect to a vector a vector and not a scalar?

Tags: calculus, derivatives, partial derivative

I'm really confused about matrix calculus, and especially about partial derivatives. When do we need to sum up partial derivatives to get a total derivative, and when do we get a vector of partial derivatives as our derivative? I struggle to tell the two situations apart. Here is an example to make it concrete:

$L$ is a scalar, $\mathbf{o}$ is a vector of size $K$ and $\mathbf{y}$ is a vector of size $K$.

$$L = -\sum_{k} \log(y_k)$$
$$\mathbf{y} = \text{softmax}(\mathbf{o})$$

So if we want the derivative of $L$ with respect to $\mathbf{o}$, we need to sum over the partial derivatives with respect to the components of $\mathbf{y}$ to get the total derivative; at least, that is what I understood from reading about multivariate calculus:

$$\frac{\partial L}{\partial \mathbf{o}} = \frac{\partial L}{\partial \mathbf{y}}\frac{\partial \mathbf{y}}{\partial \mathbf{o}} = \sum_{k}\frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial \mathbf{o}} =
-\sum_{k} \frac{1}{y_k} \frac{\partial y_k}{\partial \mathbf{o}}$$

However, $\frac{\partial L}{\partial \mathbf{o}}$ then seems to be a vector of the partial derivatives of $L$ with respect to every component of $\mathbf{o}$, i.e.:

$$ \frac{\partial L}{\partial \mathbf{o}} = \left< \frac{\partial L}{\partial o_1}, \frac{\partial L}{\partial o_2}, …, \frac{\partial L}{\partial o_K} \right> $$

But shouldn't the derivative be the sum of all the partial derivatives with respect to the components of $\mathbf{o}$, so that we get the total derivative?

That is, shouldn't the solution be:

$$\frac{\partial L}{\partial \mathbf{o}} = \frac{\partial L}{\partial \mathbf{y}}\frac{\partial \mathbf{y}}{\partial \mathbf{o}} = -\sum_{k} \frac{1}{y_k} \sum_{i} \frac{\partial y_k}{\partial o_i}$$

and then it's just a scalar?
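
To make the question concrete, here is the kind of quick numerical check I have in mind (a sketch in NumPy with made-up values; `softmax` and the finite-difference loop are my own helpers, not anything standard):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())          # shift by the max for numerical stability
    return e / e.sum()

def L(o):
    y = softmax(o)
    return -np.sum(np.log(y))        # the scalar loss defined above

K = 4
o = np.array([0.5, -1.0, 2.0, 0.3])

# finite-difference approximation of dL/do_i for every component i
eps = 1e-6
grad = np.array([
    (L(o + eps * np.eye(K)[i]) - L(o - eps * np.eye(K)[i])) / (2 * eps)
    for i in range(K)
])
print(grad.shape)   # (4,): one number per component of o
print(grad.sum())   # summing these gives a single number, but is that "the" derivative?
```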

Best Answer

The derivative of a function $f : R^n \to R^m$ is the linearization (i.e. approximation by a linear function) of the function around the given point. Therefore, it must still be a function $R^n \to R^m$, but linear. This is represented by a matrix in $R^{m \times n}$. If the output dimension is $m = 1$, i.e. $f$ is a scalar function, that matrix has the shape of a row vector in $R^{1 \times n}$.
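
As a rough illustration of the shapes (a NumPy sketch; the functions $f$ and $g$ below are arbitrary examples of my own choosing, one $R^3 \to R^2$ and one $R^3 \to R$):

```python
import numpy as np

# an arbitrary f : R^3 -> R^2, purely for illustration
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

x0 = np.array([1.0, 2.0, 0.5])
eps = 1e-6

# finite-difference Jacobian: one column per input coordinate
J = np.column_stack([
    (f(x0 + eps * np.eye(3)[j]) - f(x0 - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(J.shape)   # (2, 3): an m x n matrix, representing the linearization

# for a scalar function g : R^3 -> R the same construction gives a 1 x n row vector
def g(x):
    return x @ x

Jg = np.array([[(g(x0 + eps * np.eye(3)[j]) - g(x0 - eps * np.eye(3)[j])) / (2 * eps)
                for j in range(3)]])
print(Jg.shape)  # (1, 3)
```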

In your case (with $n = K$), $L : R^n \to R$, so $\frac{\partial L}{\partial \mathbf{y}}$ is $1 \times n$, while $\frac{\partial \mathbf{y}}{\partial \mathbf{o}}$ is $n \times n$. The first sum $\sum_k$ that you wrote is exactly the "row-vector $\times$ matrix" multiplication of those two, and its result is therefore again a $1 \times n$ row vector, not a scalar. Summing over $i$ as well, as in your second formula, would collapse the distinct components of the gradient into one number and lose information.
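
You can check this numerically for the softmax case (a sketch; I'm using the standard softmax Jacobian $\frac{\partial y_k}{\partial o_i} = y_k(\delta_{ki} - y_i)$, and the array names are my own):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

o = np.array([0.5, -1.0, 2.0, 0.3])
K = o.size
y = softmax(o)

dL_dy = -1.0 / y                      # the 1 x K row vector, stored as a flat array
dy_do = np.diag(y) - np.outer(y, y)   # the K x K softmax Jacobian
dL_do = dL_dy @ dy_do                 # row vector times matrix -> length-K vector

print(dL_do.shape)                    # (4,): a vector, not a scalar
# for this particular L the product simplifies to K*y - 1
print(np.allclose(dL_do, K * y - 1))  # True
```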