Intuition behind “Transpose Matrix”

intuition, linear algebra, transpose

I have come across the derivative $\frac{\partial \mathbf{w}}{\partial \mathbf{w}^T}$ many times now, and I notice that it appears to be equivalent to the transpose operator. That is, if we have something of the form $\mathbf{A}\frac{\partial \mathbf{w}}{\partial \mathbf{w}^T}$, we can rewrite it as $\mathbf{A}^T$ whenever the dimensions of $\mathbf{A}$ and $\frac{\partial \mathbf{w}}{\partial \mathbf{w}^T}$ agree.

Intuitively, the change of $\mathbf{w}$ with respect to $\mathbf{w}^T$ is a transpose. Is my intuition valid, and is there a way to prove this?

$\textbf{EDIT1:}$ To clarify, $\mathbf{A}$ is a matrix while $\mathbf{w}$ and $\mathbf{w}^T$ are vectors. The vector $\mathbf{w}$ is taken to be a column vector of dimension $n$ and $\mathbf{A}$ is taken to be an $n \times n$ matrix. From this we have that $\frac{\partial \mathbf{w}}{\partial \mathbf{w}^T}$ is an $n \times n$ matrix $\mathbf{W}$ where $W_{ij} = \frac{\partial w_i}{\partial w_j}$. The question that follows is: why does $\mathbf{W}$ act as a transpose operator on $\mathbf{A}$?

$\textbf{EDIT2:}$ To give an example, when computing $\nabla_{\mathbf{w}} \text{MSE}_{\text{train}}$ in this post (last response), the author has a step that goes from $\mathbf{w}\mathbf{X}^T\mathbf{X}\frac{\partial \mathbf{w}}{\partial \mathbf{w}^T}$ to $(\mathbf{w}\mathbf{X}^T\mathbf{X})^T$.

Best Answer

Sadly, there is no intuition to be learned, only that matrix differential calculus has inconsistent notation.

I wouldn't see $\frac{\partial w}{\partial w}$ as something that "does transposition": the result you mention follows from the linearity of differentiation and from using the denominator convention.

By linearity of differentiation, $$A \frac{\partial w}{\partial w} = \frac{\partial A w}{\partial w}$$

If we follow the numerator convention, this is the Jacobian matrix of the linear function $Aw$, that is, the matrix whose $(i,j)$th element is $$\frac{\partial A_i^T w}{\partial w_j} ,$$ where $A_i^T$ is the $i$th row of $A$. Then, $$\left[\frac{\partial A w}{\partial w}\right]_{i,j} =\frac{\partial }{\partial w_j} \sum_{k=1}^n A_{i,k}w_k=A_{i,j},$$ so $$A \frac{\partial w}{\partial w} = A.$$
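As a small sanity check of the numerator-convention formula, here is the case $n = 2$ written out; the entries $a_{ij}$ are just generic placeholders introduced for illustration. $$Aw = \begin{pmatrix} a_{11} w_1 + a_{12} w_2 \\ a_{21} w_1 + a_{22} w_2 \end{pmatrix}, \qquad \left[\frac{\partial A w}{\partial w}\right]_{i,j} = \frac{\partial (Aw)_i}{\partial w_j} \quad\Longrightarrow\quad \frac{\partial A w}{\partial w} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = A.$$ Row $i$ collects the partial derivatives of the $i$th output, so the entries land exactly where they sit in $A$.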

If we instead follow the denominator convention, then your expression means the gradient of the linear function $Aw$, that is, the matrix whose $(i,j)$th element is

$$\frac{\partial A_j^T w}{\partial w_i}$$ Then $$\left[\frac{\partial A w}{\partial w}\right]_{i,j} =\frac{\partial }{\partial w_i} \sum_{k=1}^n A_{j,k}w_k = A_{j,i},$$ so $$A \frac{\partial w}{\partial w} = A^T.$$
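Both layouts can also be checked numerically. The following is a minimal sketch in plain Python, not something taken from the post under discussion: the helper names (`matvec`, `jacobian_numerator`, `transpose`, `close`), the test matrix `A`, and the point `w` are all made up for illustration. It approximates the partial derivatives of $w \mapsto Aw$ by central differences, arranges them in numerator layout, and then transposes that arrangement to obtain the denominator layout.

```python
def matvec(A, w):
    """Return the product A w for a list-of-rows matrix A and a list w."""
    return [sum(a_ik * w_k for a_ik, w_k in zip(row, w)) for row in A]


def jacobian_numerator(f, w, eps=1e-4):
    """Central-difference estimate of J[i][j] = d f_i / d w_j (numerator layout)."""
    n_in = len(w)
    n_out = len(f(w))
    J = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        f_plus, f_minus = f(w_plus), f(w_minus)
        for i in range(n_out):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def transpose(M):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]


def close(M, N, tol=1e-6):
    """True if two equally shaped matrices agree entrywise within tol."""
    return all(abs(a - b) <= tol
               for row_m, row_n in zip(M, N)
               for a, b in zip(row_m, row_n))


# Hypothetical, deliberately non-symmetric test matrix and evaluation point.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 10.0]]
w = [0.5, -1.0, 2.0]

J_num = jacobian_numerator(lambda v: matvec(A, v), w)  # entry (i, j) = d(Aw)_i / dw_j
J_den = transpose(J_num)                               # entry (i, j) = d(Aw)_j / dw_i

print("numerator layout equals A:    ", close(J_num, A))
print("denominator layout equals A^T:", close(J_den, transpose(A)))
```

The same numbers appear in both results; only the placement of each partial derivative in the matrix differs, which is the whole content of the two conventions.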

The fact that the author uses a transposed vector in the denominator seems to indicate the numerator convention, but the transposed result hints at the denominator convention. Confusing indeed!