Derivative of a row vector with respect to a column vector; derivative with respect to a row vector assuming numerator layout.

calculus, linear-algebra, matrix-calculus

Refer: https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector
I see above that, assuming the numerator layout, the derivative of an $(m \times 1)$ column vector $\textbf{y}$ with respect to an $(n \times 1)$ column vector $\textbf{x}$ gives an $(m \times n)$ matrix whose $(i,j)$-th element is $\frac{\partial y_i}{\partial x_j}$.
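To double-check the layout numerically, here is a minimal finite-difference sketch (NumPy assumed; the `jacobian` helper is purely illustrative, not something from the linked pages):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate the numerator-layout Jacobian [dy_i/dx_j] of f at x."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - y) / eps   # column j holds the partials w.r.t. x_j
    return J

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])              # m = 2, n = 3
x = np.array([1., -1., 2.])

# For y = A x, numerator layout gives an (m x n) = (2 x 3) Jacobian equal to A.
print(np.allclose(jacobian(lambda v: A @ v, x), A, atol=1e-4))   # True
```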

Refer: https://youtu.be/WrH-jpJIqFQ?list=PLhcN-s3_Z7-YS6ltpJhjwqvHO1TYDbiZv&t=222
I see above how that matrix can be obtained by employing the Kronecker product (if I have understood the video correctly).

1. How is the derivative of a row vector with respect to a column vector defined? How is the resultant matrix obtained, what is its dimension, and what is its $(i,j)$-th element?

2. How is the derivative with respect to a row vector defined?

Background:
I was going through the identities in

  1. https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector_identities
    Using the above sources, I can understand that $\frac{\partial A\textbf{x}}{\partial \textbf{x}}=A$, but I can't get $\frac{\partial \textbf{x}^TA}{\partial \textbf{x}}=A^T$.
  2. https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-vector_identities
    I can't understand how to take the derivative with respect to $\textbf{x}^T$ (a row vector) in $\frac{\partial^2 f}{\partial \textbf{x}\, \partial \textbf{x}^T} = H^T$.

Best Answer

Too long for a comment.

The usual mathematical definition of a derivative for $f:\mathbf{R}^p \to \mathbf{R}^q$ will yield $f'(x) = \left[ \dfrac{\partial f_i}{\partial x_j} \right]$ of dimensions $(q,p).$
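For illustration, take $f:\mathbf{R}^2 \to \mathbf{R}^3$ with $$ f(x_1, x_2) = \begin{pmatrix} x_1 x_2 \\ x_1^2 \\ x_2 \end{pmatrix}, \qquad f'(x) = \begin{pmatrix} x_2 & x_1 \\ 2x_1 & 0 \\ 0 & 1 \end{pmatrix}, $$ a matrix of dimensions $(3,2)$: row $i$ collects the partials of $f_i,$ and column $j$ the partials with respect to $x_j.$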

In statistics, they often define, for $f:\mathbf{Mat}_{(p,q)} \to \mathbf{R},$ $\dfrac{\partial f}{\partial X} = \left[ \dfrac{\partial f}{\partial x_{ij}} \right],$ which is a matrix of dimensions $(p,q).$ When $q = 1,$ these are matrices of dimension $(p,1),$ which can be identified with column vectors (the usual way to write vectors) and hence with $\mathbf{R}^p.$ The statistician therefore takes, for $f:\mathbf{R}^p \to \mathbf{R},$ $\dfrac{\partial f}{\partial x}$ to be of dimension $(p,1)$ (a column vector). This clashes with the definition above. The clash is resolved by taking transposes. The statistician will also use "$\partial x^\intercal$" to mean "take the transpose of this matrix derivative." Thus, $\dfrac{\partial f}{\partial x^\intercal}$ will be of dimension $(1,p)$ and coincides with the standard definition in mathematics.
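A quick numerical sketch of this clash, assuming NumPy (the quadratic $f$ and the matrix below are arbitrary choices for illustration):

```python
import numpy as np

B = np.array([[1., 2., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])
x = np.array([[0.5], [-1.0], [2.0]])        # a (3, 1) column vector

f = lambda v: (v.T @ B @ v).item()          # scalar f(x) = x^T B x

# Mathematician's derivative: the (1, p) row of partials, via finite differences.
eps = 1e-6
row = np.array([[(f(x + eps * e) - f(x)) / eps
                 for e in np.eye(3).reshape(3, 3, 1)]])

# Statistician's df/dx: the (p, 1) column (B + B^T) x.
grad = (B + B.T) @ x

print(np.allclose(row, grad.T, atol=1e-4))  # True: df/dx^T = (df/dx)^T
```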

How to resolve the issue? Trace the dimensions; that is it. Both definitions yield the same numbers, just arranged differently.
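As an example of tracing the dimensions, the identity from the question can be worked out componentwise. The row vector $x^\intercal A$ has $i$-th entry $(x^\intercal A)_i = \sum_k x_k a_{ki},$ so in numerator layout $$ \frac{\partial\, x^\intercal A}{\partial x} = \left[ \frac{\partial (x^\intercal A)_i}{\partial x_j} \right] = \left[ a_{ji} \right] = A^\intercal. $$ Equivalently, $(x^\intercal A)^\intercal = A^\intercal x,$ and the first identity gives $\dfrac{\partial A^\intercal x}{\partial x} = A^\intercal.$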

Why does the statistician do this? The main reason is that if $f,$ a real-valued function of the matrix $X,$ is differentiated and the derivative is then evaluated at some "increment" $H = (h_{ij}),$ we must get $\sum\limits_{i,j} h_{ij} \dfrac{\partial f}{\partial x_{ij}}.$ It is just algebra to show that this equals $$ \mathrm{tr} \left( H^\intercal \dfrac{\partial f}{\partial X} \right). $$ This proves to be a very efficient way to represent the derivative of matrix functions, and it resembles the gradient and dot product of the standard case. That is, if $f:\mathbf{R}^p \to \mathbf{R},$ then $$ f'(x) \cdot h = \nabla f(x)^\intercal h $$ (in fact, the two expressions coincide, since the trace of a scalar is the scalar itself). The so-called "matrix calculus" is therefore developed ad hoc to satisfy the statistician. Many of its formulas are concise and useful, albeit cumbersome to derive.
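A small numerical check of the trace formula, assuming NumPy (the choice $f(X) = \mathrm{tr}(X^\intercal X),$ for which $\partial f/\partial X = 2X,$ is just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))
H = rng.standard_normal((3, 2))             # the "increment"

f = lambda M: np.trace(M.T @ M)             # real-valued f with df/dX = 2X

eps = 1e-6
directional = (f(X + eps * H) - f(X)) / eps         # sum_ij h_ij df/dx_ij
trace_form  = float(np.trace(H.T @ (2 * X)))        # tr(H^T df/dX)

print(np.isclose(directional, trace_form, atol=1e-4))   # True
```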