Derivative of a vector with respect to a matrix

matrices, matrix-calculus

I have been confused by matrix calculus for hours, since I have found different versions of the relevant computations online.

Suppose there is a matrix $A \in \mathbb{R}^{L\times K}$ and a column vector $x \in \mathbb{R}^{K\times 1}$, and I want to take the derivative $\frac{\partial Ax}{\partial A}$.

The first version of the formula I found online suggested simply taking the derivative of the column vector with respect to the matrix element-wise: for each scalar in that vector, I need to find the derivative of that scalar with respect to the matrix. I then checked what it said about the derivative of a scalar with respect to a matrix. It said that I need to take the derivatives of that scalar with respect to each element of the matrix, element-wise, so the result is a matrix of the same shape.

However, if the above statement were true, then for each of the $L$ elements of $Ax$ I would get a matrix of shape $L\times K$, and consequently, stacking them, $\frac{\partial Ax}{\partial A}$ would be a matrix of shape $L^{2} \times K$. I think this must be wrong, because it deviates from the expected answer.

Then I found that most of the answers online suggested the result is actually $x^{T}\otimes \mathbb{I}$, which raised further questions for me.
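For concreteness, here is a small numerical check of that claim (my own sketch; it assumes the column-major vec convention, which the online answers did not always state). Interpreted as the Jacobian of $y = Ax$ with respect to $\operatorname{vec}(A)$, the result should be $x^{T}\otimes \mathbb{I}_L$, of shape $L\times LK$:

```python
import numpy as np

# Sketch: the Jacobian of y = A x with respect to vec(A)
# (column-major vectorization) should equal  x^T kron I_L.
rng = np.random.default_rng(0)
L, K = 3, 4
A = rng.standard_normal((L, K))
x = rng.standard_normal(K)

# Finite-difference Jacobian, one entry of vec(A) at a time.
eps = 1e-6
J_fd = np.zeros((L, L * K))
vecA = A.flatten(order="F")                     # column-major vec(A)
for k in range(L * K):
    v = vecA.copy()
    v[k] += eps
    A_pert = v.reshape((L, K), order="F")
    J_fd[:, k] = (A_pert @ x - A @ x) / eps

J_kron = np.kron(x.reshape(1, -1), np.eye(L))   # x^T kron I_L, shape L x (L*K)
print(np.allclose(J_fd, J_kron, atol=1e-4))     # True
```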

Firstly, how do I get the correct result? How can I understand its derivation in an intuitive way?

Secondly, I also found the article "Matrix calculus" on Wikipedia; it covers almost every situation in a very long table, which is hard to remember and not at all easy to comprehend. Is there a universal rule or general solution to such problems?

Best Answer

$ \def\d{\delta} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\p{\partial}\def\grad#1#2{\frac{\p #1}{\p #2}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $The most straightforward approach is to write the equation using the Einstein summation convention, then calculate the gradient directly as $$\eqalign{ y_{i} &= A_{ij}\,x_{j} \\ \grad{y_{i}}{A_{pq}} &= \bigg(\grad{A_{ij}}{A_{pq}}\bigg)\,x_{j} \\ &= (\d_{ip}\d_{jq})\,x_{j} \\ &= \d_{ip}\,x_{q} \\ }$$ where $\d_{ij}$ is the Kronecker delta symbol.

As you can see, the result has three free indices $(i,p,q)$ and is therefore a third-order tensor, of shape $L\times L\times K$.
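To make the index bookkeeping concrete, here is a small numerical check of this result (the shapes and random test values are my own): build the tensor $\delta_{ip}\,x_q$ explicitly and compare it against finite differences of $y = Ax$.

```python
import numpy as np

# Sketch: realize  d y_i / d A_{pq} = delta_{ip} x_q  as an explicit
# (L, L, K) array and check it against finite differences of y = A x.
rng = np.random.default_rng(1)
L, K = 3, 4
A = rng.standard_normal((L, K))
x = rng.standard_normal(K)

# delta_{ip} x_q, built with einsum over indices i, p, q.
T = np.einsum("ip,q->ipq", np.eye(L), x)

# Finite-difference check: perturb one entry A[p, q] at a time.
eps = 1e-6
T_fd = np.zeros((L, L, K))
for p in range(L):
    for q in range(K):
        A_pert = A.copy()
        A_pert[p, q] += eps
        T_fd[:, p, q] = (A_pert @ x - A @ x) / eps

print(np.allclose(T, T_fd, atol=1e-4))   # True
```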

Another approach is to consider a set of vector-valued gradients (one for each component of $A$) $$\eqalign{ y &= Ax \\ \grad{y}{A_{pq}} &= \bigg(\grad{A}{A_{pq}}\bigg)\,x = E_{pq}\,x = e_p\,e_q^Tx = e_p\,x_q \\ }$$ where $E_{pq}=e_pe_q^T\,$ is a matrix whose components are all zero except for the $(p,q)^{th}$ component which is equal to one. Similarly, $e_p$ is a vector whose components are all zero except for the $(p)^{th}$ component which is equal to one.
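As a quick illustration (the particular values of $x$, $p$, and $q$ below are mine), the product $E_{pq}\,x$ really does pick out $x_q$ and place it in the $p^{th}$ slot:

```python
import numpy as np

# Illustration of the vector-valued gradient  dy/dA_{pq} = E_{pq} x = e_p x_q
# for one particular (p, q).
L, K = 3, 4
x = np.arange(1.0, K + 1)          # x = [1, 2, 3, 4]
p, q = 2, 1

e_p = np.zeros(L); e_p[p] = 1.0    # standard basis vector in R^L
e_q = np.zeros(K); e_q[q] = 1.0    # standard basis vector in R^K
E_pq = np.outer(e_p, e_q)          # single-entry matrix e_p e_q^T

print(E_pq @ x)                    # [0. 0. 2.]
print(e_p * x[q])                  # same vector: e_p scaled by x_q
```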

A related approach is to consider a set of matrix-valued gradients (one for each component of $y$) $$\eqalign{ y_{i} &= A_{ij}\,x_{j} \\ \grad{y_i}{A} &= \bigg(\grad{A_{ij}}{A}\bigg)\,x_{j} = E_{ij}\,x_{j} = e_{i}e_{j}^Tx_{j} = e_{i}x^T \\ }$$
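And, for one fixed $i$, summing $E_{ij}\,x_j$ over $j$ does reproduce $e_i x^T$, a matrix whose $i^{th}$ row is $x^T$ and whose other rows are zero (again a small check with example values of my own):

```python
import numpy as np

# Check of the matrix-valued gradient  dy_i/dA = sum_j E_{ij} x_j = e_i x^T
# for one particular i.
L, K = 3, 4
x = np.arange(1.0, K + 1)
i = 0

e_i = np.zeros(L); e_i[i] = 1.0
G = sum(np.outer(e_i, np.eye(K)[j]) * x[j] for j in range(K))  # sum_j E_{ij} x_j
print(np.allclose(G, np.outer(e_i, x)))    # True: row i is x^T, other rows zero
```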

All of these results are awkward to work with and to calculate. I suspect that you think you need this particular third-order tensor because you are attempting to apply the Chain Rule as part of a bigger problem.

If that is the case, then you should be aware that there are methods for solving such problems which do not require these higher-order tensors.
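For example (this particular shortcut is my own illustration of the kind of method meant, not something spelled out above): if the gradient of $Ax$ is only an intermediate step toward a scalar objective $\phi = f(Ax)$, then the chain rule collapses to $\frac{\partial \phi}{\partial A} = \frac{\partial \phi}{\partial y}\,x^T$, an ordinary $L\times K$ matrix, and the third-order tensor never needs to be formed explicitly.

```python
import numpy as np

# Sketch (example objective is mine): for a scalar phi = f(A x),
#     d phi / dA = (d phi / dy) x^T,
# an L x K matrix, with no third-order tensor required.
rng = np.random.default_rng(2)
L, K = 3, 4
A = rng.standard_normal((L, K))
x = rng.standard_normal(K)

phi = lambda A_: np.sum(np.tanh(A_ @ x))        # example scalar objective f(Ax)
g = 1.0 - np.tanh(A @ x) ** 2                   # d phi / dy for f = sum(tanh(y))
grad_A = np.outer(g, x)                         # (d phi / dy) x^T, shape L x K

# Finite-difference check of one entry.
eps = 1e-6
E = np.zeros((L, K)); E[1, 2] = 1.0
fd = (phi(A + eps * E) - phi(A)) / eps
print(np.isclose(grad_A[1, 2], fd, atol=1e-4))  # True
```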