Derivative of Mahalanobis pairwise distance matrix with respect to transformation matrix

machine learning, mahalanobis distance, matrices, matrix-calculus, optimization

For a set of vectorial observations $x_i$ stored in the matrix $X$, I would like to obtain the gradient of the pairwise Mahalanobis distance matrix $D$ with respect to the Mahalanobis transformation matrix $M$ ($\frac{\partial D}{\partial M}$), given that:

$D_{ij} = (x_i - x_j)^T M (x_i - x_j)$

Edit: $M$ should be positive semi-definite, or alternatively $M = A^T A$.
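For concreteness, the definition computed by a plain double loop looks like this (a small numpy sketch; the data and the function name `pairwise_mahalanobis` are just illustrative, with the observations stored as the columns of $X$):

```python
import numpy as np

def pairwise_mahalanobis(X, M):
    """D[i, j] = (x_i - x_j)^T M (x_i - x_j), with x_i the i-th column of X."""
    m, n = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = X[:, i] - X[:, j]
            D[i, j] = d @ M @ d
    return D

# illustrative data: m = 3 features, n = 5 observations
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
A = rng.standard_normal((3, 3))
M = A.T @ A                       # positive semi-definite, as in the edit above
D = pairwise_mahalanobis(X, M)
```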

I have found how the Euclidean pairwise distance matrix can be expressed in a vectorized form (Pairwise distance matrix), and how the Mahalanobis distance between two elements can be differentiated with respect to the matrix $M$ (Differentiating mahalanobis distance), but I have not been able to put everything together.

Best Answer

$ \def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Big(#1\Big)} \def\bR#1{\big(#1\big)} \def\vecc#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Sym#1{\operatorname{Sym}\!\BR{#1}} \def\sym#1{\operatorname{Sym}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} \def\c#1{\color{red}{#1}} $First, define some new variables $$\eqalign{ B &= \tfrac 12D \\ y_k &= Mx_k \\ X &= \m{x_1&x_2&\ldots&x_n}\;\in{\mathbb R}^{m\times n} \\ Y &= \m{y_1&y_2&\,\ldots&\,y_n}\;= MX \qiq \c{dY = dM\,X} \\ J &=X\oslash X \qiq J_{ij}=\o\quad\big({\rm all\!-\!ones\;matrix}\big) \\ X &= J\odot X \\ {\cal D}_Y &= {\rm Diag}\BR{\!\vecc{Y}}\,\in{\mathbb R}^{mn\times mn} \\ }$$ where $\{\odot,\oslash\}$ denote elementwise/Hadamard multiplication and division.
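These definitions translate one-for-one into code; here is a minimal numpy sketch (assuming the observations are the columns of $X$; the variable names mirror the math):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
X = rng.standard_normal((m, n))          # columns x_1, ..., x_n
M = rng.standard_normal((m, m))
M = M.T @ M                              # symmetric positive semi-definite

Y = M @ X                                # y_k = M x_k, stacked as columns
J = np.ones_like(X)                      # J = X / X elementwise: the m-by-n all-ones matrix
assert np.allclose(X, J * X)             # X = J (Hadamard) X
D_Y = np.diag(Y.flatten(order="F"))      # Diag(vec(Y)), an mn-by-mn matrix
```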

Then rewrite the problem in pure matrix notation and calculate its differential. $$\eqalign{ 2B &= (X\odot Y)^TJ + J^T(X\odot Y) - X^TY - Y^TX \\ &= 2\,\Sym{J^T(X\odot{Y}) - X^T{Y}} \\ dB &= \Sym{J^T(X\odot{\c{dY}}) - X^T{\c{dY}}} \\ &= \Sym{J^T(X\odot\LR{\c{dM\,X}}) - X^T{\c{dM\,X}}} \\ }$$ where $\;\sym{A} \doteq \tfrac 12\LR{A+A^T}$
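A quick numerical check that the matrix expression reproduces the loop-based definition (a sketch with random data; `sym` implements the $\operatorname{Sym}$ operator above):

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

rng = np.random.default_rng(1)
m, n = 3, 5
X = rng.standard_normal((m, n))
M = rng.standard_normal((m, m))
M = M.T @ M
Y, J = M @ X, np.ones((m, n))

# loop-based D from the definition D_ij = (x_i - x_j)^T M (x_i - x_j)
D = np.array([[(X[:, i] - X[:, j]) @ M @ (X[:, i] - X[:, j])
               for j in range(n)] for i in range(n)])

# vectorized 2B = 2 Sym(J^T (X . Y) - X^T Y), where . is the elementwise product
twoB = 2 * sym(J.T @ (X * Y) - X.T @ Y)
assert np.allclose(D, twoB)
```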

At this point, one can calculate a component-wise gradient $$\eqalign{ \grad{B}{M_{ij}} &= \Sym{J^T(X\odot\LR{E_{ij}\,X}) - X^T{E_{ij}\,X}} \\ }$$ where $E_{ij}$ is a single-entry matrix whose $(i,j)$ element is $\o$ and all others are zero.
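Since $B$ is linear in $M$, a central finite difference should reproduce this component-wise gradient to roundoff; here is a spot check for one $(i,j)$ pair (a sketch, with ad hoc helper names):

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

def B_of(M, X, J):
    Y = M @ X
    return sym(J.T @ (X * Y) - X.T @ Y)

rng = np.random.default_rng(2)
m, n = 3, 5
X = rng.standard_normal((m, n))
M = rng.standard_normal((m, m))
M = M.T @ M
J = np.ones((m, n))

i, j, h = 1, 2, 1e-4
E = np.zeros((m, m))
E[i, j] = 1.0                            # single-entry matrix E_ij

grad_formula = sym(J.T @ (X * (E @ X)) - X.T @ E @ X)
grad_numeric = (B_of(M + h * E, X, J) - B_of(M - h * E, X, J)) / (2 * h)
assert np.allclose(grad_formula, grad_numeric)
```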

In terms of the original variable $$\eqalign{ \grad{B}{M_{ij}} &= \frac 12 \LR{\grad{D}{M_{ij}}} \\ }$$ The problem with the requested gradient $\LR{\grad DM}$ is that it's a fourth-order tensor, which cannot be written in standard matrix notation.
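In code one can still materialize it as a four-axis array, e.g. with `G[i, j]` holding the $n\times n$ matrix $\partial B/\partial M_{ij}$ (a sketch; the stacking convention is arbitrary):

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

rng = np.random.default_rng(3)
m, n = 3, 5
X = rng.standard_normal((m, n))
J = np.ones((m, n))

# G[i, j] holds the n-by-n matrix dB/dM_ij, so G has shape (m, m, n, n)
G = np.zeros((m, m, n, n))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = 1.0
        G[i, j] = sym(J.T @ (X * (E @ X)) - X.T @ E @ X)
print(G.shape)                           # (3, 3, 5, 5): fourth-order, not a matrix
```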

Update

You've added a constraint to your original question, i.e. $$M=A^TA \qiq \c{dM=dA^TA+A^TdA}$$ The procedure to find the new gradient is straightforward. Write the differential, change the independent variable from $M\to A$, then isolate the gradient.
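The differential of the constraint is just the expansion of the perturbed product with the second-order term dropped: $$\eqalign{ (A+dA)^T(A+dA) &= A^TA + dA^TA + A^TdA + dA^TdA \\ &\approx M + dA^TA + A^TdA \\ }$$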

Here is the calculation for the component-wise gradient $$\eqalign{ dB &= \Sym{J^T(X\odot\LR{\c{dM\,X}}) - X^T\LR{\c{dM\,X}}} \\ &= \Sym{J^T\BR{X\odot\LR{\c{dA^TAX+A^TdA\,X}}} - X^T\LR{\c{dA^TAX+A^TdA\,X}}} \\ &= 2 \Sym{J^T\bR{\!\LR{AX}\odot\LR{dA\,X}\!} - X^TA^TdA\,X} \\ \grad{B}{A_{ij}} &= 2 \Sym{J^T\bR{\!\LR{AX}\odot\LR{E_{ij}X}\!}-X^TA^TE_{ij}X} \\ }$$

The linked paper is pretty good when compared to most Machine Learning papers, but the mathematics is still far too "hand-wavy". It's geared towards an audience who just want to code the gradient expression directly into their Python programs without bothering to check the math.
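For anyone who does want to check the math, the component-wise gradient above passes a finite-difference test; a quick numpy sketch (helper names are ad hoc, and since $B$ is quadratic in $A$ the central difference should match to roundoff):

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

def B_of(A, X, J):
    Y = A.T @ A @ X                      # Y = M X with M = A^T A
    return sym(J.T @ (X * Y) - X.T @ Y)

rng = np.random.default_rng(4)
m, n = 3, 5
X = rng.standard_normal((m, n))
A = rng.standard_normal((m, m))
J = np.ones((m, n))

i, j, h = 0, 2, 1e-4
E = np.zeros((m, m))
E[i, j] = 1.0                            # single-entry matrix E_ij

grad_formula = 2 * sym(J.T @ ((A @ X) * (E @ X)) - X.T @ A.T @ E @ X)
grad_numeric = (B_of(A + h * E, X, J) - B_of(A - h * E, X, J)) / (2 * h)
assert np.allclose(grad_formula, grad_numeric)
```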

Deriving their gradient result using matrix notation would be a herculean effort. Even converting their result into matrix notation would be very difficult, because the nested sum violates the usual summation convention since it contains an expression with four $i$-indices and four $j$-indices. Conventional index notation only permits two of any single index.

If you want someone to decode that mess for you then, as Steph suggested, you should post a new question $-$ I'll let him answer that one.
