[Math] Is it possible to define a Hessian Matrix for a Matrix-valued function

matrix-equations, matrix-calculus, nonlinear-optimization

So I'm doing a project on optimization (non-negative matrix factorization), which I know is not convex, based on this question:

Why is the non-negative matrix factorization problem non-convex?

However, this was addressed only for the scalar case, which is not my project's focus.

My question is: how am I supposed to define a gradient and a Hessian matrix in these more general cases? Is it possible? Is it still called a Hessian matrix, or is it some sort of tensor (of which I don't really know much)?

My function:

$$ f = \min_{W, H} \left\lVert X - WH \right\rVert_{F}^{2} $$

Which is equivalent to:

$$ \min_{W, H} f = {\rm tr}\lbrack (X - WH)(X - WH)^{T} \rbrack = {\rm tr}\lbrack (X - WH)^{T}(X - WH) \rbrack $$
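As a quick numerical sanity check of this equivalence, here is a minimal sketch (the use of NumPy and the small random dimensions are my own assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
W = rng.standard_normal((4, 3))
H = rng.standard_normal((3, 5))

R = X - W @ H                               # residual X - WH
frob = np.linalg.norm(R, 'fro') ** 2        # ||X - WH||_F^2
tr1 = np.trace(R @ R.T)                     # tr[(X - WH)(X - WH)^T]
tr2 = np.trace(R.T @ R)                     # tr[(X - WH)^T(X - WH)]
print(np.isclose(frob, tr1), np.isclose(frob, tr2))   # expect: True True
```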

I know how to calculate the partial derivatives of this function, and I actually have:

$$ \frac{\partial f}{\partial W} = \frac{\partial \,{\rm tr}\lbrack (X - WH)(X - WH)^{T} \rbrack}{\partial W} = -2XH^{T} + 2WHH^{T}$$

Which is the same result as equation (24) found in this document:

http://cal.cs.illinois.edu/~johannes/research/matrix%20calculus.pdf

Applying the same idea, I calculated the other partial derivative:

$$ \frac{\partial f}{\partial H} = \frac{\partial \,{\rm tr}\lbrack (X - WH)(X - WH)^{T} \rbrack}{\partial H} = -2W^{T}X + 2W^{T}WH$$
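Both formulas can be checked against central finite differences. A minimal sketch (again assuming NumPy, arbitrary small dimensions, and a hypothetical helper `fd_grad`):

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n = 4, 3, 5
X = rng.standard_normal((m, n))
W = rng.standard_normal((m, r))
H = rng.standard_normal((r, n))

def f(W, H):
    # objective ||X - WH||_F^2
    return np.linalg.norm(X - W @ H, 'fro') ** 2

# closed-form gradients from the two formulas above
gW = -2 * X @ H.T + 2 * W @ H @ H.T
gH = -2 * W.T @ X + 2 * W.T @ W @ H

def fd_grad(arg, eps=1e-6):
    # central finite-difference gradient w.r.t. W ('W') or H ('H')
    base = W if arg == 'W' else H
    out = np.zeros_like(base)
    for idx in np.ndindex(base.shape):
        E = np.zeros_like(base)
        E[idx] = eps
        if arg == 'W':
            out[idx] = (f(W + E, H) - f(W - E, H)) / (2 * eps)
        else:
            out[idx] = (f(W, H + E) - f(W, H - E)) / (2 * eps)
    return out

print(np.max(np.abs(gW - fd_grad('W'))))   # should be tiny (~1e-8)
print(np.max(np.abs(gH - fd_grad('H'))))   # should be tiny (~1e-8)
```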

If both of these are correct, is it possible to define a vector with entries that are matrices such that:

$$ \nabla f(W, H) = \left(\frac{\partial f}{\partial W} \;\; \frac{\partial f}{\partial H} \right)$$

And how would one compute the Hessian (if that even makes sense)?

Please do not mark this as a duplicate. The question
Defining the Hessian of a function that takes general matrices as an input has no answer explaining how to compute these; it only contains an example disproving the OP's approach, and an insult.

Best Answer

Define a new matrix $$Y=WH-X$$ and write the function in terms of this new variable: $$f = \|Y\|^2_F = Y:Y$$ where the colon denotes the trace/Frobenius product, i.e. $\,\,A:B={\rm tr}(A^TB)$.
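For the rearrangements in the next step, it helps to note that this product is symmetric, $A:B=B:A$, and that factors can be cycled across it just as under the trace: $$A:BC \;=\; B^TA:C \;=\; AC^T:B$$ which is what turns $2Y:(dW\,H+W\,dH)$ into $2YH^T:dW + 2W^TY:dH$ below.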

Find its differential and gradients. $$\eqalign{ df &= 2Y:dY \cr &= 2Y:(dW\,H+W\,dH) \cr &= 2YH^T:dW + 2W^TY:dH \cr &= 2(WH-X)H^T:dW + 2W^T(WH-X):dH \cr \frac{\partial f}{\partial W} &= 2(WH-X)H^T,\quad \frac{\partial f}{\partial H} = 2W^T(WH-X) \cr }$$ Since the gradients are themselves matrices, the Hessians will be 4th-order tensors, which cannot be represented in matrix notation.

One way to approach the Hessian is to use vectorization, which flattens matrices into vectors.
For example, $$\eqalign{ G &= \frac{\partial f}{\partial W} = 2WHH^T - 2XH^T \cr dG &= 2\,dW\,HH^T \cr {\rm vec}(dG) &= 2\,{\rm vec}(dW\,HH^T) \cr dg &= 2\,(HH^T\otimes I)\,dw \cr \nabla_{ww}f &= 2\,(HH^T\otimes I) \cr }$$ Working through the other Hessians yields $$\eqalign{ \nabla_{hh}f &= 2\,(I\otimes W^TW) \cr \nabla_{wh}f &= 2(H\otimes W) + 2(I\otimes Y)K \cr \nabla_{hw}f &= 2(H^T\otimes W^T) + 2(Y^T\otimes I)K \cr }$$ where $K$ is the Commutation Matrix, which vectorizes the transpose of a matrix: $${\rm vec}(A^T) = K\,{\rm vec}(A)$$
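These blocks can be verified numerically. The sketch below is my own illustration (assuming NumPy, column-stacking ${\rm vec}$, and hypothetical helpers `commutation` and `jac_fd`); it builds all four blocks and compares them with finite differences of the vectorized gradients. One detail worth making explicit: the two $K$'s above are different commutation matrices, since the one in $\nabla_{wh}f$ transposes an $H$-shaped ($r\times n$) matrix while the one in $\nabla_{hw}f$ transposes a $W$-shaped ($m\times r$) matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
m, r, n = 4, 3, 5
X = rng.standard_normal((m, n))
W = rng.standard_normal((m, r))
H = rng.standard_normal((r, n))
Y = W @ H - X

def vec(A):
    # column-stacking vectorization, matching vec() in the derivation
    return A.reshape(-1, order='F')

def commutation(p, q):
    # K such that vec(A.T) = K @ vec(A) for any p x q matrix A
    K = np.zeros((p * q, p * q))
    for i in range(p):
        for j in range(q):
            K[i * q + j, j * p + i] = 1.0
    return K

Im, Ir, In = np.eye(m), np.eye(r), np.eye(n)

# Hessian blocks from the answer; K(r, n) transposes H-shaped,
# K(m, r) transposes W-shaped matrices
Hww = 2 * np.kron(H @ H.T, Im)
Hhh = 2 * np.kron(In, W.T @ W)
Hwh = 2 * np.kron(H, W) + 2 * np.kron(Ir, Y) @ commutation(r, n)
Hhw = 2 * np.kron(H.T, W.T) + 2 * np.kron(Y.T, Ir) @ commutation(m, r)

def grad_W(Wv, Hv):
    return 2 * (Wv @ Hv - X) @ Hv.T

def grad_H(Wv, Hv):
    return 2 * Wv.T @ (Wv @ Hv - X)

def jac_fd(gfun, wrt, eps=1e-6):
    # finite-difference Jacobian of vec(gfun) w.r.t. vec(W) or vec(H)
    base = W if wrt == 'W' else H
    cols = []
    for k in range(base.size):
        E = np.zeros(base.size); E[k] = eps
        E = E.reshape(base.shape, order='F')
        if wrt == 'W':
            diff = gfun(W + E, H) - gfun(W - E, H)
        else:
            diff = gfun(W, H + E) - gfun(W, H - E)
        cols.append(vec(diff) / (2 * eps))
    return np.stack(cols, axis=1)

for name, block, g, wrt in [('ww', Hww, grad_W, 'W'), ('wh', Hwh, grad_W, 'H'),
                            ('hw', Hhw, grad_H, 'W'), ('hh', Hhh, grad_H, 'H')]:
    # each difference should be ~1e-7 or smaller
    print(name, np.max(np.abs(block - jac_fd(g, wrt))))
```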