Solved – Optimization with orthogonal constraints

Tags: optimization, orthogonal

I am working on computer vision, and have to optimize an objective function involving a matrix $X$, where $X$ is an orthogonal matrix.

$$\text{maximize} \quad f(X)$$

$$\text{s.t.} \quad X^T X = I$$
where $I$ is the identity matrix. I am reading some papers, and they use complicated terminology like optimization over the Grassmann manifold or the Stiefel manifold. Basically, the gradient on those manifolds is different from the regular gradient over Euclidean space.

Can you suggest an easy paper to read so that I can implement a gradient-based method on those manifolds? Any simple Matlab examples illustrating the material would also be helpful.

Thanks

Best Answer

I found the following paper to be useful:

Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 303-353.

This paper has far more information than you may need, in terms of differential geometry and higher-order optimization methods.

However, the information needed to answer your question is actually quite straightforward, and is in fact contained mostly in equations 2.53 and 2.70, which are both of the form $$\nabla{f}=G-X\Phi$$ where the nominal gradient $$G=\frac{\partial{f}}{\partial{X}}$$ is corrected to the constrained gradient $\nabla{f}$ by subtracting off its projection $X\Phi$ onto the current solution $X$. This subtracted component is normal to the manifold (similar to the centripetal term in circular motion) and ensures the corrected gradient is tangent to the manifold.

Note: These formulas assume you are already on the manifold, i.e. $X^TX=I$. So in practice you will need to make sure your initial condition is appropriate (e.g. $X_0=I$, possibly rectangular). You may also need to occasionally correct accumulated roundoff and/or truncation error (e.g. via an SVD; see the "ZCA" example below).
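As a concrete illustration of that correction step (not from the paper), here is a minimal Python/NumPy sketch; the function name is my own, and the same idea is a one-liner in Matlab via `svd`. The nearest matrix with orthonormal columns is obtained by snapping the singular values of $X$ to one:

```python
import numpy as np

def reorthogonalize(X):
    """Project X back onto the manifold X^T X = I by replacing its
    singular values with ones (nearest orthonormal matrix in Frobenius norm)."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

# example: repair a slightly perturbed orthonormal initial condition
rng = np.random.default_rng(0)
X = np.eye(5, 3) + 1e-3 * rng.standard_normal((5, 3))
X = reorthogonalize(X)
print(np.allclose(X.T @ X, np.eye(3)))  # True
```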

In the unconstrained case, $\Phi=0$, while in the constrained case $\Phi$ takes two forms: $$\Phi_{\mathrm{\small{G}}}=X^TG\implies\nabla_{\mathrm{\small{G}}}{f}=\left(I-XX^T\right)G$$ which corresponds to the "Grassmann manifold". The distinction here is that $\nabla_{\mathrm{\small{G}}}$ is insensitive to rotations, since for a rotation $Q^TQ=I$ and $X=X_0Q$, we have $XX^T=X_0X_0^T$.

The second form is $$\Phi_{\mathrm{\small{S}}}=G^TX\implies\nabla_{\mathrm{\small{S}}}{f}=G-XG^TX$$ which corresponds to the "Stiefel manifold", and is sensitive to rotations.
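If it helps to see these as code, here is a small NumPy sketch of the two projected gradients (function names are mine, not the paper's); it also checks the Stiefel tangency condition $X^T\Delta+\Delta^TX=0$ at a random point on the manifold:

```python
import numpy as np

def grad_grassmann(X, G):
    """Grassmann-tangent gradient: (I - X X^T) G."""
    return G - X @ (X.T @ G)

def grad_stiefel(X, G):
    """Stiefel-tangent gradient: G - X G^T X."""
    return G - X @ (G.T @ X)

# quick check at a random point with X^T X = I
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # orthonormal columns
G = rng.standard_normal((6, 3))
D = grad_stiefel(X, G)
print(np.allclose(X.T @ D + D.T @ X, 0))           # True: D is tangent to the Stiefel manifold
```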

A simple example is the approximation of a given matrix $A\in\mathbb{R}^{n\times{p}}$ with $p\leq{n}$ by an orthogonal matrix $X$, minimizing the least-squares error. In this case we have $$f[X]=\tfrac{1}{2}\|X-A\|_F^2=\tfrac{1}{2}\sum_{ij}\left(X_{ij}-A_{ij}\right)^2\implies{G}=X-A$$ The unconstrained case $\nabla{f}=G=0$ has solution $X=A$, because we are not concerned with ensuring $X$ is orthogonal.
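In code, the objective and its nominal (Euclidean) gradient for this example are just (a sketch; the function names are my own):

```python
import numpy as np

def f(X, A):
    """Least-squares misfit 0.5 * ||X - A||_F^2."""
    return 0.5 * np.sum((X - A) ** 2)

def euclidean_grad(X, A):
    """Nominal gradient G = df/dX = X - A."""
    return X - A
```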

For the Grassmann case we have $$\nabla_{\mathrm{\small{G}}}{f}=\left(XX^T-I\right)A=0$$ This can only have a solution if $A$ is square rather than "skinny", because if $p<n$ then $XX^T$ has a nontrivial null space.

For the Stiefel case, we have $$\nabla_{\mathrm{\small{S}}}{f}=XA^TX-A=0$$ which can be solved even when $p<n$.

These two cases, Grassmann vs. Stiefel, essentially correspond to the difference between "PCA vs. ZCA whitening". In terms of the SVD, if the input matrix is $A=USV^T$, then the solutions are $X_{\mathrm{\small{G}}}=U$ and $X_{\mathrm{\small{S}}}=UV^T$. The PCA solution $X_{\mathrm{\small{G}}}$ only applies to a square input, i.e. $A$ must be the "covariance matrix". However the ZCA solution $X_{\mathrm{\small{S}}}$ can be used when $A$ is a "data matrix". (This is more properly known as the Orthogonal Procrustes problem.)
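Putting the pieces together, here is a self-contained sketch (step size and iteration count chosen by hand, not tuned) of projected gradient descent on the Stiefel manifold for this example, with an SVD retraction after each step; it checks the stationarity condition $XA^TX=A$ and compares the result with the closed-form Procrustes solution $X_{\mathrm{\small{S}}}=UV^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
A = rng.standard_normal((n, p))            # "data matrix", skinny (p < n)

def reorthogonalize(X):
    """Retract onto X^T X = I via the SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

# projected gradient descent with G = X - A (Stiefel projection)
X = np.eye(n, p)                           # initial condition on the manifold
step = 0.1
for _ in range(500):
    G = X - A                              # nominal (Euclidean) gradient
    D = G - X @ (G.T @ X)                  # Stiefel-tangent gradient
    X = reorthogonalize(X - step * D)      # descend, then retract onto the manifold

# closed-form solution of the orthogonal Procrustes problem
U, _, Vt = np.linalg.svd(A, full_matrices=False)
X_closed = U @ Vt

print(np.allclose(X @ A.T @ X, A, atol=1e-5))    # stationarity condition X A^T X = A
print(np.allclose(X, X_closed, atol=1e-5))       # typically True for a generic A
```

A Grassmann version would use `D = G - X @ (X.T @ G)` in the same loop.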
