Suppose $\Lambda = {\rm diag}(\lambda_1,\cdots,\lambda_n)$, $\Sigma = {\rm diag}(\sigma_1,\cdots,\sigma_n)$, and we have $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$. I want to prove that
$$\max_{OO^{\rm T}=I}\, {\rm Tr}(O^{\rm T}\Lambda O \Sigma) = \sum_{i=1}^n \lambda_i \sigma_i,$$
where the maximum is taken over all real orthogonal matrices $O$, i.e. $OO^{\rm T} = I$. One approach is to maximize the function
$$f(o_{ij}) = \sum_{i,j} \lambda_i (o_{ij})^2 \sigma_j$$
under the constraint $\sum_{k} o_{ik}o_{jk} = \delta_{ij}$ by the Lagrange multiplier method. I can prove the statement this way, but the proof is lengthy and lacks the flavor of linear algebra. Since the statement "must be true" intuitively, I would like to know whether there is an elegant way to prove it. Mathematical induction might be a possibility, but I failed to make it work. Any help is appreciated.
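For what it's worth, the claim is easy to test numerically; below is a minimal NumPy sketch (the QR-based sampling and variable names are my own, not part of any proof): random orthogonal matrices never beat $\sum_i \lambda_i\sigma_i$, which $O=I$ attains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
lam = np.sort(rng.random(n))[::-1]   # lambda_1 >= ... >= lambda_n >= 0
sig = np.sort(rng.random(n))[::-1]   # sigma_1 >= ... >= sigma_n >= 0
L, S = np.diag(lam), np.diag(sig)

best = -np.inf
for _ in range(20000):
    # sample a random orthogonal O via QR of a Gaussian matrix
    O, _ = np.linalg.qr(rng.standard_normal((n, n)))
    best = max(best, np.trace(O.T @ L @ O @ S))

print(best, "<=", np.dot(lam, sig))  # sampled values never exceed the bound
print(np.trace(L @ S))               # O = I attains sum_i lambda_i sigma_i
```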
Linear Algebra – Maximizing the Trace in an Elegant Way
linear-algebra, matrices, optimization, trace
Related Solutions
Define the constrained matrix variable $X$ in terms of an unconstrained matrix $Y$ as follows: $$X = \frac{\alpha Y}{{\rm Tr}(Y)} \implies {\rm Tr}(X) = \alpha$$ The following variables will also be useful: $$B = I+X\Sigma,\quad W^T=\Sigma B^{-1},\quad \tau={\rm Tr}(Y) = I:Y$$ Find the gradient of the function with respect to $Y$: $$\eqalign{ \phi &= \log\det(B) \\ d\phi &= B^{-T}:dB \\ &= B^{-T}:dX\,\Sigma \\ &= W:dX \\ &= W:\Big({\alpha\tau^{-1}\,dY-\alpha\tau^{-2}Y(I:dY)}\Big) \\ &= \Big(\alpha\tau^{-1}W-(W:Y)\alpha\tau^{-2}I\Big):dY \\ \frac{\partial\phi}{\partial Y} &= \alpha\tau^{-1}W-(W:Y)\alpha\tau^{-2}I \\ }$$ Set the gradient to zero and solve, as usual for an unconstrained problem: $$\eqalign{ W &= (W:Y)\tau^{-1}I \;=\; \lambda I \\ \lambda I &= W = W^T = \Sigma B^{-1} = \Sigma(I+X\Sigma)^{-1} \\ \Sigma &= \lambda(I+X\Sigma) \\ X &= (\Sigma - \lambda I)(\lambda\Sigma)^{+} \;=\; \lambda^{-1}\Sigma\Sigma^{+}-\Sigma^{+} \\ }$$ where $\Sigma^+$ is the pseudoinverse of $\Sigma$.
All that remains is to find the value of $\lambda$, which can be calculated by enforcing the original trace constraint. $$\eqalign{ \alpha &= {\rm Tr}(X) \\ &= \lambda^{-1}{\rm Tr}(\Sigma\Sigma^{+}) - {\rm Tr}(\Sigma^{+}) \\ &= \lambda^{-1}{\rm rank}(\Sigma) - {\rm Tr}(\Sigma^{+}) \\ \lambda &= \frac{{\rm rank}(\Sigma)}{\alpha+{\rm Tr}(\Sigma^{+})} }$$ In the above derivation, the symmetry of $\Sigma$ was utilized, but the fact that it is diagonal was not needed. However, since $(\Sigma,\Sigma^+)$ are diagonal, $X$ is as well.
Finally, enforce non-negativity by taking $$X = (\lambda\Sigma)^{+}\max(0, \,\Sigma - \lambda I)$$
NB: In several steps, a colon is used to denote the trace/Frobenius product, i.e. $$A:B = {\rm Tr}(A^TB)$$
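As a sanity check on this closed form, here is a short NumPy sketch (my own construction, using an invertible $\Sigma$ so that $\Sigma^+=\Sigma^{-1}$): it verifies the trace constraint and the stationarity condition $W=\lambda I$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 4, 2.0
sig = np.sort(rng.random(n) + 0.1)[::-1]       # positive entries: Sigma^+ = Sigma^{-1}
Sigma = np.diag(sig)
Sp = np.linalg.pinv(Sigma)                     # pseudoinverse Sigma^+

# lambda = rank(Sigma) / (alpha + Tr(Sigma^+)) and X = lambda^{-1} Sigma Sigma^+ - Sigma^+
lam = np.linalg.matrix_rank(Sigma) / (alpha + np.trace(Sp))
X = Sigma @ Sp / lam - Sp

print(np.isclose(np.trace(X), alpha))                   # True: the trace constraint holds
W = (Sigma @ np.linalg.inv(np.eye(n) + X @ Sigma)).T    # W^T = Sigma B^{-1}
print(np.allclose(W, lam * np.eye(n)))                  # True: the gradient vanishes, W = lambda I
```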
(For context: the problem here is to determine which orthogonal matrices $Q$ make $Q^T\Sigma$ symmetric with the same eigenvalues as $\Sigma$; the claim is that $Q=I$.) I assume $\sigma_1\neq 0$, which means that $\Sigma$ is invertible. Let $A=Q\Sigma^{-1}.$ Then $Q=A\Sigma$ and $Q^T=\Sigma A^T.$
We want $Q^T \Sigma$ to be symmetric, which means $Q^T\Sigma = \Sigma Q$ or $\Sigma A^T \Sigma = \Sigma A\Sigma.$ Multiplying by $\Sigma^{-1}$ from both sides gives $A=A^T,$ so $A$ is symmetric.
$Q$ is orthogonal, which means $Q^TQ=I$ or (using $A=A^T$) $$ \Sigma A^2 \Sigma = I $$ or $$ A^2 = \Sigma^{-2}. $$ So we are looking for a square root of $\Sigma^{-2},$ and the problem boils down to the question of whether $\Sigma^{-1}$ is the only valid choice.
We must also consider the case where $\Sigma$ has eigenvalues with multiplicity greater than $1.$
Let $\sigma_{r_i} = \sigma_{r_i+1} = \ldots = \sigma_{r_{i+1}-1}$ for $i=1,\ldots,m,$ where $1 = r_1 < r_2 < \ldots < r_{m+1} = n+1.$ Furthermore, $\sigma_{r_i}<\sigma_{r_{i+1}}$ for $i=1,\ldots,m-1.$ Then each square root of $\Sigma^{-2}$ can be written as follows: $$ A = \begin{pmatrix} \sigma_{r_1}^{-1} B_1 & & & & 0 \\ & \sigma_{r_2}^{-1} B_2 & & & \\ & & \sigma_{r_3}^{-1} B_3 & & \\ & & & \ddots & \\ 0 & & & & \sigma_{r_m}^{-1} B_m \end{pmatrix} \;\;,\;\; B_i^2 = I\;\;\mbox{for}\;\; i=1,\ldots,m $$ where the $B_i$ are blocks of size $(r_{i+1}-r_i)\times (r_{i+1}-r_i).$ (The proof is given below.)
Then $$ Q = \begin{pmatrix} B_1 & & & & 0 \\ & B_2 & & & \\ & & B_3 & & \\ & & & \ddots & \\ 0 & & & & B_m \end{pmatrix} $$ The $B_i$ are symmetric: $B_i^T$ is the inverse of $B_i$ by the orthogonality of $Q,$ and $B_i$ is its own inverse because $B_i^2=I,$ so $B_i^T=B_i.$ Therefore $$ Q^T\Sigma = \begin{pmatrix} \sigma_{r_1}B_1 & & & & 0 \\ & \sigma_{r_2}B_2 & & & \\ & & \sigma_{r_3}B_3 & & \\ & & & \ddots & \\ 0 & & & & \sigma_{r_m}B_m \end{pmatrix} $$ We want $Q^T\Sigma$ to have the same eigenvalues as $\Sigma,$ which in turn means that $\sigma_{r_i}B_i$ has $\sigma_{r_i}$ as its only eigenvalue. A symmetric matrix with only one eigenvalue must be a scalar multiple of the identity matrix. Therefore, $B_i = I$ for $i=1,\ldots,m,$ which completes the proof.
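To see why the eigenvalue condition is essential, here is a small NumPy example (the numbers are my own choice): a symmetric reflection $B$ with $B^2=I$ on the repeated block produces a perfectly valid square root of $\Sigma^{-2}$ and an orthogonal $Q$ with $Q^T\Sigma$ symmetric, yet $Q^T\Sigma$ acquires $-\sigma_{r_1}$ as an eigenvalue and is therefore ruled out.

```python
import numpy as np

# Sigma with a repeated eigenvalue: sigma_1 = sigma_2 = 2, sigma_3 = 5
Sigma = np.diag([2.0, 2.0, 5.0])

# a symmetric involution B (B @ B = I) on the repeated block, here a reflection
theta = 0.7
B = np.array([[np.cos(theta), np.sin(theta)],
              [np.sin(theta), -np.cos(theta)]])
Q = np.block([[B, np.zeros((2, 1))],
              [np.zeros((1, 2)), np.eye(1)]])

A = Q @ np.linalg.inv(Sigma)
Si = np.linalg.inv(Sigma)
print(np.allclose(A @ A, Si @ Si))                 # True: A^2 = Sigma^{-2}
print(np.allclose(Q.T @ Q, np.eye(3)))             # True: Q is orthogonal
print(np.allclose(Q.T @ Sigma, (Q.T @ Sigma).T))   # True: Q^T Sigma is symmetric
print(np.linalg.eigvalsh(Q.T @ Sigma))             # [-2.  2.  5.]: not the spectrum {2, 2, 5}
```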
Proof sketch for $\sigma_1=0$
If $\sigma_1=0,$ it can easily be shown that $Q_{11}\in\{-1,1\}$ and $Q_{1j}=Q_{j1}=0$ for $j=2,\ldots,n.$ This can be concluded from the symmetry of $Q^T\Sigma$ and from the orthogonality of $Q.$
This means that we can follow the argument from the first part of the proof, but consider only the subspace that is orthogonal to $e_1.$ Basically, this means that we ignore the first row and first column of all $n\times n$ matrices. In the end, we have to decide if $Q_{11}=1$ or $Q_{11}=-1.$ As $Q\in \mathrm{SO}(n)$ and $B_i=I$ for $i=2,\ldots,m,$ we can conclude $Q_{11}=1.$
Diagonalizable square roots of diagonal matrices
Let $A$ be diagonalizable and $A^2$ diagonal. Without loss of generality, the diagonal elements of $A^2$ are sorted in ascending order. Let $0\leq\lambda_1 < \lambda_2 < \ldots < \lambda_m$ such that the eigenvalues of $A$ form a (possibly proper) subset of $\{\lambda_1,\;-\lambda_1,\;\lambda_2,\;-\lambda_2,\;\ldots,\;\lambda_m,\;-\lambda_m\}.$ Let $t_i^{+}$ be the algebraic and geometric multiplicity of $\lambda_i$ and $t_i^{-}$ the algebraic and geometric multiplicity of $-\lambda_i$ within the matrix $A$ (we set $t_1^{-}=0$ if $\lambda_1=0$). Let $r_1=1$ and $r_{i+1} = r_i + t_i^{+}+ t_i^{-}.$
If $Av = \lambda v$ and $Aw = -\lambda w,$ then $A^2 (v+w) = A^2 v + A^2 w =\lambda^2 v + (-\lambda)^2 w = \lambda^2 (v+w).$ This means that the eigenspace of $A^2$ with respect to the eigenvalue $\lambda^2$ is the direct sum of the eigenspaces of $A$ with respect to the eigenvalues $\lambda$ and $-\lambda.$
As $A$ is diagonalizable, the direct sum of the eigenspaces $E_{A,\lambda_1},$ $E_{A,-\lambda_1}$, $E_{A,\lambda_2},$ $E_{A,-\lambda_2},\ldots$, $E_{A,\lambda_m},$ $E_{A,-\lambda_m}$, forms the complete vector space $\mathbb{R}^n.$ This means that each of the eigenspaces of $A^2$ can be written as $E_{A,\lambda_i} \oplus E_{A,-\lambda_i}.$ In a manner of speaking, there is no room for other eigenspaces than those.
We know the eigenspaces of $A^2,$ because $A^2$ is diagonal. We have \begin{eqnarray*} E_{A^2,\lambda_1^2} & = & E_{A,\lambda_1} \oplus E_{A,-\lambda_1} = \mathrm{span}\{e_{r_1},\ldots,e_{r_2-1}\} \\ & \vdots & \\ E_{A^2,\lambda_m^2} & = & E_{A,\lambda_m} \oplus E_{A,-\lambda_m} = \mathrm{span}\{e_{r_m},\ldots,e_{r_{m+1}-1}\} \end{eqnarray*} with the standard basis $e_1,\ldots,e_n.$ Now it is clear that $A$ can be diagonalized by means of a block matrix, because each $E_{A,\lambda_i} \oplus E_{A,-\lambda_i}$ is spanned by the related elements of the standard basis. $$ A= \begin{pmatrix} S_1 & & 0 \\ & \ddots & \\ 0 & & S_m \end{pmatrix} \begin{pmatrix} \lambda_1 I_{t_1^{+}} & & & & 0 \\ & -\lambda_1 I_{t_1^{-}} & & & \\ & & \ddots & & \\ & & & \lambda_m I_{t_m^{+}} & \\ 0 & & & & -\lambda_m I_{t_m^{-}} \end{pmatrix} \begin{pmatrix} S_1 & & 0 \\ & \ddots & \\ 0 & & S_m \end{pmatrix} ^{-1} $$ From this, by carrying out the block matrix multiplication, we can conclude that $A$ itself is also a block matrix of the same sort, i.e. $$ A = \begin{pmatrix} A_1 & & 0 \\ & \ddots & \\ 0 & & A_m \end{pmatrix} $$ with $$ A_i = S_i\,\begin{pmatrix} \lambda_i I_{t_i^{+}} & \\ & -\lambda_i I_{t_i^{-}} \\ \end{pmatrix} \, S_i^{-1} $$ Now we only have to show that $A_i = \lambda_i B_i$ with $B_i^2=I.$
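Before continuing, the block structure can be illustrated numerically (a toy example of my own, with $\lambda_1=3$, $\lambda_2=7$): an $S$ that mixes only within the $\lambda_1^2$-eigenspace yields a diagonal $A^2$ and a block-diagonal $A$, while a generic $S$ mixing across blocks destroys the diagonality of $A^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
D = np.diag([3.0, -3.0, 7.0])        # eigenvalues +-lambda_1 = +-3 and lambda_2 = 7

# S mixes only within the lambda_1^2 = 9 eigenspace: A^2 diagonal, A block diagonal
S1 = np.block([[rng.standard_normal((2, 2)), np.zeros((2, 1))],
               [np.zeros((1, 2)), np.eye(1)]])
A1 = S1 @ D @ np.linalg.inv(S1)
print(np.allclose(A1 @ A1, np.diag([9.0, 9.0, 49.0])))           # True
print(np.allclose(A1[:2, 2], 0) and np.allclose(A1[2, :2], 0))   # True: block structure

# a generic S mixes the blocks: A^2 is no longer diagonal
S2 = rng.standard_normal((3, 3))
A2 = S2 @ D @ np.linalg.inv(S2)
print(np.allclose(A2 @ A2, np.diag(np.diag(A2 @ A2))))           # False
```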
Let $T_i=S_i^{-1}$.
Let $S_i^{+}$ be the $(t_i^{+}+t_i^{-})\times t_i^{+}$ matrix that is formed by the first $t_i^{+}$ columns of $S_i$ and $S_i^{-}$ the $(t_i^{+}+t_i^{-})\times t_i^{-}$ matrix that is formed by the last $t_i^{-}$ columns of $S_i.$ Let $T_i^{+}$ be the $t_i^{+}\times (t_i^{+}+t_i^{-})$ matrix that is formed by the first $t_i^{+}$ rows of $T_i$ and $T_i^{-}$ the $t_i^{-}\times (t_i^{+}+t_i^{-})$ matrix that is formed by the last $t_i^{-}$ rows of $T_i.$
Then $T_i^{+}S_i^{+}=I,\;\;T_i^{-}S_i^{-}=I,\;\;T_i^{+}S_i^{-}=0,\;\;T_i^{-}S_i^{+}=0$. $$ A_i = S_i^{+}\lambda_i T_i^{+} + S_i^{-}(-\lambda_i) T_i^{-} = \lambda_i \left( S_i^{+}T_i^{+} - S_i^{-}T_i^{-}\right) $$ Let $B_i = S_i^{+}T_i^{+} - S_i^{-}T_i^{-}.$ Then \begin{eqnarray*} B_i^2 & = & \left( S_i^{+}T_i^{+} - S_i^{-}T_i^{-}\right)\left( S_i^{+}T_i^{+} - S_i^{-}T_i^{-}\right) \\ & =& S_i^{+}T_i^{+}S_i^{+}T_i^{+}-S_i^{+}T_i^{+}S_i^{-}T_i^{-}-S_i^{-}T_i^{-}S_i^{+}T_i^{+}+S_i^{-}T_i^{-}S_i^{-}T_i^{-} \\ & =& S_i^{+}\cdot I\cdot T_i^{+}-S_i^{+}\cdot 0 \cdot T_i^{-}-S_i^{-}\cdot 0 \cdot T_i^{+}+S_i^{-}\cdot I\cdot T_i^{-} \\ & =& S_i^{+}T_i^{+}+S_i^{-}T_i^{-} \\ & =& \begin{pmatrix} & & \\ S_i^{+} & & S_i^{-} \\ & & \end{pmatrix} \begin{pmatrix} & T_i^{+} & \\ & & \\ & T_i^{-} & \end{pmatrix} =S_iT_i = I \end{eqnarray*}
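A quick numerical confirmation of this block computation (the sizes $t_i^{\pm}$ and the random $S_i$ below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
tp, tm = 2, 3                       # t_i^+ and t_i^- (sizes chosen arbitrarily)
S = rng.standard_normal((tp + tm, tp + tm))
T = np.linalg.inv(S)                # T_i = S_i^{-1}

Sp, Sm = S[:, :tp], S[:, tp:]       # first t^+ columns, last t^- columns of S_i
Tp, Tm = T[:tp, :], T[tp:, :]       # first t^+ rows, last t^- rows of T_i

B = Sp @ Tp - Sm @ Tm               # B_i = S_i^+ T_i^+ - S_i^- T_i^-
print(np.allclose(B @ B, np.eye(tp + tm)))   # True: B^2 = I, as the computation shows
```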
Best Answer
(Rewritten from the last part of my answer to "Trace minimization with constraints".)
Since the objective function is continuous and the set of all real orthogonal matrices is compact, the maximum value is continuous in the entries of $\Sigma$. Therefore, by passing to the limit, we may assume without loss of generality that $\Sigma$ has distinct diagonal entries.
The matrix exponential of every skew-symmetric matrix is a real orthogonal matrix of determinant one. If $O$ is a global maximiser in your problem, its objective function value must be greater than or equal to the objective function value evaluated at $Oe^{tK}$ (which is real orthogonal) for any real number $t$ and any skew-symmetric $K$. So, if we define $f(t)=\operatorname{tr}(e^{-tK}O^T\Lambda Oe^{tK}\Sigma)$, we must have $f'(0)=0$. Using the cyclic property $\operatorname{tr}(XY)=\operatorname{tr}(YX)$, we get $$ 0=f'(0)=\operatorname{tr}\left((\Sigma O^T\Lambda O-O^T\Lambda O\Sigma)\, K\right). $$ Putting $K^T=\Sigma O^T\Lambda O-O^T\Lambda O\Sigma$ (which is indeed skew-symmetric), we get $\operatorname{tr}(K^TK)=0$. Hence $K=0$, meaning that $O^T\Lambda O$ commutes with $\Sigma$. Any matrix that commutes with a diagonal matrix with distinct diagonal entries must have all off-diagonal entries equal to zero. Therefore $O^T\Lambda O$ is a diagonal matrix.
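This stationarity argument is easy to check numerically; the sketch below (random data of my own choosing) verifies $f'(0)\approx 0$ at the candidate maximiser $O=I$ by central differences, and that $O^T\Lambda O$ then commutes with $\Sigma$.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
n = 5
L = np.diag(np.sort(rng.random(n))[::-1])   # Lambda, descending
S = np.diag(np.sort(rng.random(n))[::-1])   # Sigma, distinct entries almost surely
O = np.eye(n)                               # candidate maximiser

M = rng.standard_normal((n, n))
K = M - M.T                                 # skew-symmetric direction

def f(t):
    E = expm(t * K)                         # e^{tK} is orthogonal, E.T = e^{-tK}
    return np.trace(E.T @ O.T @ L @ O @ E @ S)

h = 1e-6
print((f(h) - f(-h)) / (2 * h))             # ~ 0: first-order condition at O = I
print(np.allclose(S @ O.T @ L @ O, O.T @ L @ O @ S))   # True: commutation with Sigma
```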
Hence $\Lambda_1:=O^T\Lambda O$ is a diagonal permutation of $\Lambda$, because $\Lambda_1$ and $\Lambda$ are diagonal matrices with the same eigenvalues. Now the problem boils down to finding the diagonal permutation $\Lambda_1$ of $\Lambda$ that maximises $\operatorname{tr}(\Lambda_1\Sigma)$. By the rearrangement inequality (both $(\lambda_i)$ and $(\sigma_i)$ are sorted in descending order), the maximum occurs when $\Lambda_1=\Lambda$, which is attained at $O=I$.
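For completeness, a brute-force check over all permutations (toy numbers of my own choosing) illustrates the rearrangement step: the identity permutation wins when both sequences are sorted in descending order.

```python
import numpy as np
from itertools import permutations

lam = np.array([5.0, 3.0, 2.0, 1.0])   # descending
sig = np.array([4.0, 2.5, 1.5, 0.5])   # descending

vals = {p: float(np.dot(lam[list(p)], sig)) for p in permutations(range(4))}
best = max(vals, key=vals.get)
print(best, vals[best])                 # (0, 1, 2, 3): the identity permutation wins
print(np.dot(lam, sig))                 # equals sum_i lambda_i sigma_i
```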