[Math] Derivation of normal equations for minimization of Frobenius norm least squares error

least-squares, matrix-equations, matrix-calculus, regression

I'm having a hard time understanding the most efficient sequence of steps for deriving the normal equations for Frobenius norm least squares minimization. Here I want to minimize the norm of a matrix directly, rather than arguing I can do it row-by-row, because I want to improve my facility with matrix calculations.

"We observe pairs $(x_i,y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^n$. We’ll let $i$ range from $1$ to $L$. Let $X = [x_1,\ldots ,x_L]$ denote the matrix whose columns are the examples $x_i$. Similarly, let $Y = [y_1,\ldots , y_L]$. We will try to fit a function of the form
$$f(y) = Wy + v$$
where $W \in \mathbb{R}^{d\times n}$ and $v \in \mathbb{R}^d$. Let $\mathbf{1}_L$ denote the vector of length $L$ with ones in all entries."

So we want to minimize $$ \|WY + v \mathbf{1}^T_L - X\|_F^2 \, .$$

My confusion is that the correct normal equations seem to be

\begin{align}
& (WY + v \mathbf{1}^T_L - X) \mathbf{1}_L = 0 \\
& (WY + v \mathbf{1}^T_L - X) Y^T = 0
\end{align}
but I do not obtain this directly by differentiation, setting the appropriate Jacobians to zero. At least not when I think I'm being careful. Rather, I obtain it through a way-too-complicated sequence of steps. Can someone clarify where I'm going wrong? No need to read all of the following (carefully) if you just know the answer. Maybe there's an obvious choice of partials one should take, in an easily generalizable manner?
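
As a sanity check (not part of the derivation), here is a quick NumPy sketch, with arbitrary dimensions and random data, confirming that these two equations do hold at the least-squares optimum:

```python
import numpy as np

# Arbitrary dimensions and random data, purely for illustration.
rng = np.random.default_rng(0)
d, n, L = 3, 4, 50
X = rng.standard_normal((d, L))   # columns are the x_i in R^d
Y = rng.standard_normal((n, L))   # columns are the y_i in R^n

# Fit min_{W,v} ||W Y + v 1^T - X||_F^2 by stacking [W v] against [Y; 1^T].
Y1 = np.vstack([Y, np.ones((1, L))])                # (n+1) x L
Wv = np.linalg.lstsq(Y1.T, X.T, rcond=None)[0].T    # d x (n+1)
W, v = Wv[:, :n], Wv[:, n]

R = W @ Y + np.outer(v, np.ones(L)) - X             # residual WY + v 1^T - X
print(np.allclose(R @ np.ones(L), 0))               # (WY + v 1^T - X) 1 = 0
print(np.allclose(R @ Y.T, 0))                      # (WY + v 1^T - X) Y^T = 0
```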

First, to optimize w.r.t. $v$, I employ the chain rule for Jacobians. We note that $\| A(v) \|_F^2 = \sum_{ij} A(v)_{ij}^2$, and by the chain rule,
\begin{align*}
J_v \bigl[ \| WY + v \mathbf{1}^T_L - X \|_F^2 \bigr] =& \ J_{(WY + v \mathbf{1}^T_L - X)}\bigl( \|\cdot\|_F^2\bigr) J_v\bigl(WY + v \mathbf{1}^T_L - X\bigr)\ \dot{=}\ 0 \in \mathbb{R}^{1 \times d}
\end{align*} must be satisfied. (I will sometimes refer to $WY + v \mathbf{1}^T_L - X$ as `$A$'.)
By $J_{(WY + v \mathbf{1}^T_L - X)}\bigl( \|\cdot\|_F^2\bigr)$ I mean the Jacobian of the squared Frobenius norm, evaluated at $WY + v \mathbf{1}^T_L - X$. This would appear to be a $1 \times (d \times L)$ tensor, since we're looking at the derivative of the function $\| \cdot \|_F^2: \mathbb{R}^{d\times L} \to \mathbb{R}$.

This first Jacobian $J \| A\|_F^2 $ has elements $J_{1,i,j} = \frac{\partial}{\partial A_{ij}} \sum_{kl} A_{kl}^2 = 2 A_{ij} = 2(WY + v \mathbf{1}^T_L - X)_{ij}$.
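
A tiny finite-difference check of this entrywise formula, with arbitrary data:

```python
import numpy as np

# Finite-difference check of d/dA_ij ||A||_F^2 = 2 A_ij, with arbitrary data.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
i, j, h = 1, 2, 1e-6

E = np.zeros_like(A)
E[i, j] = h
fd = (np.sum((A + E) ** 2) - np.sum((A - E) ** 2)) / (2 * h)  # central difference
print(fd, 2 * A[i, j])   # the two numbers agree
```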

Next we need the Jacobian $J_v\bigl(WY + v \mathbf{1}^T_L - X\bigr) = J_v (v \mathbf{1}^T_L) $. At first I had written that $\frac{\partial}{\partial v}\, v \mathbf{1}^T_L = \mathbf{1}^T_L$. This gives the correct normal equation:
$$(WY + v \mathbf{1}^T_L - X) \mathbf{1}_L = 0 \, . $$

However, $\mathbf{1}^T_L$ is not the Jacobian of $v \mathbf{1}^T_L$, since the latter is a rank-one $d \times L$ matrix. Probably the answer lies in working out what the previous expression really is the derivative of.

Instead, this Jacobian $ J_v (v \mathbf{1}^T_L)$ is a $(d\times L) \times (d)$ tensor, since it's the matrix of the derivative of a function $\mathbb{R}^d \to \mathbb{R}^{d \times L}$. Since $v \mathbf{1}^T_L$ is linear in $v$, the $i$th partial derivative can be calculated by replacing $v$ with $e_i$, getting that the $d\times L$ matrix we denote as $J_v(v \mathbf{1}^T)_{:,:,i}$ is $e_i \mathbf{1}^T_L$.

Then, when we multiply these tensors, we'll get a $1 \times d$ Jacobian, the transpose of the gradient w.r.t. $v$, and set that equal to $0$. How do we write that multiplication? Summing over the matching $(d \times L)$ indices, the chain rule gives the composite derivative entries $\frac{1}{2}D_{1,j} = \sum_{k,l} A_{kl} (e_j \mathbf{1}^T_L)_{kl} = a_j^T \mathbf{1}$, the $j$th row sum of $A = WY + v \mathbf{1}^T_L - X$, which must vanish.
Since that holds for every $j$, the system satisfies

\begin{align}
(WY + v \mathbf{1}^T_L - X) \mathbf{1} = 0
\end{align}

which is the formula in my professor's notes.
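
A quick numerical check of the resulting gradient formula at an arbitrary (non-optimal) point, again with made-up data:

```python
import numpy as np

# Check that grad_v ||WY + v 1^T - X||_F^2 = 2 (WY + v 1^T - X) 1
# at an arbitrary point, using central differences.
rng = np.random.default_rng(2)
d, n, L = 3, 4, 10
X, Y = rng.standard_normal((d, L)), rng.standard_normal((n, L))
W, v = rng.standard_normal((d, n)), rng.standard_normal(d)

def obj(v):
    return np.sum((W @ Y + np.outer(v, np.ones(L)) - X) ** 2)

h = 1e-6
grad_fd = np.array([(obj(v + h * e) - obj(v - h * e)) / (2 * h) for e in np.eye(d)])
grad_formula = 2 * (W @ Y + np.outer(v, np.ones(L)) - X) @ np.ones(L)
print(np.allclose(grad_fd, grad_formula))
```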

From this we follow the notes: expanding the normal equation gives $WY\mathbf{1}_L + Lv - X\mathbf{1}_L = 0$, hence $$v = \mu_x^{\text{emp}} - W \mu_y^\text{emp} \, .$$
(The $\mu$s are empirical averages over the $L$ instances, i.e. $\mu_x^{\text{emp}} = \tfrac{1}{L} X\mathbf{1}_L$ and $\mu_y^{\text{emp}} = \tfrac{1}{L} Y\mathbf{1}_L$.)
At this point, the professor “plugs this back in and solves for $W$.”

To repeat the steps above we need the Jacobian $J_W(WY)$. This is a $(d \times L) \times (d \times n)$ tensor. The first order condition with respect to $W$ is a condition on the $1\times (d \times n)$ composite Jacobian of the squared Frobenius norm with respect to $W$. The $(i,j)$ partial, $(i,j) \in \{1,\ldots, d\} \times \{1, \ldots, n\}$ (the effect of varying the $(i,j)$ component of $W$), is the matrix $e_{i,j} Y = e_i y_j^T$, which places the $j$th row of $Y$ in the $i$th row. (By $e_{i,j}$ I denote the matrix with a $1$ in the $(i,j)$ place and zeros elsewhere; here $y_j^T$ is, by a slight abuse of the earlier notation, the $j$th row of $Y$.) So again applying the chain rule we sum out the $(d \times L)$ indices

\begin{align}
& \ \sum_{k\leq d,\, l \leq L} (WY + v \mathbf{1}^T_L - X)_{kl} (e_i y_j^T)_{kl} = 0 \\
\Leftrightarrow & \ \sum_{k,l} \bigl(WY + (\mu_x^{\text{emp}} - W \mu_y^\text{emp}) \mathbf{1}^T_L - X \bigr)_{kl} (e_i y_j^T)_{kl} = 0 \\
\Rightarrow & \ \sum_{l} \bigl(WY + (\mu_x^{\text{emp}} - W \mu_y^\text{emp}) \mathbf{1}^T_L - X \bigr)_{il} (e_i y_j^T)_{il} = 0,
\end{align}

where in the last step we dropped the terms that are multiplied by zero. The last sum is the dot product between the $j$th row of $Y$ and the $i$th row of $\bigl(WY + (\mu_x^{\text{emp}} - W \mu_y^\text{emp}) \mathbf{1}^T_L - X \bigr)$. Since this holds for all $i$ and $j$, we can write

\begin{align*}
&\ \bigl(WY + (\mu_x^{\text{emp}} - W \mu_y^\text{emp}) \mathbf{1}^T_L - X \bigr) Y^T = 0^{d \times n} \\
\Rightarrow & \ W \bigl[ Y Y^T - \mu_y^\text{emp} \mathbf{1}^T Y^T \bigr] = X Y^T - \mu_x^\text{emp}\mathbf{1}^T Y^T
\end{align*}

The top equation is exactly what we would have written if we had simply said that the Jacobian of $WY$ with respect to $W$ is $Y^T$.
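
Concretely, the contraction of $A = WY + v\mathbf{1}^T_L - X$ with $e_i y_j^T$ is just the $(i,j)$ entry of $AY^T$, which a two-line check confirms (arbitrary data):

```python
import numpy as np

# The contraction sum_{k,l} A_{kl} (e_i y_j^T)_{kl} equals (A Y^T)_{ij}.
rng = np.random.default_rng(3)
d, n, L = 3, 4, 10
A = rng.standard_normal((d, L))   # stands in for WY + v 1^T - X
Y = rng.standard_normal((n, L))

i, j = 1, 2
E = np.zeros((d, L))
E[i, :] = Y[j, :]                 # e_i y_j^T: the j-th row of Y placed in row i
print(np.sum(A * E), (A @ Y.T)[i, j])   # identical values
```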

The rest of the post is less important.

Note that $\mathbf{1}^T Y^T = (Y \mathbf{1})^T = L (\mu_y^\text{emp})^T$. Let's drop the "emp" superscripts from here on.

\begin{align*}
\Rightarrow & \ W \bigl[ Y Y^T - L \mu_y \mu_y^T \bigr] = X Y^T - L \mu_x \mu_y^T \, .
\end{align*}
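
To convince myself this closed form is right, here is a short NumPy sketch (dimensions and data are arbitrary) that builds $W$ and $v$ from these formulas and compares them against a direct least-squares fit:

```python
import numpy as np

# Solve the centered normal equation for W, then recover v, and compare
# with a direct least-squares fit. Dimensions and data are arbitrary.
rng = np.random.default_rng(4)
d, n, L = 3, 4, 50
X, Y = rng.standard_normal((d, L)), rng.standard_normal((n, L))
mu_x, mu_y = X.mean(axis=1), Y.mean(axis=1)

M = Y @ Y.T - L * np.outer(mu_y, mu_y)        # Y Y^T - L mu_y mu_y^T
B = X @ Y.T - L * np.outer(mu_x, mu_y)        # X Y^T - L mu_x mu_y^T
W = np.linalg.solve(M.T, B.T).T               # W M = B  =>  W = B M^{-1}
v = mu_x - W @ mu_y

# Direct fit of [W v] against [Y; 1^T] for comparison.
Wv = np.linalg.lstsq(np.vstack([Y, np.ones((1, L))]).T, X.T, rcond=None)[0].T
print(np.allclose(W, Wv[:, :n]), np.allclose(v, Wv[:, n]))
```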

In retrospect perhaps it would have been best to just acknowledge that the problem is separable in the rows of $W$, following e.g. the Willsky, Wornell and Shapiro notes, ch 3 p. 126.

Or perhaps we could have directly derived the normal equations using an orthogonality argument with respect to the trace inner product? It seems intuitive, since $Y$ and $\mathbf{1}$ are … they're like the span of something….
That would give:
\begin{align*}
& \ \text{Tr}\bigl( (W Y + v \mathbf{1}^T_L - X)^T \mathbf{1}^T \bigr) = 0 \\
\Rightarrow &\ \mathrm{Tr}\Bigl[ \mathbf{1}^T WY + \mathbf{1}^T v \mathbf{1}^T - \mathbf{1}^T X^T \Bigr] = 0 \\
\end{align*}

I don't see it.

Best Answer

I'll assume our model is $y_i \approx W x_i + v$, and we want to compute $W$ and $v$. We can simplify notation by including $v$ in $W$ (as a final column) and adding a component equal to $1$ at the end of each vector $x_i$. Our goal is to minimize $$ f(W) = \frac12 \| W X - Y \|_F^2. $$ We can do this by simply solving the equation $\nabla f(W) = 0$, but first we need to evaluate $\nabla f(W)$.

Notice that \begin{align} f(W + \Delta W) &= \frac12 \| WX - Y + \Delta W X \|_F^2 \\ &=\frac12 \underbrace{\| WX - Y \|_F^2}_{f(W)} + \underbrace{\langle WX - Y, \Delta W X \rangle}_{\text{Tr}((WX - Y)^T \Delta W X)} + \underbrace{\frac12 \| \Delta W X \|_F^2}_{\text{negligible}} \\ &\approx f(W) + \text{Tr}((WX - Y)^T \Delta W X) \\ &= f(W) + \text{Tr}(X (WX - Y)^T \Delta W) \\ &= f(W) + \langle (WX - Y) X^T, \Delta W \rangle. \end{align} In the second to last step we used the fact that $\text{Tr}(AB) = \text{Tr}(BA)$. The inner product denotes the Frobenius inner product, and the trace expression $\langle A, B \rangle = \text{Tr}(A^T B)$ is a standard formula for the Frobenius inner product.
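
If you want to sanity-check this gradient formula numerically, here is a small finite-difference sketch (the shapes and data are arbitrary, and the appended row of ones is irrelevant for this check):

```python
import numpy as np

# Finite-difference check that grad f(W) = (W X - Y) X^T for
# f(W) = (1/2) ||W X - Y||_F^2. Shapes and data are arbitrary.
rng = np.random.default_rng(5)
d, n, L = 3, 4, 10
X, Y = rng.standard_normal((d, L)), rng.standard_normal((n, L))
W = rng.standard_normal((n, d))

f = lambda W: 0.5 * np.sum((W @ X - Y) ** 2)
grad = (W @ X - Y) @ X.T

dW = rng.standard_normal(W.shape)          # arbitrary direction
h = 1e-6
fd = (f(W + h * dW) - f(W - h * dW)) / (2 * h)
print(np.isclose(fd, np.sum(grad * dW)))   # <grad f(W), dW> matches
```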

Comparing with the equation $$ f(W + \Delta W) \approx f(W) + \langle \nabla f(W), \Delta W \rangle, $$ we discover that $$ \nabla f(W) = (WX - Y)X^T. $$ So, the normal equations are $$ (WX - Y) X^T = 0, $$ or equivalently $W X X^T = Y X^T$.
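
And, spelled out in NumPy (again with arbitrary data), solving $W X X^T = Y X^T$ with the row of ones appended reproduces what a generic least-squares routine returns, with the last column of $W$ playing the role of $v$:

```python
import numpy as np

# Append the constant-1 component to each x_i, solve W X X^T = Y X^T,
# and compare with a generic least-squares routine. Data are arbitrary.
rng = np.random.default_rng(6)
d, n, L = 3, 4, 50
X0, Y = rng.standard_normal((d, L)), rng.standard_normal((n, L))
X = np.vstack([X0, np.ones((1, L))])             # (d+1) x L

W = np.linalg.solve(X @ X.T, X @ Y.T).T          # solves W X X^T = Y X^T
W_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
print(np.allclose(W, W_lstsq))
```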