Machine Learning – How Does Early Stopping Act as a Regularizer?

gradient-descent, machine-learning, mathematical-statistics, neural-networks, regularization

Reading the book "Deep Learning" by Goodfellow et al., in section 7.8 (for reference: Deep Learning – Chapter 7), I came across a demonstration of why early stopping can be interpreted as a regularization method. For a linear model, under a local Taylor (quadratic) approximation of the loss function, the "distance" between the weight vector after $\tau$ gradient-descent steps and the optimal (zero-gradient) weight vector $w^*$ satisfies:
\begin{align}
w^{(\tau)} - w^*=(I-\epsilon H)(w^{(\tau-1)}-w^*),
\end{align}

where $\epsilon$ is the learning rate and $H$ is the Hessian matrix of the loss function with respect to $w$, evaluated at $w^*$. After the eigendecomposition $H=Q\Lambda Q^T$, this becomes:
\begin{align}
Q^T(w^{(\tau)}-w^*) = (I-\epsilon \Lambda)Q^T(w^{(\tau-1)} - w^*).
\end{align}
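
This recurrence can be sanity-checked numerically. Below is a minimal sketch, assuming a quadratic loss $J(w)=\frac{1}{2}(w-w^*)^T H (w-w^*)$ with a symmetric positive definite Hessian; all names and numbers are illustrative, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quadratic loss J(w) = 0.5 * (w - w_star)^T H (w - w_star)
# with a symmetric positive definite Hessian H and learning rate eps.
d = 4
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)                    # symmetric positive definite Hessian
w_star = rng.standard_normal(d)                # zero-gradient (optimal) point
eps = 0.01                                     # learning rate

w_prev = rng.standard_normal(d)                # arbitrary iterate w^(tau-1)
w_curr = w_prev - eps * H @ (w_prev - w_star)  # one gradient-descent step

# Recurrence: w^(tau) - w^* = (I - eps*H)(w^(tau-1) - w^*)
assert np.allclose(w_curr - w_star,
                   (np.eye(d) - eps * H) @ (w_prev - w_star))

# Rotated coordinates: Q^T(w^(tau) - w^*) = (I - eps*Lambda) Q^T(w^(tau-1) - w^*)
lam, Q = np.linalg.eigh(H)                     # H = Q diag(lam) Q^T
assert np.allclose(Q.T @ (w_curr - w_star),
                   (np.eye(d) - eps * np.diag(lam)) @ Q.T @ (w_prev - w_star))
```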

The author, then, assuming $\epsilon$ small enough that $|1-\epsilon\lambda_i|<1$ for all eigenvalues $\lambda_i$ of $H$, and initial point $w^{(0)}=0$, obtains the following identity:
\begin{align}
Q^T w^{(\tau)}=[I-(I-\epsilon\Lambda)^\tau]Q^T w^*.
\end{align}
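
This identity can likewise be checked numerically by running $\tau$ gradient-descent steps from $w^{(0)}=0$; again a sketch with illustrative values, not code from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps, tau = 4, 0.01, 50                      # illustrative dimension, step size, step count

M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)                    # symmetric positive definite Hessian
w_star = rng.standard_normal(d)                # zero-gradient point
lam, Q = np.linalg.eigh(H)                     # eigendecomposition H = Q Lambda Q^T
Lam = np.diag(lam)

# Run tau gradient-descent steps on the quadratic loss, starting from w^(0) = 0.
w = np.zeros(d)
for _ in range(tau):
    w -= eps * H @ (w - w_star)

# Closed form: Q^T w^(tau) = [I - (I - eps*Lambda)^tau] Q^T w^*
closed_form = (np.eye(d)
               - np.linalg.matrix_power(np.eye(d) - eps * Lam, tau)) @ Q.T @ w_star
assert np.allclose(Q.T @ w, closed_form)
```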

I tried to prove it this way, partially by "reverse engineering": with $A=I-\epsilon\Lambda$
\begin{align}
Q^T w^{(\tau)} &=AQ^T w^{(\tau-1)} + (I-A)Q^T w^*,\\
Q^Tw^{(1)}&=(I-A)Q^T w^*, \text{ because $w^{(0)}=0$},\\
Q^T w^{(2)} &=AQ^Tw^{(1)}+(I-A)Q^Tw^*\\
&=A\color{red}{Q^T}(I-A)Q^Tw^*+(I-A)Q^Tw^*\\
&=[A\color{red}{Q^T}(I-A)+(I-A)]Q^Tw^*\\
&=(A\color{red}{Q^T}-A\color{red}{Q^T}A+I-A)Q^Tw^*\\
&=[I-(A\color{red}{Q^T}A-A\color{red}{Q^T}+A)]Q^Tw^*.
\end{align}

How does (or should it?) $(AQ^TA-AQ^T+A)$ equal $A^2$, given the assumptions?

Edit: The expression I obtained for $Q^Tw^{(2)}$ is wrong: the quantity I should have substituted is $Q^Tw^{(1)} = (I-A)Q^Tw^*$, not $\color{red}{Q^T}(I-A)Q^Tw^*$, so the $\color{red}{Q^T}$ factors highlighted in red above are spurious.

I propose to continue with a proof by induction of the proposition
\begin{align}
Q^Tw^{(\tau)}=(I-A^\tau)Q^Tw^*,
\end{align}

whose base case ($\tau=1$) has already been demonstrated above.
Assuming the proposition holds for $\tau$, we check that it also holds for $\tau+1$:
\begin{align}
Q^Tw^{(\tau+1)}&=AQ^Tw^{(\tau)}+(I-A)Q^Tw^*\\
&= A(I-A^\tau)Q^Tw^*+(I-A)Q^Tw^*\\
&= (A-A^{\tau+1}+I-A)Q^Tw^*\\
&= (I-A^{\tau+1})Q^Tw^*\\
&= [I-(I-\epsilon\Lambda)^{\tau+1}]Q^Tw^* \quad\square
\end{align}
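
The key algebraic step in the induction, $A(I-A^\tau)+(I-A)=I-A^{\tau+1}$, can also be checked numerically for a diagonal $A=I-\epsilon\Lambda$; the eigenvalues below are made up for illustration.

```python
import numpy as np

# Check the identity A(I - A^tau) + (I - A) = I - A^(tau+1)
# for A = I - eps*Lambda with some illustrative eigenvalues of H.
eps, tau = 0.01, 7
lam = np.array([0.5, 2.0, 10.0, 40.0])         # assumed eigenvalues (|1 - eps*lam| < 1)
A = np.eye(len(lam)) - eps * np.diag(lam)      # A = I - eps*Lambda
I = np.eye(len(lam))

lhs = A @ (I - np.linalg.matrix_power(A, tau)) + (I - A)
rhs = I - np.linalg.matrix_power(A, tau + 1)
assert np.allclose(lhs, rhs)
```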

Best Answer

For brevity, let us denote $u^{(\tau)} = Q^T w^{(\tau)}$ and $u^{*} = Q^T w^{*}$. The equation connecting steps $\tau$ and $\tau - 1$ can then be rewritten as: $$ u^{(\tau)} = (I - \epsilon \Lambda)\, u^{(\tau-1)} + \epsilon \Lambda\, u^{*}. $$ Let us suppose that the solution has the form (the repeated multiplication by $I - \epsilon \Lambda$ in the iteration makes this ansatz natural): $$ u^{(\tau)} = (I - \epsilon \Lambda)^{\tau} x + y $$ for some vectors $x, y$. Since $u^{(0)} = 0$, we have $y = -x$.

Substituting this ansatz into the recurrence gives: $$ (I - \epsilon \Lambda)^{\tau} x - x = (I - \epsilon \Lambda)^{\tau} x - (I - \epsilon \Lambda) x + \epsilon \Lambda u^{*}. $$ Hence: $$ x = -u^{*} \quad\Rightarrow\quad u^{(\tau)} = \left(I - (I - \epsilon \Lambda)^{\tau}\right) u^{*}. $$ Returning to the original variables $w$, we obtain the desired result.
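
As a quick check of this answer, one can verify numerically that $u^{(\tau)} = \left(I-(I-\epsilon\Lambda)^{\tau}\right)u^{*}$ does satisfy the recurrence $u^{(\tau)} = (I-\epsilon\Lambda)u^{(\tau-1)} + \epsilon\Lambda u^{*}$ with $u^{(0)}=0$; the values below are purely illustrative.

```python
import numpy as np

eps = 0.05
lam = np.array([0.3, 1.0, 5.0])                # illustrative eigenvalues of H
Lam = np.diag(lam)
I = np.eye(len(lam))
u_star = np.array([1.0, -2.0, 0.5])            # illustrative Q^T w^*

def u(tau):
    """Proposed closed form u^(tau) = (I - (I - eps*Lambda)^tau) u^*."""
    return (I - np.linalg.matrix_power(I - eps * Lam, tau)) @ u_star

assert np.allclose(u(0), np.zeros_like(u_star))   # initial condition u^(0) = 0
for tau in range(1, 6):                           # recurrence holds step by step
    assert np.allclose(u(tau), (I - eps * Lam) @ u(tau - 1) + eps * Lam @ u_star)
```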
