Solved – Preconditioning gradient descent

gradient descent, optimization

If one is using gradient descent to optimize over a vector space where the components have very different magnitudes, I know we can use a preconditioning matrix $P$ so that the update step becomes:

$$x_{n+1} = x_n -\gamma_n P^{-1}\ \nabla F(x_n)$$

The obvious approach is to make $P$ a diagonal matrix whose entries are inversely proportional to the approximate magnitudes of the components of $x$, so that $Px\approx \bar{1}$.
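For concreteness, here is a minimal sketch of this update in Python/NumPy, using a made-up quadratic objective whose curvature happens to be inversely proportional to each coordinate's typical magnitude, so that the simple scale-based diagonal $P$ works well here:

```python
import numpy as np

# Made-up example: x[0] is O(1) while x[1] is O(1000), and the quadratic's
# curvature in each coordinate is (by construction) inversely proportional
# to that coordinate's typical magnitude.
scales = np.array([1.0, 1000.0])
H = np.diag(1.0 / scales)              # Hessian of F
x_star = np.array([2.0, 1500.0])       # minimizer
grad_F = lambda x: H @ (x - x_star)

# Diagonal preconditioner: P = diag(1/scales), i.e. P^{-1} = diag(scales),
# so each coordinate gets its own effective step size gamma * scales[i].
P_inv = np.diag(scales)

gamma = 0.5
x_plain = np.zeros(2)
x_pre = np.zeros(2)
for _ in range(100):
    x_plain = x_plain - gamma * grad_F(x_plain)       # vanilla gradient descent
    x_pre = x_pre - gamma * P_inv @ grad_F(x_pre)     # x_{n+1} = x_n - gamma P^{-1} grad F(x_n)

print(np.linalg.norm(x_plain - x_star))  # still far off in the badly scaled coordinate
print(np.linalg.norm(x_pre - x_star))    # essentially converged
```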

Are there any other suggested methods for choosing $P$?

Do some of these methods lead to non-diagonal matrices?

Best Answer

Your question suggests a diagonal $P$, which gives you a parameter-dependent learning rate; this is like normalizing your input, except that it assumes the input has diagonal covariance.

In general, using a preconditioning matrix $P$ is equivalent to normalizing $x$. Let the covariance of $x$ be $\Sigma = L L^T$; then $\tilde{x} = L^{-1}x$ is the normalized (whitened) version of $x$. Since $x = L\tilde{x}$, the chain rule gives the gradient step in the whitened coordinates as

$$ \Delta \tilde{x} = \nabla_{\tilde{x}} F(x) = L^{T} \nabla_{x} F(x) $$

so

$$ \Delta x = L \Delta \tilde{x} = LL^{T} \nabla_{x} F(x) = \Sigma \nabla_{x} F(x) $$

Doing this makes your objective (more) isotropic in the parameter space. It's the same idea as a parameter-dependent learning rate, except that the rescaling axes no longer have to line up with the coordinate axes.
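Here is a quick numerical check of this equivalence (a sketch with an arbitrary made-up covariance and quadratic objective): one plain gradient step taken in the whitened variable $\tilde{x} = L^{-1}x$ and mapped back to $x$ coincides exactly with the preconditioned step $x - \gamma\,\Sigma \nabla_x F(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up covariance Sigma = L L^T and an arbitrary quadratic objective F.
M = rng.standard_normal((3, 3))
Sigma = M @ M.T + 3 * np.eye(3)
L = np.linalg.cholesky(Sigma)

Q = rng.standard_normal((3, 3))
H = Q @ Q.T + np.eye(3)                  # Hessian of F(x) = 0.5 x^T H x
grad_F = lambda x: H @ x

x = rng.standard_normal(3)
gamma = 0.1

# (1) One gradient step on the whitened variable x_tilde = L^{-1} x,
#     using the chain rule grad_{x_tilde} F = L^T grad_x F, mapped back to x.
x_tilde = np.linalg.solve(L, x)
x_tilde_new = x_tilde - gamma * (L.T @ grad_F(x))
step_via_whitening = L @ x_tilde_new

# (2) One preconditioned step directly in x: x - gamma * Sigma * grad F(x).
step_via_precond = x - gamma * Sigma @ grad_F(x)

print(np.allclose(step_via_whitening, step_via_precond))  # True
```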

Here's an image showing a situation where you would need one learning rate along the line $y = x$ and another along the line $y = -x$, and how the transformation $L = ( \sigma_1 + \sigma_3 ) \operatorname{diag}(1, \sqrt{10})$ solves that problem.

(image not shown)
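Reading $\sigma_1$ and $\sigma_3$ above as the Pauli matrices (an assumption about the notation), $\sigma_1 + \sigma_3$ maps the coordinate axes onto the diagonals $y = \pm x$, and a quick check confirms that $\Sigma = LL^T$ has its principal axes along those diagonals with a 10:1 variance ratio, which $L^{-1}$ undoes exactly:

```python
import numpy as np

# Assumption: sigma_1 and sigma_3 denote the Pauli matrices.
sigma_1 = np.array([[0.0, 1.0], [1.0, 0.0]])
sigma_3 = np.array([[1.0, 0.0], [0.0, -1.0]])

L = (sigma_1 + sigma_3) @ np.diag([1.0, np.sqrt(10.0)])
Sigma = L @ L.T

# Principal axes of Sigma lie along y = x and y = -x, with a 1:10 variance ratio.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)   # approximately [ 2., 20.]
print(eigvecs)   # columns proportional to (1, 1) and (1, -1)

# L^{-1} whitens Sigma exactly, so the transformed problem is isotropic.
L_inv = np.linalg.inv(L)
print(np.allclose(L_inv @ Sigma @ L_inv.T, np.eye(2)))  # True
```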

Another way you could look at this is that Newton's method would give you the optimization step $$ x_{n+1} = x_n - \gamma_n [H F|_{x_n}]^{-1} \nabla F(x_n). $$ Approximating the Hessian as constant near the minimum, with $P \approx H F|_{x^\star}$, brings you closer to the fast convergence of Newton's method without having to compute the Hessian at every step or maintain the more computationally expensive running approximations of the Hessian used in quasi-Newton methods.
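As a sketch of that idea (a made-up quadratic, so the fixed preconditioner happens to be exact): compute $P \approx H F|_{x^\star}$ once and reuse it in every step, rather than re-evaluating the Hessian as Newton's method would.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up ill-conditioned quadratic: F(x) = 0.5 (x - x_star)^T H (x - x_star).
Q = rng.standard_normal((4, 4))
H = Q @ Q.T + 0.01 * np.eye(4)
x_star = rng.standard_normal(4)
grad_F = lambda x: H @ (x - x_star)

# Fix the preconditioner once, as an approximation of the Hessian near the
# minimum (exact here because F is quadratic; in general it would only be
# an estimate that is never updated during the iteration).
P = H.copy()
P_inv = np.linalg.inv(P)

x = np.zeros(4)
gamma = 1.0
for _ in range(20):
    x = x - gamma * P_inv @ grad_F(x)   # x_{n+1} = x_n - gamma P^{-1} grad F(x_n)

print(np.linalg.norm(x - x_star))       # ~0: Newton-like convergence with P computed once
```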

Note that for a normal distribution the Hessian of the log-loss (negative log-likelihood) is $H = \Sigma^{-1}$, so these two perspectives are equivalent.
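For completeness, writing out the Gaussian negative log-likelihood (up to an additive constant) makes that claim immediate:

$$ -\log p(x) = \tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu) + \text{const}, \qquad \nabla_x\bigl(-\log p(x)\bigr) = \Sigma^{-1}(x-\mu), \qquad \nabla_x^{2}\bigl(-\log p(x)\bigr) = \Sigma^{-1}. $$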