Solved – Why update all parameters at each step of the Adam optimiser even when we have sparse observations

adam, optimization, stochastic gradient descent

Adam is a method for stochastic optimisation. The algorithm is given below.

[Figure: pseudocode of the Adam algorithm]

Consider our parameters that we wish to optimise

$$\boldsymbol{\theta} = [\theta_1, \theta_2]$$

observations

$$\boldsymbol{x}_1, \dots, \boldsymbol{x}_N \in \mathbb{R}^2$$

and target values

$$ y_1, \dots, y_N \in \mathbb{R} $$

such that our stochastic objective function is

$$ f_t(\boldsymbol{\theta}) = (y_{I_t} - \boldsymbol{\theta} \cdot \boldsymbol{x}_{I_t})^2 $$

for some indexing set $I$ of the targets and observations.
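
For reference, spelling out the gradient of this objective (this is just the standard least-squares gradient, not stated explicitly above):

$$ \nabla_{\boldsymbol{\theta}} f_t(\boldsymbol{\theta}) = -2\,\bigl(y_{I_t} - \boldsymbol{\theta} \cdot \boldsymbol{x}_{I_t}\bigr)\,\boldsymbol{x}_{I_t} $$

so whenever a component of $\boldsymbol{x}_{I_t}$ is zero, the corresponding component of the gradient is zero as well.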

What I am now confused about is the case where one of the dimensions of our observations $\boldsymbol{x}_i$ is sparse. Say it is the first dimension; then in most cases the stochastic objective function would be

$$ f_t(\boldsymbol{\theta}) = (y_{I_t} - \theta_2 \cdot (\boldsymbol{x}_{I_t})_2)^2 \tag{1}$$

and only rarely

$$ f_t(\boldsymbol{\theta}) = (y_{I_t} - \theta_1 \cdot (\boldsymbol{x}_{I_t})_1 - \theta_2 \cdot (\boldsymbol{x}_{I_t})_2)^2 \tag{2}$$

In Adam, all dimensions of $\boldsymbol{\theta}$ are updated every time an observation is processed (via their momentum terms), even when the corresponding gradient component is zero. My question is then: why not treat the optimisation along each dimension according to the sparsity structure, i.e. only update $\theta_1$ in case (2)?
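
To make the concern concrete, here is a minimal NumPy sketch of plain (non-lazy) Adam on this two-parameter least-squares problem; the toy data, seed and hyperparameters are illustrative, not taken from anywhere in particular. Even on a step where $(\boldsymbol{x}_{I_t})_1 = 0$, so the current gradient component for $\theta_1$ is zero, the stored moment estimates still produce a non-zero update for $\theta_1$.

```python
import numpy as np

# Illustrative toy data: the first feature is sparse (mostly zero).
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([
    rng.binomial(1, 0.05, size=N) * rng.normal(size=N),  # sparse dimension 1
    rng.normal(size=N),                                   # dense dimension 2
])
true_theta = np.array([2.0, -1.0])
y = X @ true_theta + 0.1 * rng.normal(size=N)

# Adam hyperparameters (the defaults from Kingma & Ba, 2015).
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

theta = np.zeros(2)
m = np.zeros(2)  # first-moment (momentum) estimate
v = np.zeros(2)  # second-moment estimate

for t in range(1, 1001):
    i = rng.integers(N)                       # draw the index I_t
    g = -2.0 * (y[i] - theta @ X[i]) * X[i]   # gradient of f_t(theta)

    m = beta1 * m + (1 - beta1) * g           # biased moment updates
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)

    step = alpha * m_hat / (np.sqrt(v_hat) + eps)
    theta -= step
    # Note: when X[i, 0] == 0 we have g[0] == 0, yet step[0] is generally
    # non-zero, because m[0] and v[0] still carry earlier gradients.
```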

Best Answer

EDIT:

Apparently TensorFlow has a "LazyAdamOptimizer" that only applies gradient and moving-average updates to the slices of variables whose indices appear in the current batch.

LazyAdamOptimizer

This may be a good idea for very sparse data, such as in language models.
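
As a hedged usage sketch (this assumes TensorFlow 2.x with the separate TensorFlow Addons package, where the lazy variant lives as `tfa.optimizers.LazyAdam`), it can be dropped in wherever a Keras optimizer is expected, e.g. for a large embedding table of which only a few rows are touched in each batch:

```python
import tensorflow as tf
import tensorflow_addons as tfa  # assumed: provides tfa.optimizers.LazyAdam

# Toy model with a large, sparsely-updated embedding layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=50_000, output_dim=64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])

# LazyAdam only updates the embedding rows whose indices appear in the
# current batch; standard Adam would nudge every row via its momentum.
model.compile(optimizer=tfa.optimizers.LazyAdam(learning_rate=1e-3),
              loss="mse")
```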

Otherwise, here is my original response.

Original:

In the general case you do not know which parts of the input, if any, are sparse, so assuming they are not makes the algorithm more generally applicable.

Furthermore, the momentum serves the purpose of "remembering" previous gradients. Since stochastic gradient descent trains on a different example at each step, the momentum helps smooth out the updates.

So if an input is only rarely present, then the momentum will help update the weights corresponding to it more often. Otherwise these weights would take a lot longer to converge.
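
To see roughly how that plays out, consider a single component whose gradient is non-zero at step $t$ and exactly zero for the next $k$ steps. Ignoring bias correction and $\epsilon$, the moment estimates simply decay,

$$ m_{t+k} = \beta_1^{\,k}\, m_t, \qquad v_{t+k} = \beta_2^{\,k}\, v_t, $$

so the step for that component is proportional to $\beta_1^{k} m_t / \sqrt{\beta_2^{k} v_t}$, which with the default $\beta_1 = 0.9$, $\beta_2 = 0.999$ shrinks roughly like $0.9^k$: non-zero for a while, then effectively vanishing. That is what keeps rarely-active weights moving between their occurrences.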
