Solved – Basic questions about stochastic gradient descent / Robbins and Monro algorithm

algorithms, estimation, gradient descent, online-algorithms, time series

I have a LOT of time series observations and I would like to estimate a simple AR(1) model
$$
y_t =c+ \phi y_{t-1}+ \varepsilon_t \qquad \varepsilon_t \sim \text{N}(0, \sigma^{2})
$$
with parameters $\theta = \left\{ c, \phi, \sigma^{2} \right\}$. Because of the number of observations, I would like to use stochastic gradient descent (or the "deterministic" Robbins-Monro algorithm) to estimate the model
$$
\theta^{i}_{t+1} = \theta^{i}_t + \gamma_t s^{i}_t , \quad \sum \gamma_t =\infty \quad \text{and}\quad \sum \gamma^{2}_t <\infty
$$
where $\theta^{i}$ is an element of $\theta$ and $s^{i}_t$ is the derivative of the log-likelihood contribution at time $t$ with respect to $\theta^{i}$.
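
For concreteness, here is a minimal sketch of what this recursion looks like in Python for the Gaussian AR(1) likelihood above. The score formulas come from differentiating the per-observation log density $\ell_t = -\tfrac12 \log(2\pi\sigma^2) - (y_t - c - \phi y_{t-1})^2 / (2\sigma^2)$; the function names, the single-pass loop, and the floor on $\sigma^2$ are illustrative choices, not a definitive implementation:

```python
import numpy as np

def ar1_score(theta, y_prev, y_curr):
    """Gradient of the Gaussian log-likelihood contribution at time t,
    with theta = (c, phi, sigma2)."""
    c, phi, sigma2 = theta
    eps = y_curr - c - phi * y_prev                 # one-step prediction error
    return np.array([
        eps / sigma2,                               # d l_t / d c
        eps * y_prev / sigma2,                      # d l_t / d phi
        -0.5 / sigma2 + 0.5 * eps**2 / sigma2**2,   # d l_t / d sigma2
    ])

def sgd_ar1(y, theta0, gamma=lambda t: t ** (-2 / 3)):
    """One pass of the Robbins-Monro recursion (gradient *ascent*,
    since we maximise the log-likelihood)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, len(y)):
        theta += gamma(t) * ar1_score(theta, y[t - 1], y[t])
        theta[2] = max(theta[2], 1e-8)              # keep sigma2 positive
    return theta
```

With a single global $\gamma_t$ for all three coordinates this recursion can diverge in some coordinates while crawling in others, which is exactly the issue in my second question below.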

I have the following questions:

  • First of all, is this supposed to work? What alternative online estimation procedures are out there? I have the feeling that this is really not robust and needs a lot of case-by-case tuning.

  • How should I set $\gamma_t$? I know the standard choice is $\gamma_t = t^{-2/3}$, which satisfies the requirements. My experience so far is that I should set $\gamma_t$ differently for different parameters, as the data is more informative about the constant $c$ than about $\sigma^2$ and the gradients have totally different magnitudes. If I use the same $\gamma_t$ for all of them, some parameters diverge (the step size is too large) while others converge slowly. Is there a "stochastic Newton-Raphson" type algorithm that would scale the gradients properly (see the sketch after this list)? I guess trying different combinations would work, but there are too many combinations if I set a different $\gamma_t$ for each parameter.
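
For illustration, here is one per-parameter scaling I could try: an AdaGrad-style accumulator (a standard adaptive step-size scheme, not something specific to this model), which gives coordinate $i$ an effective step size $\gamma_0 / \sqrt{\sum_{u \le t} (s^i_u)^2}$, so coordinates with large gradients automatically get small steps. This reuses `ar1_score` and `numpy` from the sketch above:

```python
def adagrad_ar1(y, theta0, gamma0=0.1, eps=1e-8):
    """SGD with AdaGrad-style per-parameter step sizes (gradient ascent)."""
    theta = np.asarray(theta0, dtype=float)
    g2 = np.zeros_like(theta)                       # accumulated squared scores
    for t in range(1, len(y)):
        s = ar1_score(theta, y[t - 1], y[t])
        g2 += s ** 2
        theta += gamma0 * s / (np.sqrt(g2) + eps)   # per-coordinate scaling
        theta[2] = max(theta[2], 1e-8)              # keep sigma2 positive
    return theta
```

A closer analogue of Newton-Raphson would instead divide each score by a running estimate of the corresponding diagonal entry of the Fisher information, which plays the role of the Hessian here.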

Any comment is appreciated!

Thanks in advance!

Best Answer

One assumption of stochastic gradient descent is that the gradients are independent and identically distributed, i.e. $s^i_t$ is independent over $t$, so that the law of large numbers ensures the stochastic gradient is a good approximation of the true gradient.

For AR(1), and for most time series models, this does not hold: consecutive contributions share observations (for example, $y_{t-1}$ enters both $s_{t-1}$, as the current observation, and $s_t$, as the regressor), so the gradients are dependent across $t$.