Solved – Basic questions about stochastic gradient descent / Robbins and Monro algorithm

algorithms, estimation, gradient descent, online-algorithms, time series

I have a LOT of time series observations and I would like to estimate a simple AR(1) model
$$
y_t =c+ \phi y_{t-1}+ \varepsilon_t \qquad \varepsilon_t \sim \text{N}(0, \sigma^{2})
$$
with parameters $\theta = \left\{ c, \phi, \sigma^{2} \right\}$. Because of the number of observations, I would like to use stochastic gradient descent (or the "deterministic" Robbins-Monro algorithm) to estimate the model
$$
\theta^{i}_{t+1} = \theta^{i}_t + \gamma_t s^{i}_t , \quad \sum \gamma_t =\infty \quad \text{and}\quad \sum \gamma^{2}_t <\infty
$$
where $\theta^{i}$ is an element of $\theta$ and $s^{i}_t$ is the derivative of the log-likelihood contribution at time $t$ with respect to $\theta^{i}$.
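
For concreteness, here is a minimal sketch of what this recursion looks like in Python for the Gaussian AR(1) likelihood above. The score formulas come from differentiating the per-observation log density $\ell_t = -\tfrac12 \log(2\pi\sigma^2) - (y_t - c - \phi y_{t-1})^2 / (2\sigma^2)$; the function names, the single-pass loop, and the floor on $\sigma^2$ are illustrative choices, not a definitive implementation:

```python
import numpy as np

def ar1_score(theta, y_prev, y_curr):
    """Gradient of the Gaussian log-likelihood contribution at time t,
    with theta = (c, phi, sigma2)."""
    c, phi, sigma2 = theta
    eps = y_curr - c - phi * y_prev                 # one-step prediction error
    return np.array([
        eps / sigma2,                               # d l_t / d c
        eps * y_prev / sigma2,                      # d l_t / d phi
        -0.5 / sigma2 + 0.5 * eps**2 / sigma2**2,   # d l_t / d sigma2
    ])

def sgd_ar1(y, theta0, gamma=lambda t: t ** (-2 / 3)):
    """One pass of the Robbins-Monro recursion (gradient *ascent*,
    since we maximise the log-likelihood)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, len(y)):
        theta += gamma(t) * ar1_score(theta, y[t - 1], y[t])
        theta[2] = max(theta[2], 1e-8)              # keep sigma2 positive
    return theta
```

With a single global $\gamma_t$ for all three coordinates this recursion can diverge in some coordinates while crawling in others, which is exactly the issue in my second question below.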

I have the following questions:

  • First of all, is this supposed to work? What alternative online estimation procedures are out there? I have the feeling that this is really not robust and needs a lot of case-by-case tuning.

  • How should I set $\gamma_t$? I know the standard choice is $\gamma_t = t^{-2/3}$, which satisfies the requirements. My experience so far is that I should set $\gamma_t$ differently for different parameters, as the data is more informative about the constant $c$ than about $\sigma^2$ and the gradients have totally different magnitudes. If I use the same $\gamma_t$ for all of them, some parameters diverge (the step size is too large) while others converge slowly. Is there a "stochastic Newton-Raphson" type algorithm that would scale the gradients properly (see the sketch after this list)? I guess trying different combinations would work, but there are too many combinations if I set a different $\gamma_t$ for each parameter.
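
For illustration, here is one per-parameter scaling I could try: an AdaGrad-style accumulator (a standard adaptive step-size scheme, not something specific to this model), which gives coordinate $i$ an effective step size $\gamma_0 / \sqrt{\sum_{u \le t} (s^i_u)^2}$, so coordinates with large gradients automatically get small steps. This reuses `ar1_score` and `numpy` from the sketch above:

```python
def adagrad_ar1(y, theta0, gamma0=0.1, eps=1e-8):
    """SGD with AdaGrad-style per-parameter step sizes (gradient ascent)."""
    theta = np.asarray(theta0, dtype=float)
    g2 = np.zeros_like(theta)                       # accumulated squared scores
    for t in range(1, len(y)):
        s = ar1_score(theta, y[t - 1], y[t])
        g2 += s ** 2
        theta += gamma0 * s / (np.sqrt(g2) + eps)   # per-coordinate scaling
        theta[2] = max(theta[2], 1e-8)              # keep sigma2 positive
    return theta
```

A closer analogue of Newton-Raphson would instead divide each score by a running estimate of the corresponding diagonal entry of the Fisher information, which plays the role of the Hessian here.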

Any comment is appreciated!

Thanks in advance!

Best Answer

One assumption of stochastic gradient descent is that the gradients are independent and identically distributed, i.e. $s^i_t$ is independent over $t$, so that the law of large numbers ensures the stochastic gradient is a good approximation of the true gradient.

For AR(1), and for most time series models, this does not hold: consecutive contributions share observations (for example, $y_{t-1}$ enters both $s_{t-1}$, as the current observation, and $s_t$, as the regressor), so the gradients are dependent across $t$.