Solved – Regularization for ARIMA models

Tags: arima, lasso, regularization, ridge regression, time series

I am aware of LASSO, ridge, and elastic-net types of regularization in linear regression models.

Question:

  1. Can this (or a similar) kind of penalized estimation be applied to ARIMA modelling (with a non-empty MA part)?

In building ARIMA models, it seems usual to consider a pre-selected maximum lag order $(p_{max}, q_{max})$ and then choose some optimal order $p \leqslant p_{max}$ and $q \leqslant q_{max}$, e.g. by minimizing AIC or AICc. But could regularization be used instead?

My further questions are:

  1. Could we include all terms up to $(p_{max}, q_{max})$ but penalize the size of the coefficients (potentially all the way to zero)? Would that make sense?
  2. If so, has that been implemented in R or other software? If not, what is the obstacle?

A somewhat related post can be found here.

Best Answer

Answering Question 1.

Chen & Chan "Subset ARMA selection via the adaptive Lasso" (2011)* use a workaround to avoid the computationally demanding maximum likelihood estimation. Citing the paper, they

propose to find an optimal subset ARMA model by fitting an adaptive Lasso regression of the time series $y_t$ on its own lags and those of the residuals that are obtained from fitting a long autoregression to the $y_t$s. <...> [U]nder mild regularity conditions, the proposed method achieves the oracle properties, namely, it identifies the correct subset ARMA model with probability tending to one as the sample size increases to infinity, and <...> the estimators of the nonzero coefficients are asymptotically normal with the limiting distribution the same as that when the zero coefficients are known a priori.

Optionally, they suggest maximum likelihood estimation and model diagnostics for the selected subset ARMA model(s).
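To make the idea concrete, here is a minimal sketch of that workaround in R. The details are my own illustrative assumptions, not the exact settings of Chen & Chan: the long-AR order, the adaptive weights taken from an initial OLS fit, and the cross-validated tuning are all placeholders.

```r
library(glmnet)

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5, ma = 0.4), n = 500))

p_max <- 5; q_max <- 5

## Step 1: long autoregression; its residuals proxy the unobserved ARMA errors
long_order <- floor(1.5 * sqrt(length(y)))
ar_fit  <- ar(y, order.max = long_order, aic = FALSE, method = "ols")
eps_hat <- as.numeric(na.omit(ar_fit$resid))

## Align y and the residual series on their common (most recent) part
n <- length(eps_hat)
y <- tail(y, n)

## Step 2: regress y_t on its own lags and lagged residuals via adaptive Lasso
k <- max(p_max, q_max)
Y <- y[(k + 1):n]
X <- cbind(sapply(1:p_max, function(l) y[(k + 1 - l):(n - l)]),
           sapply(1:q_max, function(l) eps_hat[(k + 1 - l):(n - l)]))
colnames(X) <- c(paste0("ar", 1:p_max), paste0("ma", 1:q_max))

beta_init <- coef(lm(Y ~ X))[-1]          # initial estimates for the adaptive weights
w   <- 1 / pmax(abs(beta_init), 1e-6)     # adaptive Lasso penalty factors
fit <- cv.glmnet(X, Y, penalty.factor = w)
coef(fit, s = "lambda.min")               # zeroed entries = excluded AR/MA lags
```

Coefficients shrunk exactly to zero correspond to AR or MA lags excluded from the subset ARMA model; the surviving lags could then be re-estimated by maximum likelihood, as the authors suggest.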


Wilms et al. "Sparse Identification and Estimation of High-Dimensional Vector AutoRegressive Moving Averages" (2017) do even more than I asked for. Instead of a univariate ARIMA model, they take a vector ARMA (VARMA) model in high dimensions, and they use an $L_1$ penalty for estimation and lag order selection. They present the estimation algorithm and develop some asymptotic results.

In particular, they employ a two-stage procedure. Consider a VARMA model $$ y_t = \sum_{l=1}^p \Phi_l y_{t-l} + \sum_{m=1}^q \Theta_m \varepsilon_{t-m} + \varepsilon_t $$ which needs to be estimated, but the lag orders $p$ and $q$ are unknown.

  • In Stage 1, they approximate the VARMA model by a high-order VAR model and estimate it using a Hierarchical VAR estimator which places a lag-based hierarchical group-lasso penalty on the autoregressive parameters.
    (The lag order is set to $\lfloor 1.5\sqrt{T} \rfloor$. The model equations are estimated jointly: the Frobenius norm of the errors, $\|y-\hat y\|_F$, is minimized subject to a hierarchical group-lasso penalty on the regression coefficients.)
    They obtain residuals $\hat\varepsilon := y - \hat y$ to be used as proxies for the true errors in Stage 2.

  • In Stage 2, they estimate a VARX model where X represents lagged residuals from Stage 1. That is, they mimic the VARMA model but use the estimated residuals in place of the true errors, $$ y_t = \sum_{l=1}^{\hat p} \Phi_l y_{t-l} + \sum_{m=1}^{\hat q} \Theta_m \hat\varepsilon_{t-m} + u_t, $$ which allows applying the same estimator (hierarchical group-lasso) again, just like in Stage 1.
    ($\hat p$ and $\hat q$ are set to be $\lfloor 1.5\sqrt{T} \rfloor$. A simplified sketch of the two stages follows this list.)
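The following is a compact illustration of that two-stage logic on a small simulated bivariate series. It is not the authors' estimator: to keep it short, a plain equation-by-equation Lasso (glmnet) stands in for their joint hierarchical group-lasso, and the simulated data and lag truncation are my own choices.

```r
library(glmnet)

## Simulate a small bivariate VARMA(1,1)-type series (illustrative only)
set.seed(2)
T <- 300; k <- 2
E <- matrix(rnorm(T * k), T, k)
Y <- matrix(0, T, k)
for (t in 2:T) Y[t, ] <- 0.4 * Y[t - 1, ] + E[t, ] + 0.3 * E[t - 1, ]

## Helper: matrix whose row t stacks lags 1..L of all columns of M
lagmat <- function(M, L) {
  do.call(cbind, lapply(1:L, function(l)
    rbind(matrix(NA, l, ncol(M)), M[1:(nrow(M) - l), , drop = FALSE])))
}

P <- floor(1.5 * sqrt(T))   # lag order used in both stages, as in the paper

## Stage 1: long VAR (plain Lasso per equation) -> residual proxies for the errors
X1   <- lagmat(Y, P)
keep <- complete.cases(X1)
Ehat <- matrix(NA, T, k)
for (j in 1:k) {
  fit1 <- cv.glmnet(X1[keep, ], Y[keep, j])
  Ehat[keep, j] <- Y[keep, j] - predict(fit1, X1[keep, ], s = "lambda.min")
}

## Stage 2: "VARX" = lags of Y plus lags of the Stage-1 residuals
X2     <- cbind(lagmat(Y, P), lagmat(Ehat, P))
keep2  <- complete.cases(X2)
stage2 <- lapply(1:k, function(j) cv.glmnet(X2[keep2, ], Y[keep2, j]))
lapply(stage2, function(f) coef(f, s = "lambda.min"))   # sparse AR and MA blocks
```

In the actual estimator the penalty is hierarchical across lags, so distant lags are zeroed out before more recent ones, and the equations are penalized jointly rather than one at a time as above.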

The approach of Wilms et al. is implemented in the R package "bigtime".
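For completeness, a minimal usage sketch. The function and argument names below reflect my reading of the package documentation and may differ across package versions; treat them as assumptions and check the help pages.

```r
# install.packages("bigtime")
library(bigtime)

set.seed(3)
Y <- matrix(rnorm(200 * 3), 200, 3)   # placeholder data: 200 observations, 3 series

## Two-stage sparse VARMA fit (function name assumed from the package docs)
fit <- sparseVARMA(Y = Y)

## If available in your version, lagmatrix() shows which AR/MA lags were selected:
# lagmatrix(fit)
```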

*Thanks to @hejseb for the link.