Multiple Linear Regression With Lags – Mechanical Differences From Time Series

arima, least squares, multiple regression, regression, time series

I'm a graduate from business and economics who's currently studying for a master's degree in data engineering. While studying linear regression (LR) and then time series analysis (TS), a question popped into my mind. Why create a whole new method, i.e., time series (ARIMA), instead of using multiple linear regression and adding lagged variables to it (with the order of lags determined using ACF and PACF)? So the teacher suggested that I write a little essay about the issue. I wouldn't come to look for help empty-handed, so I did my research on the topic.

I already knew that when using LR, if the Gauss-Markov assumptions are violated, the OLS regression is incorrect, and that this happens when using time series data (autocorrelation, etc.). (Another question on this: one G-M assumption is that the independent variables should be normally distributed? or just the dependent variable conditional to the independent ones?)

I also know that when using a distributed lag regression, which is what I think I'm proposing here, and using OLS to estimate parameters, multicollinearity between variables may (obviously) arise, so estimates would be wrong.

In a similar post about TS and LR here, @IrishStat said:

… a regression model is a particular case of a Transfer Function Model also known as a dynamic regression model or an XARMAX model. The salient point is that model identification in time series i.e. the appropriate differences, the appropriate lags of the X's, the appropriate ARIMA structure, the appropriate identification of unspecified deterministic structure such as Pulses, level Shifts, Local time trends, Seasonal Pulses, and incorporation of changes in parameters or error variance must be considered.

(I also read his paper in Autobox about Box Jenkins vs LR.) But this still does not resolve my question (or at least it doesn't clarify the different mechanics of LR and TS for me).

It is obvious that even with lagged variables OLS problems arise and it is not efficient nor correct, but when using maximum likelihood, do these problems persist? I have read that ARIMA is estimated through maximum likelihood, so if the LR with lags is estimated with ML instead of OLS, does it yield the "correct" coefficients? (Let's assume that we include lagged error terms as well, like an MA of order q.)

In short, is the problem OLS? Is the problem solved by applying ML?

Best Answer

Why create a whole new method, i.e., time series (ARIMA), instead of using multiple linear regression and adding lagged variables to it (with the order of lags determined using ACF and PACF)?

One immediate point is that a linear regression only works with observed variables, while ARIMA incorporates unobserved variables in the moving average part; thus, ARIMA is more flexible, or more general, in a way. An AR model can be seen as a linear regression model, and its coefficients can be estimated using OLS: $\hat\beta_{OLS}=(X'X)^{-1}X'y$, where $X$ consists of lags of the dependent variable, which are observed. Meanwhile, MA and ARMA models do not fit into the OLS framework, since some of the variables, namely the lagged error terms, are unobserved, and hence the OLS estimator is infeasible.
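To make the AR case concrete, here is a minimal sketch in plain Python, assuming a zero-mean AR(1) process with standard normal errors. The function names, the sample size, and the true coefficient 0.6 are illustrative choices, not part of the question. With a single lagged regressor and no intercept, $(X'X)^{-1}X'y$ reduces to a ratio of two sums.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal errors (illustrative)."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

def ols_ar1(y):
    """OLS slope of y_t on y_{t-1}, no intercept: (x'x)^{-1} x'y for one regressor."""
    x = y[:-1]        # regressor: the observed lagged values
    target = y[1:]    # response: the current values
    sxy = sum(a * b for a, b in zip(x, target))
    sxx = sum(a * a for a in x)
    return sxy / sxx

y = simulate_ar1(phi=0.6, n=5000)
phi_hat = ols_ar1(y)
print(round(phi_hat, 3))  # should land near the true value 0.6
```

The regressor here is fully observed, which is why the closed-form estimator applies. For an MA term the corresponding column of $X$ would contain past errors, which are not in the data.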

one G-M assumption is that the independent variables should be normally distributed? or just the dependent variable conditional to the independent ones?

The normality assumption is sometimes invoked for model errors, not for the independent variables. However, normality is required neither for the consistency and efficiency (in the BLUE sense) of the OLS estimator nor for the Gauss-Markov theorem to hold. The Wikipedia article on the Gauss-Markov theorem states explicitly that "The errors do not need to be normal".

multicollinearity between variables may (obviously) arise, so estimates would be wrong.

A high degree of multicollinearity means inflated variance of the OLS estimator. However, the OLS estimator is still BLUE as long as the multicollinearity is not perfect. Thus your statement does not look right.
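The variance inflation can be written down explicitly. As a sketch, assuming a regression with an intercept and homoskedastic, uncorrelated errors with variance $\sigma^2$ (the symbols $R_j^2$ and $x_{ij}$ are introduced here for illustration and do not appear in the question), the variance of the $j$-th OLS coefficient is

$$
\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{(1 - R_j^2)\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2},
$$

where $R_j^2$ is the $R^2$ from regressing the $j$-th regressor on all the other regressors. As $R_j^2$ approaches 1 the variance grows without bound, but for any $R_j^2 < 1$ the estimator remains unbiased. With lagged regressors, adjacent lags of a persistent series tend to be highly correlated, so $R_j^2$ can be large and the individual lag coefficients imprecise.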

It is obvious that even with lagged variables OLS problems arise and it is not efficient nor correct, but when using maximum likelihood, do these problems persist?

An AR model can be estimated using either OLS or ML; both of these methods give consistent estimators. MA and ARMA models cannot be estimated by OLS, so ML is the main choice; again, it is consistent. The other interesting property is efficiency, and here I am not completely sure (but clearly the information should be available somewhere, as the question is pretty standard). I would try commenting on "correctness", but I am not sure what you mean by that.
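To illustrate why the MA case needs an iterative method, here is a minimal sketch in plain Python, assuming a zero-mean MA(1) process. It uses the conditional sum of squares (CSS) with the pre-sample error set to zero. Under Gaussian errors, minimizing CSS is equivalent to maximizing the likelihood conditional on that starting value, so it is an approximation to full ML rather than the exact procedure any particular ARIMA routine uses. The function names, the grid search, and the true value 0.5 are illustrative assumptions.

```python
import random

def simulate_ma1(theta, n, seed=1):
    """Simulate y_t = e_t + theta * e_{t-1} with standard normal errors (illustrative)."""
    rng = random.Random(seed)
    e_prev = rng.gauss(0.0, 1.0)
    y = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        y.append(e + theta * e_prev)
        e_prev = e
    return y

def css(theta, y):
    """Conditional sum of squares: rebuild the errors recursively, assuming e_0 = 0."""
    e_prev, total = 0.0, 0.0
    for obs in y:
        e = obs - theta * e_prev   # the implied error depends on theta itself
        total += e * e
        e_prev = e
    return total

def fit_ma1(y):
    """Grid search over the invertible region (-1, 1) for the CSS-minimizing theta."""
    grid = [i * 0.01 for i in range(-99, 100)]
    return min(grid, key=lambda th: css(th, y))

y_ma = simulate_ma1(theta=0.5, n=2000)
theta_hat = fit_ma1(y_ma)
print(round(theta_hat, 2))  # should land near the true value 0.5
```

The lagged errors are not data: they have to be reconstructed from a candidate `theta`, so the objective is nonlinear in the parameter and there is no fixed design matrix to plug into $(X'X)^{-1}X'y$. A grid search is used only to keep the sketch dependency-free; a numerical optimizer would be the usual choice.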