I'm in the middle of reading the book Forecasting: Principles and Practice and, to understand it better, I'm simultaneously trying to code everything "by hand".
I've found something that I cannot explain:
- Given a time series generated by an AR(2) process,
- I try to model it as a linear regression problem.
- From what I've read so far, it should be perfectly possible to find matching coefficients using both methods.
Unfortunately, I get different results. The question is: why?
An AR(p) model is described by the equation:
$$
y_t = \epsilon_t + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p}
$$
I lagged the time series by 1 and by 2 and created a classical data frame with three columns:
- current x
- x at t-1
- x at t-2
and performed a regression.
R code example below:
library(xts)

x <- arima.sim(model = list(order = c(2, 0, 0), ar = c(0.6, 0.3)), n = 100)
x.ts <- as.xts(x)

# Build the lagged data frame (drop the first two rows, which hold NAs from lagging)
x.lag1 <- lag(x.ts, 1)
x.lag2 <- lag(x.ts, 2)
x.df <- data.frame(
  x.curr = x.ts[-c(1:2)],
  x.lag1 = x.lag1[-c(1:2)],
  x.lag2 = x.lag2[-c(1:2)]
)

# Fit regression
fit.x <- lm(x.curr ~ ., data = x.df)
summary(fit.x)

# Fit ARIMA
arima(x = x.ts, order = c(2, 0, 0), include.mean = FALSE)
The results are noticeably different:
# Regression results
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08173 0.11102 0.736 0.463
x.lag1 0.54860 0.10082 5.441 4.11e-07 ***
x.lag2 0.15724 0.09883 1.591 0.115
# Arima results
ar1 ar2 intercept
0.5919 0.1645 0.5572
s.e. 0.0983 0.1009 0.4268
The coefficients are similar, though not identical. My questions:
- Why is that? Is it because of the random error component included in every observation of an AR(p) process?
- How could I model an ARMA process that requires differencing (non-stationary) using regression? Should I simply fit a regression model to the differenced-and-lagged time series?
Best Answer
There are a few reasons. For one, your ARMA model doesn't include a mean/intercept. For another, `arima` by default uses the sum of squares only to find starting points for an iterative maximum-likelihood scheme. Least-squares regression on the lagged data (which throws away the early data points) is what is usually called conditional sum of squares (CSS) in time series.
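You can see the estimation-method difference directly by restricting `arima` to the conditional-sum-of-squares criterion, which is essentially what the lagged-data `lm` fit minimizes. A minimal sketch, using a fresh simulated series (so the numbers won't match the output above; the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed, just for reproducibility
x <- arima.sim(model = list(order = c(2, 0, 0), ar = c(0.6, 0.3)), n = 100)
n <- length(x)

# Lagged-data regression: conditions on the first two observations
df <- data.frame(x.curr = x[3:n], x.lag1 = x[2:(n - 1)], x.lag2 = x[1:(n - 2)])
fit.lm <- lm(x.curr ~ ., data = df)

# arima restricted to the conditional-sum-of-squares criterion
fit.css <- arima(x, order = c(2, 0, 0), method = "CSS")

# Default-style arima: full maximum likelihood
fit.ml <- arima(x, order = c(2, 0, 0), method = "ML")

coef(fit.lm)[c("x.lag1", "x.lag2")]
coef(fit.css)[c("ar1", "ar2")]  # should be very close to the lm estimates
coef(fit.ml)[c("ar1", "ar2")]   # typically drifts a bit further away
```

The CSS fit and the regression minimize the same objective (up to the intercept/mean reparametrization discussed below the fold), so their AR coefficients should agree to optimizer tolerance, while the ML fit generally differs a little more.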
These should match up

You'll notice that there's a difference between `lm`'s intercept and `arima`'s mean. The relationship is that the intercept equals the mean times $(1 - \phi_1 - \phi_2)$. You can verify that this works.

Also, and this makes everything much more confusing, the `arima` function will call its mean the intercept. This is a well-known issue covered in other questions, and is also explained elsewhere.

One more thing: your description of an AR(p) model is only true if you're looking at mean-zero AR models. In general you can write it as
$$
(1 - \phi_1 B - \cdots - \phi_p B^p)(X_t - \mu) = \epsilon_t
$$
where $\mu$ is the mean, or
$$
(1 - \phi_1 B - \cdots - \phi_p B^p) X_t = c + \epsilon_t
$$
where $c$ is the intercept. This will help you with the intercept/mean dilemma above.
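You can check the intercept/mean relationship numerically. A sketch with a fresh simulated series (the two fits use different estimation methods, so the reconstructed intercept will be close to, not exactly equal to, `lm`'s):

```r
set.seed(7)  # arbitrary seed for reproducibility
x <- arima.sim(model = list(order = c(2, 0, 0), ar = c(0.6, 0.3)), n = 500)
n <- length(x)

# lm reports the intercept c directly
df <- data.frame(x.curr = x[3:n], x.lag1 = x[2:(n - 1)], x.lag2 = x[1:(n - 2)])
c.hat <- unname(coef(lm(x.curr ~ ., data = df))["(Intercept)"])

# arima estimates the mean mu, but labels it "intercept" in its output
fit.ar <- arima(x, order = c(2, 0, 0))
mu.hat  <- unname(coef(fit.ar)["intercept"])
phi.hat <- coef(fit.ar)[c("ar1", "ar2")]

# The relationship: c = mu * (1 - phi1 - phi2)
c.hat
mu.hat * (1 - sum(phi.hat))
```

With a reasonably long series the two numbers should land very close together, which confirms that `lm` and `arima` are reporting different parametrizations of the same quantity.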
Finally, regarding your last question: you can either difference your nonstationary series yourself, for instance with `diff` in R, or let `arima` do it by changing the `order` argument in your call. For example, fitting an AR(3) to differenced data is the same as fitting an ARIMA(3,1,0), and so would require the parameter `c(3,1,0)`.
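As a quick check that the two routes agree, here's a sketch: simulate an integrated series, then fit an AR(3) to the hand-differenced data and an ARIMA(3,1,0) to the original series. (Note that `arima` drops the mean automatically when d > 0, so `include.mean = FALSE` on the differenced fit makes the two models identical; the seed and coefficients are arbitrary choices.)

```r
set.seed(11)  # arbitrary seed for reproducibility
# An ARIMA(3,1,0) series: an AR(3) in the first differences
x <- arima.sim(model = list(order = c(3, 1, 0), ar = c(0.4, 0.2, 0.1)), n = 200)

# Route 1: difference by hand, then fit an AR(3) to the differences
fit.by.hand <- arima(diff(x), order = c(3, 0, 0), include.mean = FALSE)

# Route 2: let arima difference internally via the d slot of order
fit.builtin <- arima(x, order = c(3, 1, 0))

coef(fit.by.hand)
coef(fit.builtin)  # same model either way, so the estimates coincide
```

Internally, `arima` differences the series d times and then fits the ARMA part, which is why the two routes produce the same coefficients.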