Solved – R intercept in arima with xreg

rtime series

I am trying to understand what the reported intercept is showing when I use arima() with xreg=. The documentation says

"If am xreg term is included, a linear regression (with a constant term if include.mean is true and there is no differencing) is fitted with an ARMA model for the error term."

Thus I expect the intercept shown to come from the regression using xreg= as the X variables, before any arima model is done on those residuals.

However I tried to double check this by actually doing the regression with lm() and the intercept from that does not match what is reported from arima() (although the slope coefficient is pretty close).

Here is my example:

set.seed(456)
v = rnorm(100,1,1)
x = cumsum(v)  ; x = as.xts(ts(x)) 

# Fit AR(1) after taking out a time trend (aka, drift)
model5 = arima(x, order=c(1,0,0), xreg=1:length(x), include.mean=TRUE)
# Coefficients:
#         ar1     intercept  1:length(x)
#       0.8995     0.8815       1.1113
# s.e.  0.0422     1.6193       0.0265


# Double check
MyTime = 1:length(x)
model5_Part1 = lm(x ~ MyTime )
# Coefficients:
#      (Intercept)       MyTime  
#         1.856           1.096

The intercepts do not match, thus I do not know what the intercept is showing from the arima with xreg.

Note the example shown is based on "Issue 2" shown here http://www.stat.pitt.edu/stoffer/tsa3/Rissues.htm

Also note that this isn't a problem particular to modeling drift. Here is another example, where in addition to the intercept not matching, even the slope coefficient on the xreg= variable doesn't match what is shown from using lm(). This example has nothing to do with drift and uses the cars dataset as if it were time series data.

data(cars)
cars = as.xts(ts(cars, start=c(1980,1), freq=12))
model6 = arima(cars$speed, xreg=cars$dist, order=c(1,0,0), include.mean=TRUE)
# Coefficients:
#         ar1    intercept   dist
#       0.9979    15.2890  -0.0172
# s.e.  0.0030    10.5452   0.0055

model6_Part1 = lm(cars$speed ~ cars$dist)
# Coefficients:
#      (Intercept)    cars$dist  
#        8.2839        0.1656 

Intercepts do not match, slope coefficient does not match.

Best Answer

The idea behind the differences that you are finding in your modelling is the following.

Suppose you have a left-hand-side variable $y$ and a regressor $x$. Suppose also that the variables are stationary (for simplicity). Then you have the following equations:

(1a) $y_t=\alpha_0+\alpha_1x_t+u_t$
(1b) $u_t=\psi_1u_{t-1}+...+\psi_pu_{t-p}+v_t+\zeta_1v_{t-1}+...+\zeta_qv_{t-q}$

There are two ways you can proceed.

  • Case A: you estimate equation (1a) first, obtain the residuals, then use them to estimate equation (1b).
  • Case B: you estimate both equations simultaneously.

In case A, in the first stage you implicitly make an assumption that the residuals in equation (1a) are independent across time. Then in the second stage you revoke this assumption and model the residuals from equation (1a) as an ARMA process given by equation (1b). (So there is an intrinsic contradiction between the assumptions in stage 1 and stage 2 that shows the weakness of this approach.)

In case B, you assume that the residuals from equation (1a) follow an ARMA process as given in equation (1b) and use this information when estimating equation (1a).

Thus there is a difference between the two cases; no wonder you get different results in your modelling where I think you are indeed comparing case A and case B.

I think you incorrectly assume that "the intercept shown to come from the regression using xreg= as the X variables, before any arima model is done on those residuals". This is very close to the truth, but not exactly true; you miss the fact that the function arima() uses the assumption given by equation (1b) when estimating equation (1a).

Thus the difference that you are finding in your modelling is supposed to be there.

There is also a case C:

(2) $(y_t-\mu_y)=\phi_0+\phi_1(y_{t-1}-\mu_y)+...+\phi_p(y_{t-p}-\mu_y)+\epsilon_t+\theta_1\epsilon_{t-1}+...+\theta_q\epsilon_{t-q}+\beta_1x_t$

where $\mu_y$ is the mean of $y$. Not sure, but perhaps one should also subtract the mean of $x$ to make the last term $\beta_1(x_t-\mu_x)$ instead of just $\beta_1x_t$.

Actually, case C seems more natural to me than case B given how we specify the arguments in arima(). Of course, this is subjective judgement. However, the interpretation of the coefficient $\beta_1$ in case C is much more difficult than in case B.

A nice related piece by prof. Hyndman can be found here. I borrowed some ideas from there when answering this post, so I can only recommend reading the original piece!

Disclaimer: I am not an expert on this topic, but I hope I got the idea right. If not, corrections are mostly welcome!