I'm looking for an example of time series data where a regression of y~x has autocorrelation in the residuals that leads to misinterpreting the model. This is for a class demonstration where I would ideally show them the results of an OLS model and the results of a GLS model with an appropriate correlation structure and explain the differences in the coefficients and their standard errors. Any examples out there that folks have used?
Solved – An example of autocorrelation in residuals causing misinterpretation
autocorrelation, generalized-least-squares, least squares, time series
Related Solutions
Fundamentally, it's the difference between sequential and joint estimation of the parameters. If you estimate $\{\beta_0, \beta_1, \rho\}$ jointly, you'll (almost always) get a different set of estimates for all three parameters than if you estimate $\{\beta_0, \beta_1 \mid \rho=0\}$ and then estimate $\rho$ from the residuals, which is what sequential estimation does. By not taking the estimate of $\rho$ into account when estimating $\beta$, you lose efficiency, which propagates into the residuals, which in turn affects the quality of the estimate of $\rho$ based on those residuals.
Note that if $\rho$ actually is 0, or very close to it, and your sample size is small(ish), the sequential approach might actually be better, because it gets rid of the effect of randomness in the estimate of $\rho$ on the estimates of $\beta$. But if you knew you were in that situation, you'd probably be better off still just setting $\hat{\rho} = 0$ and going with OLS to estimate $\beta$. The trouble is, you don't know, unless you have some external information, so sticking with maximum likelihood or some other joint estimation methodology is still your best bet.
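Here's a minimal R sketch of the two approaches on simulated data (the variable names, seed, and AR coefficient are mine, purely for illustration):

library(nlme)   # for gls() with an AR(1) correlation structure

set.seed(1)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.6), n = n))   # AR(1) errors
y <- 1 + 2 * x + e

# Sequential: estimate beta by OLS, then rho from the OLS residuals
fit.ols <- lm(y ~ x)
rho.seq <- acf(resid(fit.ols), plot = FALSE)$acf[2]  # lag-1 autocorrelation

# Joint: estimate beta and rho together by maximum likelihood
fit.gls <- gls(y ~ x, correlation = corAR1(), method = "ML")

coef(fit.ols); rho.seq   # sequential estimates
summary(fit.gls)         # joint ML estimates (Phi is the AR(1) parameter)

With a moderate sample size the joint estimates of $\beta$ and $\rho$ will typically differ somewhat from the sequential ones, and the GLS standard errors for $\beta$ reflect the estimated autocorrelation.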
The residuals from gls will indeed have the same autocorrelation structure, but that does not mean the coefficient estimates and their standard errors have not been adjusted appropriately. (There's obviously no requirement that $\Omega$ be diagonal, either.) This is because the residuals are defined as $e = Y - X\hat{\beta}^{\text{GLS}}$. If the covariance matrix of $e$ were equal to $\sigma^2\text{I}$, there would be no need to use GLS!
In short, you haven't done anything wrong, there's no need to adjust the residuals, and the routines are all working correctly.
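A quick way to see this in R (simulated data and variable names of my own choosing): the raw residuals from a gls fit with an AR(1) structure remain autocorrelated, while nlme's "normalized" residuals, which have the estimated correlation structure removed, look approximately white.

library(nlme)

set.seed(2)
x <- rnorm(300)
y <- 1 + x + as.numeric(arima.sim(list(ar = 0.5), n = 300))
fit <- gls(y ~ x, correlation = corAR1(), method = "ML")

# Raw residuals e = Y - X %*% beta_GLS keep the AR(1) pattern, as expected
acf(resid(fit, type = "response"))
# "Normalized" residuals have the estimated structure removed
acf(resid(fit, type = "normalized"))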
predict.gls does take the structure of the covariance matrix into account when forming standard errors of the prediction vector. However, it doesn't have the convenient "predict a few observations ahead" feature of predict.Arima, which takes into account the relevant residuals at the end of the data series and the structure of the residuals when generating predicted values. arima has the ability to incorporate a matrix of predictors in the estimation, so if you're interested in predicting a few steps ahead, it may be a better choice.
EDIT: Prompted by a comment from Michael Chernick (+1), I'm adding an example comparing GLS with ARMAX (arima) results, showing that coefficient estimates, log likelihoods, etc. all come out the same, at least to four decimal places (a reasonable degree of accuracy given that two different algorithms are used):
# Generating data: AR(1) errors with coefficient 0.4, first 4000 values
# discarded as burn-in
library(nlme)   # provides gls() and corARMA()
eta <- rnorm(5000)
for (j in 2:5000) eta[j] <- eta[j] + 0.4*eta[j-1]
e <- eta[4001:5000]
x <- rnorm(1000)
y <- x + e
> summary(gls(y~x, correlation=corARMA(p=1), method='ML'))
Generalized least squares fit by maximum likelihood
Model: y ~ x
Data: NULL
AIC BIC logLik
2833.377 2853.008 -1412.688
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.4229375
Coefficients:
Value Std.Error t-value p-value
(Intercept) -0.0375764 0.05448021 -0.68973 0.4905
x 0.9730496 0.03011741 32.30854 0.0000
Correlation:
(Intr)
x -0.022
Standardized residuals:
Min Q1 Med Q3 Max
-2.97562731 -0.65969048 0.01350339 0.70718362 3.32913451
Residual standard error: 1.096575
Degrees of freedom: 1000 total; 998 residual
>
> arima(y, order=c(1,0,0), xreg=x)
Call:
arima(x = y, order = c(1, 0, 0), xreg = x)
Coefficients:
ar1 intercept x
0.4229 -0.0376 0.9730
s.e. 0.0287 0.0544 0.0301
sigma^2 estimated as 0.9874: log likelihood = -1412.69, aic = 2833.38
EDIT: Prompted by a comment from anand (OP), here's a comparison of predictions from gls and arima with the same basic data structure as above and some extraneous output lines removed:
df.est <- data.frame(list(y = y[1:995], x=x[1:995]))
df.pred <- data.frame(list(y=NA, x=x[996:1000]))
model.gls <- gls(y~x, correlation=corARMA(p=1), method='ML', data=df.est)
model.armax <- arima(df.est$y, order=c(1,0,0), xreg=df.est$x)
> predict(model.gls, newdata=df.pred)
[1] -0.3451556 -1.5085599 0.8999332 0.1125310 1.0966663
> predict(model.armax, n.ahead=5, newxreg=df.pred$x)$pred
[1] -0.79666213 -1.70825775 0.81159072 0.07344052 1.07935410
As we can see, the predicted values are different, although they converge as we move farther into the future. This is because gls doesn't treat the data as a time series and doesn't take the specific value of the residual at observation 995 into account when forming predictions, whereas arima does. The effect of the residual at observation 995 decreases as the forecast horizon increases, which is why the predicted values converge. Consequently, for short-term predictions of time series data, arima will be better.
Best Answer
Chapter 5 in this textbook uses two examples to illustrate how to test for and interpret autocorrelation in the residuals of the classical regression model. Below I summarize the idea of these examples.
Since you mention that this is for a class demonstration, I would recommend the software gretl. It contains several data sets from econometrics textbooks, as well as the tools needed to reproduce most of the examples in those textbooks. You can check the data sets that are available here. If you have any of these textbooks, you will probably find an example there that suits your purposes.
The first of the examples mentioned above uses economic time series data (interest and bond rates, available here). The authors explain that the presence of autocorrelation in the residuals of a static regression of changes in the bond rate on changes in Treasury Bill rates suggests that the model is not correctly specified. As often happens with economic data, some of the dynamics are not captured by a static model. A possible reason is that adjustment to deviations from an equilibrium relationship among the variables may take more than one period. Under this interpretation, a dynamic model with one lag of the dependent and explanatory variables is fitted and tested against the static model.
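In R, the same idea could be sketched roughly as follows (bond and tbill are placeholders for the two rate series; the textbook works in gretl, and bgtest from the lmtest package is just one convenient way to run the Breusch-Godfrey test):

library(lmtest)   # for bgtest()

d.bond  <- diff(bond)    # changes in the bond rate
d.tbill <- diff(tbill)   # changes in the Treasury Bill rate
n <- length(d.bond)

# Static model: autocorrelated residuals suggest misspecification
static <- lm(d.bond ~ d.tbill)
bgtest(static, order = 1)

# Dynamic model with one lag of the dependent and explanatory variables
dynamic <- lm(d.bond[-1] ~ d.tbill[-1] + d.bond[-n] + d.tbill[-n])
bgtest(dynamic, order = 1)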
The second example uses cross-section data (income and food consumption for 48 households of comparable size, available here). A model for food expenditure as a linear function of total consumption and average household size is fitted. The authors then show that the Breusch-Godfrey test suggests the presence of autocorrelation. As in the first example, this result is interpreted as a misspecification of the model. In the context of cross-section data, adding lags of variables does not make much sense; instead, the authors propose a non-linear relationship among the variables, which is fitted by non-linear least squares.
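A rough R sketch of that workflow (food, total, and size are placeholders for the household data, assumed to be ordered, e.g. by total consumption, so the residual ordering is meaningful; the power term is only an illustrative non-linear form, not necessarily the book's exact specification):

library(lmtest)   # for bgtest()

# Linear model: "autocorrelation" in its residuals is read as misspecification
linear <- lm(food ~ total + size)
bgtest(linear, order = 1)

# An illustrative non-linear alternative fitted by non-linear least squares
nonlin <- nls(food ~ a + b * total^g + c * size,
              start = list(a = 0, b = 1, g = 1, c = 1))
summary(nonlin)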