The residuals from gls will indeed have the same autocorrelation structure, but that does not mean the coefficient estimates and their standard errors have not been adjusted appropriately. (There's obviously no requirement that $\Omega$ be diagonal, either.) This is because the residuals are defined as $e = Y - X\hat{\beta}^{\text{GLS}}$. If the covariance matrix of $e$ were equal to $\sigma^2\text{I}$, there would be no need to use GLS!
In short, you haven't done anything wrong, there's no need to adjust the residuals, and the routines are all working correctly.
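If you'd like to see this directly, here's a small self-contained sketch (simulated data, not your actual data set; the AR(1) coefficient of 0.4 is arbitrary). The raw residuals from gls still show the AR(1) pattern, while the "normalized" residuals (whitened by the estimated correlation structure) look roughly like white noise:
library(nlme)
set.seed(1)
e <- as.numeric(arima.sim(list(ar = 0.4), n = 1000))   # AR(1) errors
x <- rnorm(1000)
y <- x + e
fit <- gls(y ~ x, correlation = corARMA(p = 1), method = 'ML')
acf(residuals(fit))                        # raw residuals: clear AR(1) decay
acf(residuals(fit, type = 'normalized'))   # whitened residuals: roughly flat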
predict.gls does take the structure of the covariance matrix into account when forming standard errors of the prediction vector. However, it doesn't have the convenient "predict a few observations ahead" feature of predict.Arima, which takes into account the relevant residuals at the end of the data series and the structure of the residuals when generating predicted values. arima has the ability to incorporate a matrix of predictors in the estimation, and if you're interested in predicting a few steps ahead, it may be a better choice.
EDIT: Prompted by a comment from Michael Chernick (+1), I'm adding an example comparing GLS with ARMAX (arima) results, showing that coefficient estimates, log likelihoods, etc. all come out the same, at least to four decimal places (a reasonable degree of accuracy given that two different algorithms are used):
# Generating data: AR(1) errors with coefficient 0.4, then y = x + e
library(nlme)                                  # for gls() and corARMA()
eta <- rnorm(5000)
for (j in 2:5000) eta[j] <- eta[j] + 0.4*eta[j-1]
e <- eta[4001:5000]                            # drop the burn-in, keep the last 1000 errors
x <- rnorm(1000)
y <- x + e
> summary(gls(y~x, correlation=corARMA(p=1), method='ML'))
Generalized least squares fit by maximum likelihood
Model: y ~ x
Data: NULL
AIC BIC logLik
2833.377 2853.008 -1412.688
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.4229375
Coefficients:
Value Std.Error t-value p-value
(Intercept) -0.0375764 0.05448021 -0.68973 0.4905
x 0.9730496 0.03011741 32.30854 0.0000
Correlation:
(Intr)
x -0.022
Standardized residuals:
Min Q1 Med Q3 Max
-2.97562731 -0.65969048 0.01350339 0.70718362 3.32913451
Residual standard error: 1.096575
Degrees of freedom: 1000 total; 998 residual
>
> arima(y, order=c(1,0,0), xreg=x)
Call:
arima(x = y, order = c(1, 0, 0), xreg = x)
Coefficients:
ar1 intercept x
0.4229 -0.0376 0.9730
s.e. 0.0287 0.0544 0.0301
sigma^2 estimated as 0.9874: log likelihood = -1412.69, aic = 2833.38
EDIT: Prompted by a comment from anand (OP), here's a comparison of predictions from gls and arima with the same basic data structure as above and some extraneous output lines removed:
df.est <- data.frame(list(y = y[1:995], x = x[1:995]))    # estimation sample
df.pred <- data.frame(list(y = NA, x = x[996:1000]))      # last 5 observations held out for prediction
model.gls <- gls(y~x, correlation=corARMA(p=1), method='ML', data=df.est)
model.armax <- arima(df.est$y, order=c(1,0,0), xreg=df.est$x)
> predict(model.gls, newdata=df.pred)
[1] -0.3451556 -1.5085599 0.8999332 0.1125310 1.0966663
> predict(model.armax, n.ahead=5, newxreg=df.pred$x)$pred
[1] -0.79666213 -1.70825775 0.81159072 0.07344052 1.07935410
As we can see, the predicted values are different, although they converge as we move farther into the future. This is because gls doesn't treat the data as a time series, so it does not take the specific value of the residual at observation 995 into account when forming predictions, whereas arima does. The effect of the residual at observation 995 dies out as the forecast horizon increases, which is why the predicted values converge. Consequently, for short-term predictions of time series data, arima will be better.
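To make the convergence concrete, here's a rough sketch (using the objects from the code above): for an AR(1) error structure, the ARMAX forecast at horizon h is the regression prediction plus $\phi^h$ times the regression residual at observation 995, so the second row below should track the arima predictions closely (only approximately, since the gls and arima coefficient estimates differ slightly):
cf <- coef(model.armax)
e.995 <- df.est$y[995] - (cf['intercept'] + cf['x'] * df.est$x[995])   # last regression residual
rbind(armax = as.numeric(predict(model.armax, n.ahead = 5, newxreg = df.pred$x)$pred),
      gls.plus.ar1 = predict(model.gls, newdata = df.pred) + cf['ar1']^(1:5) * e.995)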
Pretending to know the true variance is always a funny thing to see. I would never trust that sort of assumption unless it came from a carefully designed experiment in which an artificial correlation structure was imposed (although frankly I cannot really imagine a practical setup that would lead to this).
From the GEE perspective, you should be able to get more efficient estimates when your assumption about $\bf V$ is correct, as compared to OLS, but you would still want to use the sandwich variance estimator in the (highly unlikely) case that you are mistaken about the covariances. Usually, robustness to model assumptions is considered a greater issue than efficiency, unless you have really tiny sample sizes, where a 20% efficiency gain is a big deal. So I would fit this with GLS (or feasible GLS, if you only know the structure of $\bf V$ but not the specific parameter values), but still correct for clustering using the sandwich variance estimator.
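For what it's worth, here's a minimal sketch of that approach using geepack; the data, cluster sizes, and variable names are all made up for illustration. The exchangeable working correlation is what buys the efficiency, and the standard errors reported by geeglm are sandwich-based, so they remain valid even if the working correlation is wrong:
library(geepack)
set.seed(1)
cluster <- rep(1:50, each = 10)            # 50 clusters of 10 observations
x <- rnorm(500)
u <- rnorm(50)[cluster]                    # shared within-cluster effect
y <- x + u + rnorm(500)
dat <- data.frame(y, x, cluster)
# Point estimates use the exchangeable working correlation; the reported
# standard errors are sandwich (cluster-robust) by default.
fit <- geeglm(y ~ x, id = cluster, corstr = "exchangeable", data = dat)
summary(fit)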
Best Answer
As you pointed out yourself, in general there is not enough data available to estimate the covariance $\Sigma$. One could imagine a situation with lots of dense data where some binning technique would yield enough information to approximate $\Sigma$, but this is not possible in most cases.
So the whole point of GLS in the general situation is to incorporate expert knowledge, such as (as you mentioned) information about time-series properties like an autoregressive error structure. This is also pointed out on the Wikipedia page you cite, in the section on Feasible generalized least squares.
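As a concrete illustration, here's a minimal two-step feasible GLS sketch in the Cochrane–Orcutt style, where the "expert knowledge" is simply that the errors follow an AR(1) process; the simulated data and the value 0.5 are purely illustrative:
set.seed(1)
e <- as.numeric(arima.sim(list(ar = 0.5), n = 200))   # AR(1) errors
x <- rnorm(200)
y <- 1 + 2*x + e
ols <- lm(y ~ x)                                      # step 1: OLS
r <- residuals(ols)
phi <- cor(r[-1], r[-length(r)])                      # estimate the AR(1) parameter from the residuals
yt <- y[-1] - phi * y[-200]                           # step 2: quasi-difference the data
xt <- x[-1] - phi * x[-200]
fgls <- lm(yt ~ xt)                                   # OLS on transformed data = feasible GLS
c(coef(fgls)[1] / (1 - phi), coef(fgls)[2])           # recover the original intercept and slope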