Generalized Least Squares – Estimating Residual Correlations

Tags: generalized-linear-model, generalized-least-squares

In a typical linear regression model, we assume that the errors/residuals of the predictions are i.i.d. and follow a normal distribution whose variance is the same for all observations.

I.e. if we have features/covariates $\mathbf{X}_{m \times n}$, a target/response variable $\mathbf{y}_{m}$ and fitted coefficients $\mathbf{\beta}_{n}$, then:
$$
(\mathbf{X} \mathbf{\beta} - \mathbf{y}) \sim N(\mathbf{0}_{m}, \sigma^2 \mathbf{I}_{m})
$$

If, however, the residuals are correlated and/or each have a different variance, i.e.:
$$
(\mathbf{X} \mathbf{\beta} - \mathbf{y}) \sim N(\mathbf{0}_{m}, \mathbf{\Sigma}_{m \times m})
$$

then coefficients estimated by simply minimizing squared errors will have undesirable properties (they remain unbiased, but they are no longer efficient and the usual standard errors are invalid). Better coefficients may instead be obtained by "generalized least squares" (GLS), which takes those correlations and differing variances into account. Per Wikipedia, GLS obtains the coefficients $\mathbf{\beta}_{\text{GLS}}$ as:
$$
\mathbf{\beta}_{\text{GLS}} =
( \mathbf{X}^T \mathbf{\Sigma}^{-1} \mathbf{X} )^{-1}
\mathbf{X}^T \mathbf{\Sigma}^{-1} \mathbf{y}
$$
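Assuming $\mathbf{\Sigma}$ is known, the formula above is straightforward to compute. A minimal NumPy sketch (the simulated data and the AR(1)-style choice of $\mathbf{\Sigma}$ are purely illustrative), using linear solves rather than explicit matrix inverses for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with a known residual covariance Sigma.
m, n = 50, 3
X = rng.normal(size=(m, n))
beta_true = np.array([1.0, -2.0, 0.5])

# An AR(1)-style covariance, just to have a concrete non-diagonal Sigma:
# Sigma[i, j] = rho^|i - j|.
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))

# Draw correlated errors via the Cholesky factor of Sigma.
L = np.linalg.cholesky(Sigma)
y = X @ beta_true + L @ rng.normal(size=m)

# beta_GLS = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y,
# computed with solves instead of forming Sigma^{-1} explicitly.
Sigma_inv_X = np.linalg.solve(Sigma, X)
Sigma_inv_y = np.linalg.solve(Sigma, y)
beta_gls = np.linalg.solve(X.T @ Sigma_inv_X, X.T @ Sigma_inv_y)
print(beta_gls)
```

Equivalently, one can pre-whiten the data with the Cholesky factor of $\mathbf{\Sigma}$ and run ordinary least squares on the transformed problem; the result is identical.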


Now my question is: assuming we don't know the explicit relationships between residuals beforehand, and that the data will not follow some obvious sparsity pattern like autoregressive data, how can such a correlation or covariance matrix $\mathbf{\Sigma}$ be obtained? From a logical POV, that kind of matrix would need more parameters than the data can provide: a symmetric $m \times m$ covariance matrix has $m(m+1)/2$ free parameters, but we only have $m$ observations. Is there a formula that would impose some restrictions on it? I guess in theory it could be bootstrapped, but I doubt a bootstrap estimate would do unless the number of resamples is larger than what's reasonable to run on a computer.

Best Answer

As you pointed out yourself, in general there is not enough data available to estimate the covariance matrix $\Sigma$, which has more free parameters than there are observations. One could imagine situations with lots of dense data where some binning technique would yield enough data to approximate $\Sigma$, but in most cases this is not possible.

So the whole point of GLS in the general situation is to incorporate expert knowledge, for example, as you mention, time-series structure such as autoregressive errors, which restricts $\Sigma$ to a low-dimensional parametric family that can actually be estimated. This is also pointed out on the Wikipedia page you cite, in the section on feasible generalized least squares (FGLS): first fit by OLS, estimate the parameters of the assumed error structure from the residuals, and then refit with GLS.
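The FGLS idea can be sketched as follows for an assumed AR(1) error structure. This is a minimal illustration, not the only way to do it (statsmodels' `GLSAR`, for instance, packages the same iteration); the simulated data and all variable names are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a regression with AR(1) errors. The AR(1) assumption is the
# "expert knowledge": it reduces Sigma to a single parameter rho.
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
beta_true = np.array([0.5, 2.0])
rho_true = 0.7
e = np.zeros(m)
for t in range(1, m):
    e[t] = rho_true * e[t - 1] + rng.normal()
y = X @ beta_true + e

# Step 1: preliminary OLS fit to obtain residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols

# Step 2: estimate rho from the lag-1 autocorrelation of the residuals.
rho_hat = (r[1:] @ r[:-1]) / (r[:-1] @ r[:-1])

# Step 3: build the implied covariance Sigma[i, j] = rho_hat^|i - j|
# and refit with GLS using that estimated Sigma.
Sigma = rho_hat ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
Sigma_inv_X = np.linalg.solve(Sigma, X)
Sigma_inv_y = np.linalg.solve(Sigma, y)
beta_fgls = np.linalg.solve(X.T @ Sigma_inv_X, X.T @ Sigma_inv_y)
print(rho_hat, beta_fgls)
```

Steps 1–3 can also be iterated (re-estimating rho from the FGLS residuals) until convergence. Without some such structural assumption, step 3 has nothing to constrain the $m(m+1)/2$ entries of $\Sigma$.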