Linear Regression – Correlation of Error Terms Explained

linear regression, probability, standard error, statistics, variance

I would like to ask for an interpretation, both mathematical and intuitive if possible, of the homoscedasticity assumption on the error variance in linear regression models.

If there is correlation among the error terms, how would it affect the estimated standard errors of the regression coefficients $\beta_i$ and the confidence and prediction intervals (if we were to keep the usual error assumptions and run the linear regression model anyway)? How do these compare to the true standard errors, which depend on $\mathrm{Var}(\epsilon)$: do we underestimate or overestimate them, and why?

My question arises from the section "Correlation of Error Terms" in the book "An Introduction to Statistical Learning". It reads as follows:

An important assumption of the linear regression model is that the error terms, $\epsilon_1, \epsilon_2, …, \epsilon_n$, are uncorrelated. What does this mean? For instance, if the errors are uncorrelated, then the fact that $\epsilon_i$ is positive provides little or no information about the sign of $\epsilon_{i+1}$. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95 % confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.
As an extreme example, suppose we accidentally doubled our data, leading to observations and error terms identical in pairs. If we ignored this, our standard error calculations would be as if we had a sample of size $2n$, when in fact we have only n samples. Our estimated parameters would be the same for the $2n$ samples as for the $n$ samples, but the confidence intervals would be narrower by a factor of $\sqrt2$!

I hope my question is clear. Many thanks in advance for sharing your insights on the question!

Best Answer

The given example is actually very good. A precise (rigorous) answer depends on the correlation structure: statistical inference rests on the model assumptions, and a severe violation leads to very unreliable inference. E.g., take the overall model significance test, $$ F_1 = MSreg_{(1)}/MSres_{(1)} = \frac{SSreg_{(1)}/p}{SSres_{(1)}/(n-p-1)}, $$
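If you want to see this concretely, here is a minimal sketch in Python (numpy and statsmodels, on simulated data; the data-generating process and variable names are purely illustrative assumptions, not part of the original discussion) that assembles this $F$ statistic from the two sums of squares and checks it against the value reported by the fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 3                                   # n observations, p predictors

X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, -0.7])
y = 2.0 + X @ beta + rng.normal(scale=1.5, size=n)   # i.i.d. (uncorrelated) errors

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Assemble the overall F test from the two sums of squares.
ss_reg = fit.ess                                # regression (explained) sum of squares
ss_res = fit.ssr                                # residual sum of squares
F1 = (ss_reg / p) / (ss_res / (n - p - 1))

print(F1, fit.fvalue)                           # the two values agree
```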
where $p$ is the number of predictors and $n$ the number of observations. If you just copy-paste your data, it clearly adds no new information: the new $n$ observations are merely copies of the existing ones, which violates the assumption of independent/uncorrelated realizations. Because the least-squares fit (and hence the fitted values and residuals) is identical for the doubled data set, both sums of squares simply double, $$ SSres_{(2)} = 2\, SSres_{(1)}, \quad SSreg_{(2)} = 2\, SSreg_{(1)}. $$ Hence the new $F$ statistic is $$ F_2 = MSreg_{(2)}/MSres_{(2)} = \frac{2\, SSreg_{(1)}/p}{2\, SSres_{(1)}/(2n-p-1)} = \frac{2n - p - 1}{n - p - 1}\, F_1 , $$
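Continuing the same illustrative sketch, copy-pasting the rows reproduces exactly this behaviour: the coefficients do not move, both sums of squares double, and $F$ is inflated by precisely the factor $(2n-p-1)/(n-p-1)$:

```python
# Copy-paste the data: observations and error terms identical in pairs.
X2 = np.vstack([X, X])
y2 = np.concatenate([y, y])
fit2 = sm.OLS(y2, sm.add_constant(X2)).fit()

print(np.allclose(fit2.params, fit.params))     # True: identical coefficients
print(fit2.ess / fit.ess, fit2.ssr / fit.ssr)   # both ratios equal 2
print(fit2.fvalue / fit.fvalue)                 # (2n - p - 1) / (n - p - 1), i.e. almost 2
print(fit.f_pvalue, fit2.f_pvalue)              # the p-value shrinks much further
```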
i.e., $$ \frac{F_2}{F_1} = \frac{2n - p - 1}{n - p - 1} \approx 2 \quad \text{for large } n, $$ namely, for a reasonably large sample size you almost double the $F$ statistic, which shrinks the p-value far more dramatically; you can thus erroneously infer that your new model is "very significant" when in fact it has nothing to add over the first one.

This happens because the uncertainty of the fitted model is understated. The unbiased estimator of the error variance for the first model is $$ \hat{\sigma}_1^2 = \frac{SSres_{(1)}}{n - p - 1}, $$ while for the doubled data you have $$ \hat{\sigma}_2^2 = \frac{SSres_{(2)}}{2n - p - 1} = \frac{2(n - p - 1)}{2n - p - 1}\, \hat{\sigma}_1^2, $$ which is essentially unchanged. However, the estimated covariance of the coefficients, $\hat{\sigma}^2 (X^T X)^{-1}$, is now computed as if you had $2n$ independent observations ($X^T X$ doubles), so every standard error shrinks by roughly $\sqrt{2}$ and the confidence intervals narrow accordingly, once again giving you a false sense of stability. Taking this to the extreme and copy-pasting the data $k \gg 2$ times boosts the effect further. This gives you the basic idea of what happens with positively correlated errors.

The intuition behind this result is best explained in terms of information. Say that every independently realized observation carries an amount $I$ of information about the actual process. Your inference procedure assumes that $n$ observations carry $nI$ of information, whereas in fact, the stronger the correlation, the less than $nI$ information you actually have. Hence you gain false confidence in the validity of your fitted model.
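Staying with the same illustrative sketch, you can check that the error-variance estimate barely moves while every coefficient standard error (and hence every confidence interval) shrinks by roughly $\sqrt{2}$, just as the quoted passage says:

```python
# The error-variance estimate is essentially unchanged ...
print(fit.mse_resid, fit2.mse_resid)            # sigma_hat^2 with n vs. "2n" observations

# ... but the coefficient standard errors (diagonal of sigma_hat^2 * (X'X)^{-1})
# are computed as if there were 2n independent rows, so they shrink by about sqrt(2).
print(fit.bse / fit2.bse)                       # each entry is close to 1.41

# Confidence intervals narrow by the same factor.
ci1, ci2 = fit.conf_int(), fit2.conf_int()
print((ci1[:, 1] - ci1[:, 0]) / (ci2[:, 1] - ci2[:, 0]))
```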