First off, I would get yourself a copy of this classic and approachable article and read it: Anscombe FJ. (1973) Graphs in statistical analysis. The American Statistician. 27:17–21.
On to your questions:
Answer 1: Neither the dependent nor independent variable needs to be normally distributed. In fact they can have all kinds of loopy distributions. The normality assumption applies to the distribution of the errors ($Y_{i} - \widehat{Y}_{i}$).
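To see this concretely, here is a minimal sketch (Python, assuming numpy and scipy are available; all values are illustrative): the predictor is strongly skewed, and the response inherits that skewness, yet the errors -- the thing the assumption is actually about -- are normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed (exponential) predictor -- nothing like a normal distribution.
x = rng.exponential(scale=2.0, size=500)

# True linear relationship with *normal* errors; Y inherits X's skewness.
y = 1.0 + 3.0 * x + rng.normal(scale=1.0, size=500)

# Fit OLS and inspect the residuals, which is where normality matters.
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

print(f"skewness of X:         {stats.skew(x):.2f}")          # strongly skewed
print(f"skewness of Y:         {stats.skew(y):.2f}")          # strongly skewed
print(f"skewness of residuals: {stats.skew(residuals):.2f}")  # near 0
```

Both $X$ and $Y$ are heavily skewed here, but the residuals are approximately normal, and that is all the normality assumption asks for.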
Answer 2: You are actually asking about two separate assumptions of ordinary least squares (OLS) regression:
One is the assumption of linearity. This means that the trend in $\overline{Y}$ across $X$ is expressed by a straight line (Right? Straight back to algebra: $y = a +bx$, where $a$ is the $y$-intercept, and $b$ is the slope of the line.) A violation of this assumption simply means that the relationship is not well described by a straight line (e.g., $\overline{Y}$ is a sinusoidal function of $X$, or a quadratic function, or even a straight line that changes slope at some point). My own preferred two-step approach to address non-linearity is to (1) perform some kind of non-parametric smoothing regression to suggest specific nonlinear functional relationships between $Y$ and $X$ (e.g., using LOWESS, or GAMs, etc.), and (2) to specify a functional relationship using either a multiple regression that includes nonlinearities in $X$, (e.g., $Y \sim X + X^{2}$), or a nonlinear least squares regression model that includes nonlinearities in parameters of $X$ (e.g., $Y \sim X + \max{(X-\theta,0)}$, where $\theta$ represents the point where the regression line of $\overline{Y}$ on $X$ changes slope).
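Step (2) of that approach can be sketched as follows (Python with scipy; the data, the true change point, and the starting values are all made up for illustration -- step (1), the smoothing, would typically use something like statsmodels' `lowess` and is not shown here). The model is the change-of-slope form $Y \sim X + \max{(X-\theta,0)}$ with $\theta$ estimated as a parameter:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

# Simulated data whose mean changes slope at theta = 4 (a "broken stick").
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x - 2.5 * np.maximum(x - 4.0, 0) + rng.normal(scale=0.5, size=200)

# Nonlinear least squares: Y ~ a + b*X + c*max(X - theta, 0),
# treating the change point theta as a parameter to estimate.
def broken_stick(x, a, b, c, theta):
    return a + b * x + c * np.maximum(x - theta, 0)

params, _ = curve_fit(broken_stick, x, y, p0=[0.0, 1.0, -1.0, 5.0])
a_hat, b_hat, c_hat, theta = params
print(f"estimated change point: {theta:.2f}")  # close to the true value 4
```

If instead the smoother had suggested a smooth curvature, the multiple-regression route ($Y \sim X + X^{2}$) is just an ordinary linear fit with an added squared term.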
Another is the assumption of normally distributed residuals. Sometimes one can validly get away with non-normal residuals in an OLS context; see for example, Lumley T, Emerson S. (2002) The Importance of the Normality Assumption in Large Public Health Data Sets. Annual Review of Public Health. 23:151–69. Sometimes, one cannot (again, see the Anscombe article).
However, I would recommend thinking about the assumptions in OLS not so much as desired properties of your data, but rather as interesting points of departure for describing nature. After all, most of what we care about in the world is more interesting than $y$-intercept and slope. Creatively violating OLS assumptions (with the appropriate methods) allows us to ask and answer more interesting questions.
The assumptions matter insofar as they affect the properties of the hypothesis tests (and intervals) you might use, whose distributional properties under the null are derived by relying on those assumptions.
In particular, for hypothesis tests, the things we might care about are how far the true significance level might be from what we want it to be, and whether power against alternatives of interest is good.
In relation to the assumptions you ask about:
1. Equality of variance
The variance of your dependent variable (residuals) should be equal in each cell of the design
This can certainly impact the significance level, at least when sample sizes are unequal.
(Edit:) An ANOVA F-statistic is the ratio of two estimates of variance (the partitioning and comparison of variances is why it's called analysis of variance). The denominator is an estimate of the supposedly-common-to-all-cells error variance (calculated from residuals), while the numerator, based on variation in the group means, will have two components: one from variation in the population means and one due to the error variance. If the null is true, the two variances being estimated will be the same (two estimates of the common error variance); this common but unknown value cancels out (because we took a ratio), leaving an F-statistic that depends only on the distributions of the errors (which, under the assumptions, we can show has an F distribution). (Similar comments apply to the t-test used for illustration below.)
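That partitioning is easy to verify by hand (a sketch in Python, assuming numpy and scipy; the three groups and their means are made up for illustration) -- the between-groups and within-groups variance estimates computed directly reproduce scipy's F-statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Three groups sharing a common error variance.
groups = [rng.normal(loc=m, scale=2.0, size=20) for m in (0.0, 0.5, 1.0)]

k = len(groups)                                  # number of groups
n = sum(len(g) for g in groups)                  # total sample size
grand_mean = np.mean(np.concatenate(groups))

# Numerator: variance estimate based on variation among the group means.
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)

# Denominator: pooled estimate of the common error variance, from residuals.
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)

F = ms_between / ms_within
F_scipy, p = stats.f_oneway(*groups)
print(f"F by hand: {F:.4f}, F from scipy: {F_scipy:.4f}")  # identical
```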
[There's a little bit more detail on some of that information in my answer here]
However, here the two population variances differ across the two differently-sized samples. Consider the denominator (of the F-statistic in ANOVA and of the t-statistic in a t-test) - it is composed of two different variance estimates, not one, so it will not have the "right" distribution (a scaled chi-square for the F and its square root in the case of a t - both the shape and the scale are issues).
As a result, the F-statistic or the t-statistic will no longer have the F- or t-distribution, but the manner in which it is affected is different depending on whether the large or the smaller sample was drawn from the population with the larger variance. This in turn affects the distribution of p-values.
Under the null (i.e. when the population means are equal), the p-values should be uniformly distributed. However, if the variances and the sample sizes are unequal but the means are equal (so we don't want to reject the null), the p-values are not uniformly distributed. I did a small simulation to show you what happens. In this case, I used only 2 groups, so the ANOVA is equivalent to a two-sample t-test with the equal-variance assumption. I simulated samples from two normal distributions, one with standard deviation ten times as large as the other, but with equal means.

For the left-side plot, the larger (population) standard deviation went with n=5 and the smaller standard deviation went with n=30. For the right-side plot, the larger standard deviation went with n=30 and the smaller with n=5. I simulated each case 10000 times and found the p-value each time. In each case you want the histogram to be completely flat (rectangular), since this means all tests conducted at some significance level $\alpha$ will actually get that type I error rate. In particular, it's most important that the leftmost parts of the histogram stay close to the grey line:
As we see, in the left-side plot (larger variance in the smaller sample) the p-values tend to be very small -- we would reject the null hypothesis very often (nearly half the time in this example) even though the null is true. That is, our significance level is much larger than we asked for. In the right-side plot we see the p-values are mostly large (and so our significance level is much smaller than we asked for) -- in fact not once in ten thousand simulations did we reject at the 5% level (the smallest p-value here was 0.055). [This may not sound like such a bad thing, until we remember that we will also have very low power to go with our very low significance level.]
That's quite a consequence. This is why it's a good idea to use a Welch-Satterthwaite type t-test or ANOVA when we don't have a good reason to assume the variances will be close to equal -- by comparison, the Welch test is barely affected in these situations (I simulated this case as well; the two distributions of simulated p-values - which I have not shown here - came out quite close to flat).
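The simulation above can be sketched like this (Python with scipy; not the original code, so the exact counts will differ, but the pattern is the same). It covers the bad left-side case -- larger standard deviation in the smaller sample -- and shows the Welch version alongside:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 10_000
alpha = 0.05

# Equal means, unequal variances: sd 10 in the *smaller* sample
# (the bad case from the left-side plot).
reject_pooled = reject_welch = 0
for _ in range(n_sims):
    small = rng.normal(0, 10, size=5)
    large = rng.normal(0, 1, size=30)
    _, p_pooled = stats.ttest_ind(small, large, equal_var=True)   # pooled t
    _, p_welch = stats.ttest_ind(small, large, equal_var=False)   # Welch t
    reject_pooled += p_pooled < alpha
    reject_welch += p_welch < alpha

print(f"pooled t-test type I error: {reject_pooled / n_sims:.3f}")  # far above 0.05
print(f"Welch t-test type I error:  {reject_welch / n_sims:.3f}")   # close to 0.05
```

Swapping the sample sizes (sd 10 with n=30) reproduces the right-side plot's conservative behaviour instead.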
2. Conditional distribution of the response (DV)
Your dependent variable (residuals) should be approximately normally distributed for each cell of the design
This is somewhat less directly critical - for moderate deviations from normality, the significance level is not much affected in larger samples (though the power can be!).
Here's one example, where the values are exponentially distributed (with identical distributions and sample sizes), where we can see this significance level issue being substantial at small $n$ but reducing with large $n$.
We see that at n=5 there are substantially too few small p-values (the significance level for a 5% test would be about half what it should be), but
at n=50 the problem is reduced -- for a 5% test in this case the true significance level is about 4.5%.
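A two-sample version of that simulation can be sketched as follows (Python with scipy; again not the original code, and the rate parameter and sample sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims = 10_000
results = {}

# Two groups with *identical* exponential distributions (the null is true),
# at a small and a moderately large per-group sample size.
for n in (5, 50):
    rejections = 0
    for _ in range(n_sims):
        a = rng.exponential(1.0, size=n)
        b = rng.exponential(1.0, size=n)
        rejections += stats.ttest_ind(a, b).pvalue < 0.05
    results[n] = rejections / n_sims
    print(f"n={n}: true significance level of a nominal 5% test ~ {results[n]:.3f}")
```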
So we might be tempted to say "well, that's fine, if n is big enough to get the significance level to be pretty close", but we may also be throwing away a good deal of power. In particular, it's known that the asymptotic relative efficiency of the t-test relative to widely used alternatives can go to 0. This means that better test choices can get the same power with a vanishingly small fraction of the sample size the t-test would require. You don't need anything out of the ordinary to be going on to need more than, say, twice as much data to have the same power with the t as you would need with an alternative test - moderately heavier-than-normal tails in the population distribution and moderately large samples can be enough to do it.
(Other choices of distribution may make the significance level higher than it should be, or substantially lower than we saw here.)
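The power loss is easy to demonstrate by simulation (a sketch in Python with scipy, using the Mann-Whitney/Wilcoxon rank-sum test as one widely used alternative; the heavy-tailed distribution, shift, and sample size are all illustrative choices, not from the text above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sims = 5_000
n = 30
shift = 1.0  # true location difference between the two groups

power_t = power_w = 0
for _ in range(n_sims):
    # Heavy-tailed populations: t-distribution with 3 degrees of freedom.
    a = rng.standard_t(3, size=n)
    b = rng.standard_t(3, size=n) + shift
    power_t += stats.ttest_ind(a, b).pvalue < 0.05
    power_w += stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < 0.05

print(f"t-test power:       {power_t / n_sims:.3f}")
print(f"Mann-Whitney power: {power_w / n_sims:.3f}")  # noticeably higher
```

With heavier tails than these (or contaminated distributions), the gap widens further, which is the asymptotic-relative-efficiency point above.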
Best Answer
Here is an example where the variance of $\varepsilon$ is not constant (the variances of the residuals are larger for larger $x$):
and an example where $\varepsilon$ is not normally distributed (and so the residuals are not normally distributed):
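Since the original figures are not shown here, a sketch of how both kinds of violation look in the residuals (Python, assuming numpy and scipy; the particular models are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(1, 10, size=300))

# Example 1: non-constant variance -- the error spread grows with x.
y1 = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
b1, a1 = np.polyfit(x, y1, 1)
res1 = y1 - (a1 + b1 * x)
lower, upper = res1[x < 5.5], res1[x >= 5.5]
print(f"residual sd for small x: {lower.std():.2f}")
print(f"residual sd for large x: {upper.std():.2f}")  # much larger

# Example 2: non-normal errors -- centered exponential noise, so the
# residuals are strongly right-skewed rather than normal.
y2 = 2.0 + 0.5 * x + (rng.exponential(1.0, size=300) - 1.0)
b2, a2 = np.polyfit(x, y2, 1)
res2 = y2 - (a2 + b2 * x)
print(f"skewness of residuals: {stats.skew(res2):.2f}")  # far from 0
```

A residuals-vs-fitted plot of `res1` would show the classic fan shape; a Q-Q plot of `res2` would bend away from the reference line.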