Linear Regression – How Data Violating Assumptions Appears

Tags: assumptions, heteroscedasticity, normality-assumption, regression

Linear regression makes two assumptions about the residuals:

  • The residuals should have constant variance (for every level of the predictor).

  • The residuals should follow a normal distribution.

Is it possible to visualize how the data itself, not the residuals, would look if one of these assumptions were violated?

I am looking for a visual example that clearly demonstrates why these assumptions are necessary.

Best Answer

Here is an example where the variance of $\varepsilon$ is not constant (the variance of the residuals increases with $x$):

    set.seed(2021)
    x1 <- 1:100
    epsilon1 <- rnorm(100, 0, x1)  # standard deviation grows with x1
    y1 <- 3*x1 + 100 + epsilon1
    plot(x1, y1)
    abline(lm(y1 ~ x1))

[Figure: scatterplot of y1 against x1 with the fitted line; the vertical spread of the points widens as x1 increases.]
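
The violation is even easier to see in a standard residual diagnostic; as a sketch using base R on the fit above (`fit1` is just a name for the fitted model), the residuals-versus-fitted plot fans out from left to right:

    fit1 <- lm(y1 ~ x1)
    # Funnel shape: the residual spread grows with the fitted values,
    # so the constant-variance assumption is violated
    plot(fitted(fit1), resid(fit1))
    abline(h = 0)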

and an example where $\varepsilon$ is not normally distributed (and so the residuals are not normally distributed):

    set.seed(2021)
    x2 <- 1:100
    epsilon2 <- 100 * (rbinom(100, 1, 1/2) - 1/2)  # errors are -50 or +50, each with probability 1/2
    y2 <- 3*x2 + 100 + epsilon2
    plot(x2, y2)
    abline(lm(y2 ~ x2))

[Figure: scatterplot of y2 against x2 with the fitted line; the points split into two parallel bands about 100 units apart.]
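
The non-normality also shows up in a standard diagnostic; as a sketch, a normal Q-Q plot of the residuals from a fit of the model above (`fit2` is just a name for the fitted model):

    fit2 <- lm(y2 ~ x2)
    # The residuals sit near -50 and +50, so the Q-Q plot shows two
    # flat clusters instead of following the straight reference line
    qqnorm(resid(fit2))
    qqline(resid(fit2))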