[Math] Overcoming Linear Regression Assumptions

regression, statistics

I'm a beginner in econometrics (learning on my own, not in school) and I'm trying to build an intuition for understanding linear regression. We know that modeling real-world data is bound to break many of the rules surrounding linear regression. Does breaking any of those rules render linear regression ineffective? Or should we test the materiality of breaking those rules? How do we go about testing that materiality? Otherwise, what alternatives should we use?

The assumptions underlying linear regression state that:

  1. Expected value of the error term, conditional on the independent variable, is zero.
    Qn: What happens if it's not zero, how do you test for it, and when is it considered significant?

  2. All x, y observations are i.i.d.
    Qn: How would that affect my regression results if they weren't?

  3. It is unlikely that large outliers will be observed in the data.
    Qn: What should you do if large outliers are observed? Should you ignore those data points? If yes, how would you know they can be ignored?

  4. The independent variable is uncorrelated with the error terms.
    Qn: How do you test for this correlation, and are there ways to mitigate this correlation?

  5. Homoskedasticity
    Qn: What if the variance of the error term is not constant? What should you do?

Best Answer

I have had numerous discussions about this on CrossValidated. Linear regression is a very useful tool, but the assumptions made when modeling in practice never hold exactly. George Box is known for saying, "All models are wrong, but some are useful." The key assumptions for applying OLS regression are:

  1. The error components of the model are i.i.d. normal with zero mean and constant variance.
  2. The covariates X are measured without error.
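
Concretely, in a standard formulation of the simple linear model these assumptions say

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n,$$

with the $x_i$ observed without error.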

The assumptions in 1 can be tested on the computed model residuals. Look at a Q-Q plot of the residuals, or perform a goodness-of-fit test for normality such as the Shapiro-Wilk test. You can apply a t test on the residuals to test for zero mean. If you plot the residuals against the covariates, you can see visually whether the variance changes with the covariate and whether the mean departs from zero. Slight departures from normality will not hurt the model, but large departures and outliers will. If the assumptions are severely violated, you can use a robust regression procedure. Robust regression methods use a different loss function that does not penalize large individual errors as heavily. Least squares puts a high penalty on large errors and thus forces the curve to fit the outliers too well. Robust procedures such as minimum absolute deviation (MAD) regression do not.
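
Here is a minimal sketch of those diagnostics and the robust alternative, assuming the statsmodels/scipy stack and illustrative synthetic data (variable names and parameters are my own, not from the original answer):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.standard_t(df=3, size=200)  # heavy-tailed errors

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
resid = ols.resid

# Goodness-of-fit test for normality of the residuals (Shapiro-Wilk)
w_stat, p_normal = stats.shapiro(resid)

# One-sample t test that the residuals have zero mean
t_stat, p_zero_mean = stats.ttest_1samp(resid, 0.0)

# Visual check: Q-Q plot of the residuals; a residuals-vs-covariate
# scatter would similarly reveal changing variance or a nonzero mean.
fig = sm.qqplot(resid, line="45", fit=True)

# Robust alternatives when outliers dominate: Huber M-estimation, or
# median (least-absolute-deviation) regression, matching the MAD idea.
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
lad = sm.QuantReg(y, X).fit(q=0.5)
print(ols.params, huber.params, lad.params)
```

With heavy-tailed errors like these, the Huber and median fits typically stay closer to the true coefficients than plain OLS, because their loss functions grow more slowly than the squared error for large residuals.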

Some problems with the residuals can be due to (1) missing covariates or (2) nonlinearity in the parameters. These can be remedied by introducing additional covariates in case (1) or applying a nonlinear regression model in case (2).
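
As a sketch of remedy (1), again with assumed synthetic data: if the residual-vs-covariate plot shows curvature, adding a transformed covariate (here an illustrative quadratic term) can fix it. For case (2), nonlinearity in the parameters, a tool such as scipy.optimize.curve_fit can fit a user-specified nonlinear mean function by least squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0, 1, 200)  # true relation is quadratic

fit_lin = sm.OLS(y, sm.add_constant(x)).fit()  # misspecified: linear term only
fit_quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
print(fit_lin.rsquared, fit_quad.rsquared)  # the added term absorbs the curvature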
