Solved – Testing assumptions of multiple regression

heteroscedasticity, multiple regression

When doing a multiple regression and testing for homoscedasticity
some people look at raw observations and others the residuals. Which
is correct?
Do you use raw data or residuals to test linearity?
Do you test the homoscedasticity for each IV against the DV or do
you put all IVs in at the same time and then test for
homoscedasticity?
When do you test the assumptions: before running the analysis, after, or both?
What order do you do these things? Do you do any twice?
- test for linearity
- test for normal distribution
- test for equal variances
- run the multiple linear regression

Best Answer

The answers mostly derive from considering the question 'what is actually being assumed?'.

Do you know the actual assumptions?

(Note that the distributional assumptions are conditional, not marginal.)
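A minimal sketch of why that distinction matters, on simulated data (the variable names, parameter values, and the `qq_cor` helper are illustrative assumptions, not part of the question): the raw response is strongly bimodal, yet the errors around the fitted mean are normal by construction.

```r
set.seed(1)
n <- 500
group <- rbinom(n, 1, 0.5)          # hypothetical binary predictor
y <- 2 + 10 * group + rnorm(n)      # errors are N(0, 1) by construction

fit <- lm(y ~ group)

# Correlation between sorted values and normal quantiles:
# close to 1 when the sample looks normal, lower otherwise
qq_cor <- function(v) cor(sort(v), qnorm(ppoints(length(v))))

qq_marginal    <- qq_cor(y)            # raw response: bimodal
qq_conditional <- qq_cor(resid(fit))   # residuals: approximately normal
c(marginal = qq_marginal, conditional = qq_conditional)

if (interactive()) {                   # draw the plots only in an interactive session
  op <- par(mfrow = c(1, 2))
  qqnorm(y, main = "Raw y");              qqline(y)
  qqnorm(resid(fit), main = "Residuals"); qqline(resid(fit))
  par(op)
}
```

Looking only at the raw `y` here would suggest a violated normality assumption, even though the model's actual assumption (about the errors given the predictor) holds.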

1 When doing a multiple regression and testing for homoscedasticity some people look at raw observations and others the residuals. Which is correct?

What's the actual assumption here?

2 Do you use raw data or residuals to test linearity?

Which shows deviations from the model assumptions best?

3 Do you test the homoscedasticity for each IV against the DV or do you put all IVs in at the same time and then test for homoscedasticity?

See (1)

4 When do you test the assumptions: before running the analysis, after, or both?

What exactly do you mean by 'running the analysis' here?

(If you use residuals, how would you do it before doing the calculations?)

If you mean 'before/after doing the formal inference based on the model fit', I'd normally say 'notionally before', but in what actual way would the order make a difference?

5 What order do you do these things?

This question is confusing. The last part:

test for linearity; test for normal distribution; test for equal variances; run the multiple linear regression

should have been right after the word 'things', like so:

5 What order do you do these things (check for linearity; check for normal distribution; check for equal variances; run the multiple linear regression)?

Again, if you use residuals for anything, how would you check (NB check, not test) those assumptions before calculating the residuals?

You can't check the assumption relating to conditional variance if linearity doesn't hold.

You can't check the assumption relating to normality if homoscedasticity doesn't hold.

Linearity is the basic assumption ('is my model for the mean appropriate?').

Variance is the next most important, and can't be checked until linearity is at least approximately satisfied.

Normality is least important (if sample sizes aren't small... unless you're producing prediction intervals - then it matters even at large sample sizes) and can't be checked unless your data is at least approximately homoscedastic.
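One way to follow this order in R, sketched on simulated data (the data, the variable names, and the choice of diagnostic plots are assumptions of this example, not something prescribed above): fit the model with all IVs first, then examine the residuals for the mean, then the spread, then the distribution.

```r
set.seed(42)
n  <- 300
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)   # linear mean, constant variance

fit <- lm(y ~ x1 + x2)   # all IVs enter the model together
r   <- resid(fit)
f   <- fitted(fit)

if (interactive()) {     # draw the plots only in an interactive session
  op <- par(mfrow = c(1, 3))
  plot(fit, which = 1)   # 1. mean: residuals vs fitted, look for trend or curvature
  plot(fit, which = 3)   # 2. spread: scale-location plot, look for changing spread
  plot(fit, which = 2)   # 3. distribution: normal Q-Q plot of the residuals
  par(op)
}
```

All three displays are computed from the fitted model, which is why the regression has to be run before any of them can be examined.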

Do you do any twice?

Only where it would make a difference to do so.

Related Solutions

Multiple Regression – Using R^2 to Test the Linearity Assumption in Multiple Regression Analysis

Note that the linearity assumption you're speaking of only says that the conditional mean of $Y_i$ given $X_i$ is a linear function. You cannot use the value of $R^2$ to test this assumption.

This is because $R^2$ is merely the squared correlation between the observed and predicted values, and the value of a correlation coefficient does not uniquely determine the relationship between $X$ and $Y$ (linear or otherwise). Both of the following scenarios are possible:

(1) High $R^2$ but the linearity assumption is still wrong in an important way
(2) Low $R^2$ but the linearity assumption is still satisfied

I will discuss each in turn:

(1) High $R^2$ but the linearity assumption is still wrong in an important way: The trick here is to exploit the fact that correlation is very sensitive to outliers. Suppose you have predictors $X_1, ..., X_n$ that are generated from a mixture distribution that is standard normal $99\%$ of the time and a point mass at $M$ the other $1\%$, and a response variable that is

$$ Y_i = \begin{cases} Z_i & {\rm if \ } X_i \neq M \\ M & {\rm if \ } X_i = M \\ \end{cases} $$

where $Z_i \sim N(\mu,1)$ and $M$ is a positive constant much larger than $\mu$, e.g. $\mu=0, M=10^5$. Then $X_i$ and $Y_i$ will be almost perfectly correlated:

# flag roughly 1% of the observations
u = runif(1e4) > .99
x = rnorm(1e4)
x[u] = 1e5          # point mass at M = 1e5
y = rnorm(1e4)
y[x == 1e5] = 1e5   # Y equals M exactly when X equals M
cor(x, y)
[1] 1

despite the fact that the expected value of $Y_i$ given $X_i$ is not linear - in fact it is a discontinuous step function and the expected value of $Y_i$ doesn't even depend on $X_i$ except when $X_i = M$.

(2) Low $R^2$ but the linearity assumption still satisfied: The trick here is to make the amount of "noise" around the linear trend large. Suppose you have a predictor $X_i$ and response $Y_i$ and the model

$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$

is the correct model. Then the conditional mean of $Y_i$ given $X_i$ is a linear function of $X_i$, so the linearity assumption is satisfied. If ${\rm var}(\varepsilon_i) = \sigma^2$ is large relative to $\beta_1$ then $R^2$ will be small. For example,

x = rnorm(200)
y = 1 + 2*x + rnorm(200,sd=5)
cor(x,y)^2
[1] 0.1125698

Therefore, assessing the linearity assumption is not a matter of seeing whether $R^2$ lies within some tolerable range, but it is more a matter of examining scatter plots between the predictors/predicted values and the response and making a (perhaps subjective) decision.

Re: What to do when the linearity assumption is not met and transforming the IVs also doesn't help?!!

When non-linearity is an issue, it may be helpful to look at plots of the residuals vs. each predictor. Any noticeable pattern can indicate non-linearity in that predictor. For example, if this plot reveals a "bowl-shaped" relationship between the residuals and the predictor, this may indicate a missing quadratic term in that predictor. Other patterns may indicate a different functional form. In some cases, it may be that you haven't tried the right transformation, or that the true model isn't linear in any transformed version of the variables (although it may be possible to find a reasonable approximation).
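The bowl-shaped case can be sketched with simulated data (the data-generating model and all names here are assumptions chosen for illustration): the residuals from a fit that omits the quadratic term still track $x^2$, and adding the term removes the pattern.

```r
set.seed(7)
n <- 200
x <- runif(n, -3, 3)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(n)   # true mean is quadratic in x

fit_lin  <- lm(y ~ x)             # omits the quadratic term
fit_quad <- lm(y ~ x + I(x^2))    # includes it

# Residuals from the linear fit still carry the x^2 signal ...
cor_lin  <- cor(resid(fit_lin), x^2)
# ... while residuals from the quadratic fit are uncorrelated with x^2
cor_quad <- cor(resid(fit_quad), x^2)
c(linear = cor_lin, quadratic = cor_quad)

if (interactive()) {              # draw the plots only in an interactive session
  op <- par(mfrow = c(1, 2))
  plot(x, resid(fit_lin),  main = "y ~ x");          abline(h = 0, lty = 2)
  plot(x, resid(fit_quad), main = "y ~ x + I(x^2)"); abline(h = 0, lty = 2)
  par(op)
}
```

The left-hand plot shows the bowl shape; the right-hand plot should show a patternless band around zero.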

Regarding your example: Based on the predicted vs. actual plots (1st and 3rd plots in the original post) for the two different dependent variables, it seems to me that the linearity assumption is tenable for both cases. In the first plot, it looks like there may be some heteroskedasticity, but the relationship between the two does look pretty linear. In the second plot, the relationship looks linear, but the strength of the relationship is rather weak, as indicated by the large scatter around the line (i.e. the large error variance) - this is why you're seeing a low $R^2$.

Solved – Ways of Testing Linearity Assumption in Multiple Regression apart from Residual Plots

What you can do is fit a model that relaxes the linearity assumption, using, e.g., splines, and compare it with the model that assumes linearity. For example, in R, for a linear regression model you can do something like this:

library("splines")

# linear effect of age on y
fm_linear <- lm(y ~ age + sex, data = your_data)

# nonlinear effect of age on y using natural cubic splines
fm_non_linear <- lm(y ~ ns(age, 3) + sex, data = your_data)

# F-test between the two models
anova(fm_linear, fm_non_linear)
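To try the comparison end to end, here is a self-contained sketch in which `your_data` is filled with simulated values (the U-shaped age effect, the sample size, and the coefficients are assumptions made up for this example):

```r
library(splines)

set.seed(123)
n <- 400
your_data <- data.frame(
  age = runif(n, 20, 80),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Hypothetical truth: a U-shaped effect of age, so a straight line is wrong
your_data$y <- 0.01 * (your_data$age - 50)^2 +
  (your_data$sex == "M") + rnorm(n)

fm_linear     <- lm(y ~ age + sex, data = your_data)
fm_non_linear <- lm(y ~ ns(age, 3) + sex, data = your_data)

# F-test between the two nested models
cmp <- anova(fm_linear, fm_non_linear)
p_value <- cmp[["Pr(>F)"]][2]   # small values favour the spline model
p_value
```

With a nonlinear effect this strong the F-test clearly prefers the spline model. With a truly linear effect the two fits would be close and the p-value would usually be unremarkable.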
