Linear Model Heteroscedasticity – Causes and Solutions

Tags: data transformation, heteroscedasticity, linear model, regression

I have the following linear model:

[Plot: linear model residuals vs. fitted values]
[Plot: distribution of observations]

To address the heteroscedasticity of the residuals, I tried applying a log transformation to the dependent variable, $\log(Y + 1)$, but I still see the same fan-out effect in the residuals. The DV values are relatively small, so adding the constant 1 before taking the log is probably not appropriate in this case.

> summary(Y)
Min.   :-0.0005647  
1st Qu.: 0.0001066  
Median : 0.0003060  
Mean   : 0.0004617  
3rd Qu.: 0.0006333  
Max.   : 0.0105730  
NA's   :30.0000000

How can I transform the variables to improve the prediction error and variance, particularly for the largest fitted values?

Best Answer

What is your goal? We know that heteroskedasticity does not bias our coefficient estimates; it only makes our standard errors incorrect. Hence, if you only care about the fit of the model, then heteroskedasticity doesn't matter.

You can get a more efficient model (i.e., one with smaller standard errors) by using weighted least squares. Here you estimate the variance of each observation and weight each observation by the inverse of that observation-specific variance (passed via the weights argument to lm). Note that this procedure changes your coefficient estimates.
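
A minimal sketch of what this might look like as two-step feasible WLS in R; the simulated x and y, and the choice to model the log squared residuals as a function of the fitted values, are illustrative assumptions rather than anything from the original question:

## Feasible WLS sketch on simulated data (illustrative assumptions).
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.2 + 0.5 * x)  # error SD grows with x

fit_ols <- lm(y ~ x)

## Step 1: model the error variance by regressing the log of the
## squared OLS residuals on the fitted values.
var_fit    <- lm(log(resid(fit_ols)^2) ~ fitted(fit_ols))
sigma2_hat <- exp(fitted(var_fit))

## Step 2: refit, weighting each observation by 1 / estimated variance.
fit_wls <- lm(y ~ x, weights = 1 / sigma2_hat)
summary(fit_wls)

Other variance models (e.g., regressing absolute residuals on a predictor) are equally legitimate; the key point is that the weights are the inverse of the estimated observation-level variances.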

Alternatively, to correct the standard errors for heteroskedasticity without changing your coefficient estimates, you can use robust standard errors. For an R implementation, see the sandwich package.
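
For example, a short sketch using sandwich together with the companion lmtest package, reusing the simulated data from above (the choice of the HC3 estimator is one common default, not a requirement):

## Heteroskedasticity-consistent (HC) standard errors.
## Coefficients are identical to plain OLS; only the SEs change.
library(sandwich)
library(lmtest)

fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))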

Using the log transformation can be a good approach to correcting for heteroskedasticity, but only if all your values are positive (your summary shows a negative minimum, so a plain $\log(Y)$ is not available here) and the transformed model has a reasonable interpretation relative to the question you are asking.
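
For completeness, a small sketch of fitting on the log scale when the response is strictly positive; y_pos is a made-up positive response for illustration, not the asker's data:

## Log-scale fit for a strictly positive response (illustrative).
y_pos   <- exp(1 + 2 * x + rnorm(200, sd = 0.5))
fit_log <- lm(log(y_pos) ~ x)
summary(fit_log)  # coefficients act multiplicatively on the original scale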