Solved – Heteroskedasticity in linear regression model & data transformation


Please correct me if I'm wrong. Of the econometrics literature I've read so far, most sources say that heteroskedasticity is not a major problem empirically, that multicollinearity poses a greater concern to researchers, and that data transformation can improve empirical results but does not completely remove heteroskedasticity. Is this right?

Based on the above, I used R to fit a basic linear regression with one regressor and three dummies, then conducted a Breusch-Pagan test using package 'lmtest' and a VIF check using package 'car'. Thereafter, I fitted a second regression, log-transforming the dependent and independent variables, and ran the same tests as before.

For both models (m1 for the former, m2 for the latter), all VIFs were below 5, so it seems reasonable to say multicollinearity isn't a problem.

However, for the Breusch-Pagan test, m1 failed to reject the null while m2 (logged) rejected it. Can someone enlighten me as to why this result is not consistent with the theory above?

Some related questions:

  1. Would it be better not to log-transform the model?

  2. I suspect that the regressor is correlated with the residuals. Are there any tests 'out there' for endogeneity?

  3. If the log model is preferred, what can I then do to 'lessen' heteroskedasticity in m2?

Thanks in advance!


Best Answer

Actually, I'd say just the opposite. Multicollinearity is often scoffed at as a concern. The only time it is a real issue is when one variable can be written as an exact linear function of others in the model (a male dummy variable is exactly equal to the constant/intercept term minus a female dummy variable; hence, you can't have all three in your model). A prime example is Goldberger's comparison of multicollinearity to "micronumerosity."

Perfect multicollinearity means that your model cannot be estimated at all. Imperfect multicollinearity often leads to large standard errors, but no bias or other real problems. Heteroskedasticity means that your standard errors are incorrect and your estimates are inefficient.
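The dummy-trap example above can be made concrete with a toy rank check (the variable names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
female = rng.integers(0, 2, n).astype(float)
male = 1.0 - female   # exact linear function of the intercept and the female dummy

# constant + female: full column rank, estimable
X_ok = np.column_stack([np.ones(n), female])
# constant + female + male: one column is a linear combination of the others
X_bad = np.column_stack([np.ones(n), female, male])

print(np.linalg.matrix_rank(X_ok))   # 2
print(np.linalg.matrix_rank(X_bad))  # still 2, not 3: X'X is singular, OLS is not identified
```

Because the third matrix has only rank 2, (X'X) cannot be inverted, which is precisely why software either refuses to estimate the model or silently drops one of the dummies.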

First, I would build the model so that the parameter estimates have the interpretation I want (level change, percent change, etc.), using logs as appropriate. Then I would test for heteroskedasticity. The most accepted option is simply to use robust standard errors, which give you correct standard errors while leaving the (inefficient) OLS point estimates unchanged. Alternatively, you can use weighted least squares to get efficient estimates, but this has become less common unless you know the relationship between the variances of your observations (for instance, each variance depends on the size of the observation, like the population of a country). Indeed, in cross-section econometrics with a data set of any real size, robust standard errors have become required irrespective of the outcome of a BP test; they are applied almost automatically.

There isn't a good test for endogeneity. Your real problem is that the regressor is correlated with the error term, but OLS forces the regressors to be uncorrelated with the residuals by construction, so you won't find any correlation there no matter how endogenous the regressor is. Endogeneity is what makes econometrics hard and is a whole topic unto itself.