Linear Regression – How to Handle Non-Normally Distributed Data [Duplicate]

bias, bias correction, heteroscedasticity, linear, regression

I am trying to understand the relationship between royalties received (independent variable) and health expenditures (dependent variable) for each municipality through a linear regression.

My hypothesis is that the effect of the independent variable on the dependent is very low.

The problem is that the data is not normally distributed:

[Image: distribution of the raw data]

So I made a log-log transformation to make the data more "normal":
[Image: distribution of the data after the log-log transformation]

The model indicated a significant positive relationship between the two variables: a 1% increase in royalties received is associated with a 0.08% increase in health expenditures.
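As a quick sanity check on that reading of the coefficient, here is a minimal sketch. It assumes the fitted model has the form log(y) = a + beta * log(x) with beta = 0.08; the function name `pct_change_in_y` is made up for illustration.

```python
# Slope of the log-log fit as reported in the question (assumed to be 0.08).
beta = 0.08


def pct_change_in_y(pct_change_in_x, elasticity):
    """Exact % change in y implied by log(y) = a + elasticity * log(x)."""
    ratio = (1 + pct_change_in_x / 100) ** elasticity
    return (ratio - 1) * 100


# A 1% increase in x: the result is very close to the coefficient itself.
print(round(pct_change_in_y(1, beta), 4))

# Doubling x (a 100% increase): the "beta percent" shortcut no longer applies,
# so the exact formula is used instead.
print(round(pct_change_in_y(100, beta), 2))
```

For small changes the coefficient can be read directly as a percentage effect; for large changes, such as a doubling of royalties (roughly a 5.7% increase in expenditures under this model), the exact formula is the safer one.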

[Image: model output]

The p-value was significant (0.0001).

The Multiple R-squared is 0.0232, which is apparently not a problem, as my goal is to show that one variable affects the other very little.

The problem arose when I checked for homoscedasticity, which produced the diagnostic plots below.

[Image: residual diagnostic plots]

Is there any adjustment I can make to correct the bias shown by the graphs? Restricting the set of municipalities studied, perhaps?

Thank you in advance!

Best Answer

You can do linear regression on non-normal data. The assumption is that the errors are normally distributed, which you check through the residuals before making any inferences about p-values, coefficient confidence intervals, etc. That being said, a log transform can often make this assumption more plausible. I think what you are doing is fine as a start to understanding the coefficient value in your model.
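To make the point about residuals concrete, here is a minimal sketch using only the Python standard library. The data are synthetic stand-ins (the real municipality data are not available here), and the helper names `fit_ols`, `residuals` and `skew_and_excess_kurtosis` are made up for illustration. The idea is to look at the residuals, not at the raw variables.

```python
import random


def fit_ols(x, y):
    """Closed-form simple OLS for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """Observed minus fitted values from the OLS line."""
    a, b = fit_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both ~0 for normal data)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Synthetic stand-in for log(royalties) and log(health expenditures).
random.seed(0)
log_x = [random.uniform(5, 15) for _ in range(2000)]
log_y = [10 + 0.08 * lx + random.gauss(0, 1) for lx in log_x]

skew, kurt = skew_and_excess_kurtosis(residuals(log_x, log_y))
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

Values far from zero would suggest the residuals are not close to normal. A Q-Q plot of the residuals is the more usual visual check; the two summary numbers here are just a rough substitute that needs no plotting library.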

Regarding the homoscedasticity, you could try adding or removing variables, other transformations, or possibly weighted least squares. You could also consider a simple t-test that allows the standard deviation to differ between groups, but you would have to split your data into groups. I would also consider a Poisson regression, as you have count data, but I am not sure it will help much.
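For reference, a Poisson regression with a log link can be sketched in a few lines of plain Python. This is only an illustration on simulated counts, not the asker's data: the coefficients 0.5 and 0.8, the sampler `sample_poisson` and the fitter `fit_poisson` are all assumptions made up for the example, and in practice a statistics package's GLM routine would be the natural choice.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def fit_poisson(x, y, iterations=25):
    """Newton-Raphson fit of log(mu) = b0 + b1*x; returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2x2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated counts with true coefficients b0 = 0.5 and b1 = 0.8.
rng = random.Random(1)
xs = [rng.uniform(0, 2) for _ in range(2000)]
ys = [sample_poisson(math.exp(0.5 + 0.8 * xi), rng) for xi in xs]

b0_hat, b1_hat = fit_poisson(xs, ys)
print(f"b0={b0_hat:.3f}, b1={b1_hat:.3f}")
```

The response is modelled on its original scale, so no log transform of y is needed, and the variance is allowed to grow with the mean by construction.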

Thinking about this a bit more out of my own interest: your count data may simply be heteroscedastic by the nature of a Poisson distribution, whose variance equals its mean, even though you are taking the log transform. I would try a Poisson regression without taking the log. If you decide to keep the log and deal with the heteroscedasticity in your OLS, switching to weighted least squares makes sense, and you can estimate the weights by regressing the absolute residuals onto X. There are alternatives to WLS as well.
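The weight-estimation recipe just described can be sketched as follows. This is a rough illustration on simulated data, assuming a single predictor and an error spread that grows linearly with X; the helper names `weighted_fit` and `feasible_wls` are made up for the example.

```python
import random


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def feasible_wls(x, y):
    """Estimate weights from absolute OLS residuals, then refit with WLS."""
    ones = [1.0] * len(x)
    # Step 1: ordinary least squares (all weights equal).
    a0, b0 = weighted_fit(x, y, ones)
    abs_res = [abs(yi - (a0 + b0 * xi)) for xi, yi in zip(x, y)]
    # Step 2: regress |residual| on x to model how the spread changes with x.
    c0, c1 = weighted_fit(x, abs_res, ones)
    # Clip the fitted spread at a small positive value to keep weights finite.
    spread = [max(c0 + c1 * xi, 1e-6) for xi in x]
    # Step 3: weights are the inverse of the squared fitted spread.
    weights = [1.0 / s ** 2 for s in spread]
    return weighted_fit(x, y, weights), weights


# Simulated heteroscedastic data: the error SD is proportional to x.
random.seed(42)
xs = [random.uniform(1, 10) for _ in range(2000)]
ys = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in xs]

(a_hat, b_hat), w = feasible_wls(xs, ys)
print(f"intercept={a_hat:.3f}, slope={b_hat:.3f}")
```

Observations with a larger estimated spread get smaller weights, so the noisy part of the data pulls less on the fitted line. With several predictors the same idea applies, but the absolute residuals would be regressed on all of them (or on the fitted values).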