Solved – heteroscedasticity, residual vs. independent X variables in a multiple regression

Tags: heteroscedasticity, predictor, regression, residuals

My Y variable varies between 0 and 1 with increments of 0.067. I have a lot of zeros in my data. My questions are:

1. Is the residual vs. fitted plot below OK? Or, as suggested in "Heteroscedasticity in residuals vs. fitted plot" and "How should I interpret this residual plot?", is there a floor effect? Do I need to consider a different type of model, perhaps a logistic model?
2. Neter et al. (1989, p. 247) advise that 'residuals should be plotted against each independent variables.' If moderate heteroscedasticity is not an issue in the residual vs. fitted plot, do I have to check for heteroscedasticity in the residuals vs. each individual X variable? To put the same question differently: if my residual vs. fitted plot is fine for the model Y ~ X1 + X2 + X3 + X4 + X5 (not the one in the image above), but the residual vs. X4 plot shows heteroscedasticity, what do I do?

Best Answer

You have count data - use a model appropriate for this: Based on your description of the data and your residual plot, I suspect that your response variable is a proportion with a fixed denominator, which means that it is derived from underlying count data (i.e., non-negative integers up to a fixed, known maximum). That is why you get lines of values in your residual plot when you use OLS estimation. In such cases, the error term in the model is not normally distributed, and you will probably get a better fit from a model designed for this kind of count data (e.g., a binomial GLM).
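
A minimal sketch of what such a binomial GLM could look like in R. It assumes the 0.067 increment corresponds to 1/15, i.e. each response is a count out of 15 trials. That denominator, the simulated data, and the names n_trials, x1, successes, and counts are illustrative assumptions, not taken from the question.

```r
# Hypothetical data: proportions with a fixed denominator of 15
# (assumed from the 0.067 increment, since 1/15 is roughly 0.067).
set.seed(1)
n_trials  <- 15
x1        <- rnorm(200)
successes <- rbinom(200, size = n_trials, prob = plogis(-2 + 0.8 * x1))
y         <- successes / n_trials   # the 0-1 response as it would be observed

# Recover the underlying counts from the observed proportions.
counts <- round(y * n_trials)

# Binomial GLM: the response is a two-column matrix (successes, failures).
fit <- glm(cbind(counts, n_trials - counts) ~ x1, family = binomial)
summary(fit)

# Ratio of residual deviance to residual df; values well above 1
# would hint at overdispersion relative to the binomial model.
deviance(fit) / df.residual(fit)
```

With real data, only the last two steps are needed: convert the proportion back to counts using the known denominator, then fit with family = binomial. If the ratio at the end is far above 1, a quasibinomial family may be worth trying.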

Related Solutions

Solved – Heteroscedasticity in residuals vs. fitted plot

Your response variable isn't really continuous. It is presumably discrete (you can't buy .5 ounces, and moreover, beers only come in certain ounce sizes). In addition, no one can buy less than 0 ounces (you can clearly see the floor effect in your top--untransformed--residual plot). As a result, using an OLS regression (that assumes normal residuals) is likely to be inappropriate. You should probably try to use Poisson regression. In fact, a zero-inflated Poisson, negative binomial, or zero-inflated negative binomial are more likely what you will end up needing.
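
As a rough sketch of the models mentioned above, the following R code fits a Poisson regression and a negative binomial regression to simulated counts and compares them. The data, the variable names (x, y_count), and the parameter values are hypothetical. Zero-inflated variants need an add-on package (for example pscl, which provides zeroinfl()) and are not shown here.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R installations

# Hypothetical overdispersed count data (illustrative values only).
set.seed(2)
x       <- runif(300)
y_count <- rnbinom(300, mu = exp(0.5 + 1.2 * x), size = 1.5)

# Poisson regression: log link, variance assumed equal to the mean.
fit_pois <- glm(y_count ~ x, family = poisson)

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values well above 1 suggest the Poisson variance assumption is too tight.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion

# Negative binomial regression estimates an extra dispersion parameter (theta).
fit_nb <- glm.nb(y_count ~ x)

# Lower AIC indicates the better-supported model.
AIC(fit_pois, fit_nb)
```

If the dispersion statistic is close to 1, the plain Poisson model is probably adequate. If it is clearly larger, the negative binomial fit is the more natural next step.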

Solved – Best way to deal with heteroscedasticity

It's a good question, but I think it's the wrong question. Your figure makes it clear that you have a more fundamental problem than heteroscedasticity, i.e. your model has a nonlinearity that you haven't accounted for. Many of the potential problems that a model can have (nonlinearity, interactions, outliers, heteroscedasticity, non-Normality) can masquerade as each other. I don't think there's a hard and fast rule, but in general I would suggest dealing with problems in the order

outliers > nonlinearity > heteroscedasticity > non-normality

(e.g., don't worry about nonlinearity before checking whether there are weird observations that are skewing the fit; don't worry about normality before you worry about heteroscedasticity).

In this particular case, I would fit a quadratic model y ~ poly(x,2) (or poly(x,2,raw=TRUE), or y ~ x + I(x^2)) and see if it makes the problem go away.
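
A minimal sketch of that check in R, using simulated data with curvature. The data and the object names (fit_lin, fit_quad, fit_raw) are hypothetical and only meant to illustrate the comparison.

```r
# Hypothetical data with a curved mean function (illustrative only).
set.seed(3)
x <- runif(150, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(150)

fit_lin  <- lm(y ~ x)           # straight-line fit
fit_quad <- lm(y ~ poly(x, 2))  # orthogonal quadratic polynomial

# The raw-polynomial form gives the same fitted values,
# only the coefficients are parameterised differently.
fit_raw <- lm(y ~ x + I(x^2))
all.equal(fitted(fit_quad), fitted(fit_raw))

# F-test for the added quadratic term.
anova(fit_lin, fit_quad)

# Residuals vs. fitted, before and after adding the quadratic term.
par(mfrow = c(1, 2))
plot(fitted(fit_lin), resid(fit_lin), main = "Linear")
abline(h = 0, lty = 2)
plot(fitted(fit_quad), resid(fit_quad), main = "Quadratic")
abline(h = 0, lty = 2)
```

If the curved pattern in the left-hand plot disappears in the right-hand one, the apparent heteroscedasticity was at least partly a symptom of the missing nonlinear term.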
