Solved – Which regression model to use with ordinal & skewed dependent variable

regression

I'm a bit stuck and wondering which regression technique to use.
Respondents were asked to provide a percentage value from 0 to 100. Unfortunately, these percentages are not available in the dataset; they were binned into nine categories (0-8). The categories are meant to represent a performance measure, ranging from 0 ('low performer') to 8 ('highest performer'). Note that the categories cover value ranges of different widths, so they are not equal intervals.
In general, I'm interested in estimating the effect of some ordinal IVs on performance.
According to the descriptive statistics and frequencies, my DV appears to be heavily right-skewed. In particular, one third of all observations (N=400) fall into category 0, whereas the rest appear to be bell-shaped.

Many thanks in advance!

Chris

Best Answer

I think you may be mixing different issues in the same package. First things first. As @whuber has just pointed out, it is not clear whether your DV is really ordinal, or only your IVs. These are very different scenarios. For instance, if your DV is also ordinal, you could (and maybe even should) go for an ordinal logistic regression, as mentioned by @Scortchi. The skewed distribution of such a DV would be less of a problem in that framework.
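To make the ordinal-logit suggestion a bit more concrete, here is a minimal sketch of how a proportional-odds model assigns probabilities to nine ordered categories. The threshold values and the two linear-predictor values are made up for illustration (they are not estimated from any data), and the function names are hypothetical.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(eta, thresholds):
    """P(Y = j) for j = 0..len(thresholds) under a proportional-odds model,
    where P(Y <= j) = logistic(thresholds[j] - eta)."""
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)  # difference of adjacent cumulative probabilities
        previous = c
    return probs

# Assumed values: 8 unevenly spaced cut points give 9 categories (0-8).
thresholds = [-0.7, -0.2, 0.3, 0.9, 1.6, 2.4, 3.3, 4.5]

low = category_probabilities(0.0, thresholds)   # respondent with linear predictor 0
high = category_probabilities(1.5, thresholds)  # respondent with a higher linear predictor

mean_low = sum(j * p for j, p in enumerate(low))
mean_high = sum(j * p for j, p in enumerate(high))

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With these assumed cut points, about one third of the probability mass sits in category 0 when the linear predictor is 0. The model only uses the order of the categories, so uneven category widths and a large share of zeros are absorbed by the thresholds rather than violating an assumption.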

Actually, to be precise, the distribution of even a continuous DV is, by itself, not much of a problem. What you should worry about more is the distribution of the residuals of your model, not of the variables themselves. Looking at the distribution of the variables rather than of the residuals is a common mistake.
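A small simulation can illustrate the point about residuals. The sketch below uses made-up data (a skewed predictor, a true line with intercept 1 and slope 2, and symmetric normal noise; all of these are assumptions for illustration) and compares the skewness of the DV with the skewness of the residuals from a simple least-squares fit.

```python
import random
import statistics

def skewness(values):
    """Third central moment divided by the cubed population standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def ols_fit(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(2000)]       # right-skewed predictor
y = [1.0 + 2.0 * a + rng.gauss(0.0, 0.5) for a in x]  # symmetric noise around the line

intercept, slope = ols_fit(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

dv_skew = skewness(y)
resid_skew = skewness(residuals)
print(f"skewness of DV:        {dv_skew:.2f}")
print(f"skewness of residuals: {resid_skew:.2f}")
```

In this setup the DV inherits the skew of the predictor, while the residuals stay close to symmetric, so a skewed DV alone does not tell you that the model's error assumptions are violated.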

So, if your DV is continuous and the distribution of the residuals is heavily skewed, you should first try to transform your DV (the log transformation being the standard choice, but in some cases you may need to raise the DV to a power or take its square root). If no transformation of the DV works, try also transforming any continuous IV. If you still have no luck, I would go for a robust regression, which tolerates a much less well-behaved distribution of residuals. It is quite useful, for instance, when the residuals have a strongly heavy-tailed distribution, i.e. heavy tails on both sides (a scenario that may be very hard to fix by transforming variables). In R, this can be done very well with the lmrob function.
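As a rough illustration of what robust regression buys you, here is a minimal sketch of a Huber M-estimator fitted by iteratively reweighted least squares. Note that this is a simpler technique than the MM-type estimator that lmrob computes, and that the data, the tuning constant k = 1.345, the contamination scheme and all function names are assumptions chosen for the example.

```python
import random
import statistics

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def huber_line_fit(x, y, k=1.345, max_iter=200, tol=1e-8):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    a, b = weighted_line_fit(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(max_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute deviation, rescaled for normal errors.
        med = statistics.median(resid)
        scale = statistics.median([abs(r - med) for r in resid]) / 0.6745
        # Huber weights: 1 inside the band, k*scale/|r| outside it.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        a_new, b_new = weighted_line_fit(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

rng = random.Random(1)
x, y = [], []
for _ in range(400):
    xi = rng.uniform(0.0, 10.0)
    yi = 1.0 + 2.0 * xi + rng.gauss(0.0, 1.0)  # true slope is 2
    # Shift about half of the extreme-x points to create heavy tails on both sides.
    if xi > 8.0 and rng.random() < 0.5:
        yi += 40.0
    elif xi < 2.0 and rng.random() < 0.5:
        yi -= 40.0
    x.append(xi)
    y.append(yi)

ols_intercept, ols_slope = weighted_line_fit(x, y, [1.0] * len(x))
rob_intercept, rob_slope = huber_line_fit(x, y)
print(f"OLS slope:   {ols_slope:.2f}")
print(f"Huber slope: {rob_slope:.2f}")
```

The least-squares slope is pulled far from the true value by the contaminated points, while the Huber fit down-weights large residuals and stays much closer to it. Estimators such as the one behind lmrob go further by also bounding the influence of high-leverage points.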

But anyway, at the end of your question you mention that one third of the DV observations are equal to zero. This suggests that you may have a zero-inflated scenario and may therefore need models that account for it, such as zero-inflated regression models.
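To show what "zero-inflated" means, here is a minimal sketch of the zero-inflated Poisson probability mass function next to a plain Poisson. Zero-inflated models are usually formulated for counts, so treating the 0-8 category codes as counts is an assumption that may not suit this DV; the parameter values (mean 4, 30% extra zeros) and function names are likewise made up for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of count k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: with probability pi_zero the outcome is a
    structural zero, otherwise it is drawn from Poisson(lam)."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

lam, pi_zero = 4.0, 0.3  # assumed values for illustration

p0_plain = poisson_pmf(0, lam)
p0_zip = zip_pmf(0, lam, pi_zero)
total_zip = sum(zip_pmf(k, lam, pi_zero) for k in range(60))

# Compare the two distributions over the values 0-8.
for k in range(9):
    print(k, round(poisson_pmf(k, lam), 3), round(zip_pmf(k, lam, pi_zero), 3))
```

With these assumed parameters, the plain Poisson puts very little mass on zero, while the zero-inflated version puts roughly a third there and keeps a hump-shaped distribution over the remaining values, which is the pattern described in the question.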

Hope this helps, but any further advice depends on you clarifying your variables and the specification of your models.
