Solved – Ordered logit with (too many?) categorical independent variables

categorical-data, ordered-logit, regression

I am doing some inferential analysis on a set of ordered dependent variables (about 70). Their scales range from 4 to 10 possible (ordered) responses. To give some context, it's all social data (happiness, feelings towards minorities, etc.), and my inferential question is about the social returns to university degrees.

For my binary and continuous dependent variables (not a part of this problem), I have a nice set of controls. Quite a few of these are categorical (did you go to a private high-school? What is your marriage status?), and others are continuous or dummy.

My problem is that when I use the same controls for the ordered variables, I get plenty of empty combinations of outcome level and categorical independent variable. For example, there are no widowers who give themselves a happiness score of 2. As there are many such cases, the R functions polr and lrm don't work (if this is not the case, please let me know).
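To make the empty-cell problem concrete, here is a minimal sketch in Python (the analysis itself is in R) that cross-tabulates one control against one outcome and lists the combinations with no observations. The function name `empty_cells`, the column names, and the data are all made up for illustration.

```python
from collections import Counter
from itertools import product


def empty_cells(rows, outcome, control):
    """Return (control level, outcome level) pairs that have no observations."""
    counts = Counter((r[control], r[outcome]) for r in rows)
    control_levels = sorted({r[control] for r in rows})
    outcome_levels = sorted({r[outcome] for r in rows})
    # A Counter returns 0 for unseen keys, so missing cells show up here.
    return [
        cell
        for cell in product(control_levels, outcome_levels)
        if counts[cell] == 0
    ]


# Made-up toy data: no widowed respondent reports a happiness score of 2.
rows = [
    {"marital": "married", "happy": 1},
    {"marital": "married", "happy": 2},
    {"marital": "married", "happy": 3},
    {"marital": "widowed", "happy": 1},
    {"marital": "widowed", "happy": 3},
]

print(empty_cells(rows, "happy", "marital"))
```

Running the same check for every control and outcome pair would show how widespread the empty cells are before deciding which controls to simplify.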

The choice, then, seems (to me) to be either:

a) reduce the set of controls to eliminate the empty cells, or

b) do it in OLS.

I'm aware that for my categorical data, the assumptions of OLS are not met. But given my task, I'm wondering how much would be lost with either solution.

Best Answer

Using ordinary least squares (OLS) does not solve the problem you are facing. It only assumes it away. If you are using OLS you are implicitly assuming that the different points on your scale are equally spaced. If you are comfortable with this assumption, push the OLS button and try to convince your audience.

I would tackle the problem differently. You have already mentioned one of the solutions. Indeed, it could make sense to recode the control variables and reduce the number of categories. Sparsely populated categories could be merged into other categories. Use your topical knowledge to merge and redefine categories.
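As a rough illustration of merging sparse categories, the sketch below (in Python, although the question uses R) replaces any level observed fewer than `min_count` times with a broader level chosen by the analyst. The helper `merge_sparse`, the threshold, the mapping, and the data are hypothetical; which levels belong together is a substantive decision, not something the code can make.

```python
from collections import Counter


def merge_sparse(values, min_count, mapping):
    """Replace levels seen fewer than min_count times using an analyst-chosen mapping.

    Sparse levels that are not in the mapping are left unchanged.
    """
    counts = Counter(values)
    return [
        mapping.get(v, v) if counts[v] < min_count else v
        for v in values
    ]


# Made-up marital status data with three thinly populated levels.
marital = (
    ["married"] * 5
    + ["single"] * 4
    + ["widowed", "divorced", "separated"]
)

# Hypothetical mapping, chosen on substantive grounds.
mapping = {
    "widowed": "previously married",
    "divorced": "previously married",
    "separated": "previously married",
}

merged = merge_sparse(marital, min_count=3, mapping=mapping)
print(Counter(merged))
```

After merging, the cross-tabulation against each outcome can be checked again for empty cells before refitting the ordered model.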

You can also try to recode the dependent variables. Even on a 10-point scale, responses usually cluster around a few values. Again, guided by topical knowledge, you could redefine the dependent variable with fewer ordered categories.
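A minimal sketch of such a recoding, again in Python for illustration: a 10-point scale is collapsed into four ordered bins. The cut points (1-4, 5-6, 7-8, 9-10) and the function name `collapse_scale` are assumptions made for this example; in practice the bins should follow where responses actually cluster and what the categories mean.

```python
from bisect import bisect_left


def collapse_scale(score, upper_bounds):
    """Return the 1-based index of the first bin whose upper bound is >= score."""
    return bisect_left(upper_bounds, score) + 1


# Hypothetical bins on a 10-point scale: 1-4, 5-6, 7-8, 9-10.
upper_bounds = [4, 6, 8, 10]

scores = [1, 4, 5, 6, 7, 8, 9, 10]
collapsed = [collapse_scale(s, upper_bounds) for s in scores]
print(collapsed)
```

The collapsed variable keeps its ordering, so it can still be used as the outcome of an ordered logit, just with fewer levels and therefore fewer cells that can be empty.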

This topic is not new on Cross Validated. Under the Likert tag you will find plenty of discussions that may be of interest to you.