Methods for modeling discrete dependent variables and categorical independent variables

categorical data, discrete data, modeling

I need to determine whether there are relationships between discrete dependent variables and categorical independent variables. For example, one discrete dependent variable would be salary, with office location as the corresponding categorical independent variable. Another would be days absent from work, with education level as the independent variable.

What would be the best way to model these types of data to determine the relationships between them, either with a statistical test or with a form of regression?

Best Answer

Salary is typically treated as continuous even though there is a smallest unit by which it can be incremented, typically a cent. That increment is so small compared to the amounts involved that it can safely be ignored. The bigger problem is that the distribution of salary is typically rather skewed, so you may want to apply a log transformation or use a log link function. I tend to prefer the latter, see: http://blog.stata.com/2011/08/22
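To make the difference between the two options concrete, here is a minimal sketch for the simplest case of a single categorical predictor. It relies on two standard results: with one dummy per group, a log-link model reproduces the arithmetic group means, so its exponentiated coefficient is a ratio of arithmetic means, whereas regressing log(salary) on the same dummies gives a ratio of geometric means. The salaries, the office names, and the function names are made up for illustration; with more predictors you would fit the models in a statistics package instead.

```python
import math
from statistics import mean

# Hypothetical salaries by office location (invented numbers, illustration only).
salaries = {
    "Amsterdam": [42000, 45000, 51000, 58000, 120000],
    "Berlin": [38000, 40000, 44000, 47000, 95000],
}


def arithmetic_mean_ratio(a, b):
    """exp(coefficient) of a log-link model with one categorical predictor.

    With one dummy per group the fitted means equal the group means,
    so the exponentiated coefficient is a ratio of arithmetic means.
    """
    return mean(a) / mean(b)


def geometric_mean_ratio(a, b):
    """exp(coefficient) of a linear regression of log(salary) on the same dummy.

    The coefficient is a difference in mean log salary, so exponentiating
    it gives a ratio of geometric means.
    """
    log_diff = mean([math.log(x) for x in a]) - mean([math.log(x) for x in b])
    return math.exp(log_diff)


a, b = salaries["Amsterdam"], salaries["Berlin"]
print("log link (ratio of arithmetic means):", arithmetic_mean_ratio(a, b))
print("log transform (ratio of geometric means):", geometric_mean_ratio(a, b))
```

The two ratios generally differ when the groups have different spread or skew, so which one you report depends on whether your audience cares about average salary or about the typical salary on a multiplicative scale.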

Days absent from work is a count, so you can try using count models (the simplest example would be Poisson regression). However, you probably don't have "days absent from work", but "days absent from work in the last week" or "days absent from work in the last year". That puts an upper bound on the count, which, strictly speaking, you should take into account. In practice you can often ignore that too, as the number of days absent is usually far from the number of work days in a year.
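As a rough sketch of what Poisson regression does with a single categorical predictor: in that special case the maximum likelihood fit has a closed form (the fitted mean of each group is its sample mean), so the rate ratios and the likelihood-ratio statistic against the intercept-only model can be computed by hand. The absence counts, the education levels, and the function name `poisson_group_fit` are invented for illustration; this ignores the upper bound on the count and any overdispersion.

```python
import math

# Hypothetical days absent in the last year, by education level (invented data).
absences = {
    "secondary": [3, 5, 8, 2, 7, 6],
    "bachelor": [2, 4, 1, 3, 5, 3],
    "master": [1, 0, 2, 3, 1, 2],
}


def poisson_group_fit(groups, reference):
    """Closed-form Poisson regression on one categorical predictor.

    Returns (rate ratios versus the reference level,
             likelihood-ratio statistic against the intercept-only model,
             degrees of freedom of that test).
    """
    means = {g: sum(y) / len(y) for g, y in groups.items()}
    n_total = sum(len(y) for y in groups.values())
    grand_mean = sum(sum(y) for y in groups.values()) / n_total

    # exp(coefficient) of each level's dummy is a ratio of group means.
    rate_ratios = {
        g: m / means[reference] for g, m in means.items() if g != reference
    }

    # LR = 2 * sum over groups of (group total) * log(group mean / grand mean).
    # Groups with a total of zero contribute nothing to the sum.
    lr = 2 * sum(
        sum(y) * math.log(means[g] / grand_mean)
        for g, y in groups.items()
        if sum(y) > 0
    )
    return rate_ratios, lr, len(groups) - 1


rate_ratios, lr, df = poisson_group_fit(absences, reference="master")
print("rate ratios vs master:", rate_ratios)
print("LR statistic:", lr, "on", df, "degrees of freedom")
```

The likelihood-ratio statistic would be compared with a chi-square distribution on `df` degrees of freedom to test whether the absence rate differs across education levels.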

I know you only gave some examples, but there is a common theme: models are by definition simplifications of reality, so you need to think about potential problems and decide carefully which of them to ignore. Then you can communicate your choice, and the reasons for it, to your audience, and the audience may or may not buy your arguments. Both are good outcomes: if they do, that is good for your ego; if they don't, you have learned something new.

Notice that what you want to do with the results plays a key role in these choices, so there can be no answer of the kind "Model X is best".