Solved – How to predict a categorical variable with another categorical variable

categorical data, regression

I have a data set in the form:

df
  occupation  class
1     lawyer  upper
2     doctor  upper
3 unemployed middle
4    plumber  lower
5 unemployed  upper

The first variable occupation has only 8 values that it can take, and class can only take 3. I am trying to predict the class variable based on the occupation.

I have some idea of how to predict a continuous dependent variable from continuous independent variables (for example, linear regression), and of how to use logistic regression for a categorical dependent variable. But what if both sides of the equation are categorical? Does it even involve regression, or is there a simpler set of methods for modeling class ~ occupation?

And the opposite situation: categorical_data ~ binary_dummy_variable. If X is a binary variable, how can I predict a categorical variable from that dummy variable?

I'm thinking that I would need to turn occupation into dummy variables and explore the relationship that way, especially since the variable has no natural ordering. Perhaps I could collapse occupation into a new binary variable, "professional" vs. "non-professional". But I still wouldn't know how to relate the new binary variable to the class variable.

Best Answer

The easiest way is to break your data down into eight groups, one per occupation. You can do this by writing an occupation vector $x=(v_1,v_2,\cdots,v_7)$, where $v_i=1$ if the observation is in category $i$ and 0 otherwise. You need to set a reference category corresponding to $x=(0,0,\cdots,0)$, say "lawyer". Then a logistic regression should do the job; since class has three levels rather than two, that means the multinomial form (or an ordinal one, if you treat lower < middle < upper as ordered). You have a linear predictor $\eta=X\beta$, which the inverse logit (logistic) function maps to a probability. The fitted coefficients then show how each occupation shifts the class probabilities relative to the reference category.
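
As a rough illustration of the coding scheme described above, here is a minimal Python sketch using only the five rows shown in the question. The names `rows`, `REFERENCE`, `levels`, `dummy_code`, and `class_probabilities` are made up for this example and are not part of any library. Because only four of the eight occupations appear in those rows, the sketch produces three dummy columns instead of seven. It also tabulates the observed share of each class within each occupation. With a single categorical predictor coded as a full set of dummies, a multinomial logistic fit should essentially reproduce these shares, so the table is a useful sanity check on whatever model you fit.

```python
from collections import Counter, defaultdict

# Toy rows copied from the question's data frame: (occupation, class).
rows = [
    ("lawyer", "upper"),
    ("doctor", "upper"),
    ("unemployed", "middle"),
    ("plumber", "lower"),
    ("unemployed", "upper"),
]

# Assumption for this sketch: "lawyer" is the reference category,
# so it is coded as the all-zero vector.
REFERENCE = "lawyer"

# Every non-reference occupation seen in the data gets one dummy column.
levels = sorted({occ for occ, _ in rows if occ != REFERENCE})


def dummy_code(occupation):
    """Return the 0/1 indicator vector for one occupation."""
    return [1 if occupation == level else 0 for level in levels]


# Count how often each class occurs within each occupation.
counts = defaultdict(Counter)
for occ, cls in rows:
    counts[occ][cls] += 1


def class_probabilities(occupation):
    """Observed share of each class among rows with this occupation."""
    c = counts[occupation]
    total = sum(c.values())
    return {cls: n / total for cls, n in c.items()}


for occ in [REFERENCE] + levels:
    print(occ, dummy_code(occ), class_probabilities(occ))
```

On the full data set the same code would yield seven dummy columns, matching the vector $(v_1,\cdots,v_7)$ in the answer. Which category serves as the reference only changes how the coefficients are read, not the predicted probabilities.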