I have a data set in the form:
df
  occupation  class
1 lawyer      upper
2 doctor      upper
3 unemployed  middle
4 plumber     lower
5 unemployed  upper
The first variable, occupation, can take only 8 values, and class can take only 3. I am trying to predict the class variable based on the occupation.
I have an idea about predicting a continuous dependent variable from continuous independent variables (for example, linear regression), and about logistic regression for categorical dependent variables. But what about both sides of the equation being categorical? Does it even involve regression, or is there a simpler set of methods to fit class ~ occupation?
And the opposite situation: categorical_data ~ binary_dummy_variable. If X is a binary variable, how can I predict a categorical variable from that dummy variable?
I'm thinking that I would need to turn occupation into a dummy variable and explore the relationship that way, especially since the variable has no natural ordering. Perhaps I could turn occupation into a new binary variable with the values "professional" and "non-professional". But I still wouldn't know how to compare the new binary variable to the class variable.
Best Answer
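The easiest way is to break your data down into eight groups, one per occupation. You can do this by writing an occupation vector $x=(v_1,v_2,\cdots,v_7)$, where $v_i=1$ if the observation is in category $i$ and 0 otherwise. The eighth occupation is the reference category, corresponding to $x=(0,0,0,\cdots,0)$, say "lawyer". Then a logistic regression should do the job; because class has three levels, this means a multinomial (or, if you treat class as ordered, an ordinal) logistic regression rather than a binary one. The model has a linear predictor $X\beta$, which the inverse of the logit link maps to probabilities. The estimated coefficients then show which occupations shift the class probabilities most, relative to the reference category.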
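
As a minimal sketch of the dummy coding described above, the Python snippet below encodes occupation against the reference category "lawyer" and cross-tabulates class within each occupation. It uses only the five rows shown in the question, so just four of the eight occupations appear; the helper names (`dummy_code`, `class_probabilities`, `predict`) are made up for illustration. With occupation as the only predictor, a fully dummy-coded multinomial logistic regression in effect reproduces these within-occupation proportions, so the table is a quick way to see what such a model would predict.

```python
from collections import Counter, defaultdict

# The five rows shown in the question (the full data set is not given).
rows = [
    ("lawyer", "upper"),
    ("doctor", "upper"),
    ("unemployed", "middle"),
    ("plumber", "lower"),
    ("unemployed", "upper"),
]

# Dummy coding with "lawyer" as the reference category: one 0/1 indicator
# per non-reference level, so the reference maps to the all-zero vector.
reference = "lawyer"
levels = sorted({occ for occ, _ in rows} - {reference})


def dummy_code(occupation):
    """Return the 0/1 indicator vector for one occupation."""
    return [1 if occupation == level else 0 for level in levels]


# Cross-tabulate class counts within each occupation.
counts = defaultdict(Counter)
for occ, cls in rows:
    counts[occ][cls] += 1


def class_probabilities(occupation):
    """Estimate P(class | occupation) as within-occupation proportions."""
    c = counts[occupation]
    total = sum(c.values())
    return {cls: n / total for cls, n in c.items()}


def predict(occupation):
    """Return the most frequent class observed for the occupation."""
    return counts[occupation].most_common(1)[0][0]


print(levels)
print(dummy_code("lawyer"), dummy_code("plumber"))
print(class_probabilities("unemployed"))
print(predict("plumber"))
```

On real data, the same proportions could be paired with a chi-squared test of independence on the occupation-by-class table to check whether occupation is associated with class at all; a regression model becomes more useful once further predictors are added.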