Solved – How to fix dummy variables when calculating predicted probabilities in logistic regression

categorical data, logistic, logit, probability

My question is about predicted probabilities in logistic regression.

Let me give an example. I analyze the relationship between marriage (1: married, 0: single) as the dependent variable and sex (1: male, 0: female), education level (four categories coded as dummy variables, EDU1: junior high school, EDU2: high school, EDU3: skilled school, EDU4: university), and age (a continuous variable) as independent variables. I want to know how age affects marriage. The regression formula, with EDU4 omitted as the reference category, is:

$\ln\frac{P(\text{married})}{P(\text{single})} = \alpha + \beta_1(\text{sex}) + \beta_2(\text{EDU1}) + \beta_3(\text{EDU2}) + \beta_4(\text{EDU3}) + \beta_5(\text{age}) + \epsilon$

I understand that if the independent variables are all continuous, there is no problem in fixing them at their means. But I'm not sure what values to fix them at when the model includes categorical/dummy variables. I think there are two ways to calculate the predicted probability for this model, and I'd like to ask which one is better.

  1. Fix each variable at its mean and calculate the predicted probability of marriage by age. If the mean of sex is 0.46, then set sex to 0.46.

  2. Fix each variable at its mode and calculate the predicted probability of marriage by age. A mean of 0.46 for sex means that more than half of the cases are female, so set sex to 0.
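The two options can be compared side by side in a minimal sketch. Everything numeric here is an assumption made for illustration: the coefficients in `COEF` are invented rather than estimated, and the sample proportions for the education dummies are made up (only the 0.46 for sex comes from the question). The helper name `predicted_probability` is likewise hypothetical.

```python
import math

# Hypothetical coefficients (NOT estimated from any data), for illustration only.
# University (EDU4) is the omitted reference category, as in the formula above.
COEF = {"const": -4.0, "sex": 0.3, "EDU1": -0.5, "EDU2": -0.2, "EDU3": 0.1, "age": 0.12}


def predicted_probability(profile):
    """Inverse logit of the linear predictor for one covariate profile."""
    eta = COEF["const"] + sum(COEF[name] * value for name, value in profile.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Option 1: dummies fixed at their sample means (proportions).
# Only 0.46 for sex comes from the question; the EDU proportions are made up.
at_means = {"sex": 0.46, "EDU1": 0.20, "EDU2": 0.40, "EDU3": 0.15}

# Option 2: dummies fixed at their modes (here assumed: female, high school).
at_modes = {"sex": 0, "EDU1": 0, "EDU2": 1, "EDU3": 0}

for age in (20, 30, 40):
    p_mean = predicted_probability({**at_means, "age": age})
    p_mode = predicted_probability({**at_modes, "age": age})
    print(f"age={age}: at means={p_mean:.3f}, at modes={p_mode:.3f}")
```

Both options trace out a curve of the same logistic shape in age; they differ only in where the fixed covariates place the curve along the log-odds axis.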

If I take the first way, the predicted probability isn't based on a real person (the profile is neither male nor female). However, I get the predicted probability for an average person, even though no such person exists. With the second way, on the other hand, the predicted probability is based on a real kind of person, such as a female university graduate, but it tells me nothing about other people. Of course, when there are only a few categorical/dummy variables, as in this example model, it's possible to calculate the probabilities for all combinations. But the models in my field (demography) normally have many categorical/dummy variables, so it's very difficult to show all combinations.

Which way is better?
Thanks!

Best Answer

If your variable of interest is age, I'm not quite sure why you're worried about how you set your dummy/categorical variables.

The mean is better when speaking about people in general. It's the same as with a variable like age: no one has precisely the mean age, but you're talking about the most representative value, which is the average.

If you're really concerned about the different probabilities of being married for different people, you can individually represent the types of people you're interested in, for example a female high school graduate.
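As a rough sketch of what representing specific types of people could look like, the snippet below evaluates the model for every sex-by-education profile at one chosen age. The coefficients are hypothetical (not estimated from data), university is treated as the reference category, and the names `prob_married` and `profiles` are made up for this example.

```python
import math
from itertools import product

# Hypothetical coefficients, for illustration only.
# University (EDU4) is the reference category, so its coefficient is 0.
B0, B_SEX, B_AGE = -4.0, 0.3, 0.12
B_EDU = {"junior high": -0.5, "high school": -0.2, "skilled school": 0.1, "university": 0.0}


def prob_married(sex, edu, age):
    """Predicted probability of being married for one profile."""
    eta = B0 + B_SEX * sex + B_EDU[edu] + B_AGE * age
    return 1.0 / (1.0 + math.exp(-eta))


AGE = 30  # age at which the profiles are compared (an arbitrary choice)

# One entry per (sex, education) combination.
profiles = {
    ("male" if sex else "female", edu): prob_married(sex, edu, AGE)
    for sex, edu in product((0, 1), B_EDU)
}

for (sex_label, edu), p in profiles.items():
    print(f"{sex_label:6s} {edu:15s} P(married | age={AGE}) = {p:.3f}")
```

With many categorical variables the full grid grows quickly, so in practice one would probably list only a handful of profiles of substantive interest, such as the female high school graduate mentioned above.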

However, unless you've included interaction terms, the effect of age on marriage (on the log-odds scale) is by construction independent of the other variables in the model as currently specified.
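A small numeric check can make this concrete. In the sketch below, which again uses hypothetical coefficients and made-up function names, one extra year of age changes the log-odds by the same amount for two different profiles, while the corresponding change in probability still depends on where each profile starts on the logistic curve.

```python
import math

# Hypothetical coefficients, for illustration only (no interaction terms).
B0, B_SEX, B_EDU2, B_AGE = -4.0, 0.3, -0.2, 0.12


def log_odds(sex, edu2, age):
    """Linear predictor: log-odds of being married."""
    return B0 + B_SEX * sex + B_EDU2 * edu2 + B_AGE * age


def prob(eta):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


results = []
for label, sex, edu2 in [("female, high school", 0, 1), ("male, university", 1, 0)]:
    eta_30 = log_odds(sex, edu2, 30)
    eta_31 = log_odds(sex, edu2, 31)
    d_eta = eta_31 - eta_30            # change in log-odds per year of age
    d_p = prob(eta_31) - prob(eta_30)  # change in probability per year of age
    results.append((label, d_eta, d_p))
    print(f"{label}: change in log-odds = {d_eta:.3f}, change in probability = {d_p:.4f}")
```

The log-odds change equals the age coefficient for both profiles, which is the sense in which the age effect does not depend on the other covariates.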