Solved – Logistic regression results (coefficients) counterintuitive

Tags: logistic model, regression, regression coefficients

I ran a logistic regression model in SPSS with a yes/no dependent variable (whether the respondent chose the bus or a personal vehicle) and 5 independent variables (Waiting Time, Trip Time, Total Daily Expense, Overall Mode Comfort, and Overall Mode Ease-of-use). The Omnibus and Hosmer-Lemeshow tests suggest the model fits well, and the significance of most variables is adequate, but the estimated coefficients of some variables seem off. This affects the probability estimates: some predictors push the prediction in the opposite direction from what intuition suggests in real-life conditions.
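For reference, this is roughly what the model looks like when reproduced outside SPSS in Python (statsmodels); the file and column names below are just placeholders, not my actual data:

    import pandas as pd
    import statsmodels.api as sm

    # Placeholder file and column names, only to show the model structure.
    df = pd.read_csv("mode_choice.csv")
    X = sm.add_constant(df[["waiting_time", "trip_time", "daily_expense",
                            "comfort", "ease_of_use"]])  # intercept + 5 predictors
    y = df["chose_bus"]                                   # 1 = bus, 0 = personal vehicle

    model = sm.Logit(y, X).fit()
    print(model.summary())  # coefficients, Wald z-tests, log-likelihood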

For example, the Comfort variable has a coefficient of -0.102821; this translates to a lower predicted probability of choosing the bus when the Comfort value is high. Who wouldn't choose the bus when Comfort is over the top? I would expect the coefficient to be positive instead of negative. I should probably also point out that the intercept is negative; I'm not sure how much this affects the model.
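In odds-ratio terms (a standard way to read a logistic coefficient, not specific to SPSS), the size of that effect is:

    import numpy as np

    beta_comfort = -0.102821
    print(np.exp(beta_comfort))  # about 0.902

    # Each one-point increase in Comfort multiplies the odds of choosing the
    # bus by roughly 0.90, holding the other predictors constant. A modest
    # effect, but in the opposite direction from what I expected.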

So what seems to be the problem with my model?

Best Answer

If the beta parameter estimate is statistically significant, then the issue is probably not multicollinearity among the variables in your model.

Instead, the issue is likely omitted variable bias. This means there is a variable your model does not control for that is correlated with both the comfort variable and the response variable. The effect of this omitted variable is absorbed by the comfort coefficient.

As a theoretical example, the most comfortable buses might run in the wealthier areas, and people in wealthier areas may be more likely to use their personal vehicle.

Because you did not control for the driver's wealth (the variable was omitted), and wealth is plausibly correlated with access to the more comfortable buses, the comfort coefficient could be negatively biased (potentially so much so as to flip the sign of the parameter estimate).
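Here is a small simulation sketch of that mechanism (Python/statsmodels, with made-up numbers chosen purely for illustration, not your data): comfort genuinely raises the odds of choosing the bus, but leaving out a wealth variable that is positively correlated with comfort and negatively associated with bus choice flips the estimated sign.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000
    wealth = rng.normal(size=n)
    comfort = 0.8 * wealth + rng.normal(size=n)   # comfortable buses in wealthy areas
    log_odds = 0.5 * comfort - 2.0 * wealth       # true comfort effect is +0.5
    chose_bus = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

    # Omitting wealth: the comfort coefficient comes out negative.
    omitted = sm.Logit(chose_bus, sm.add_constant(comfort)).fit(disp=0)
    # Controlling for wealth: the comfort coefficient is back near +0.5.
    full = sm.Logit(chose_bus, sm.add_constant(
        np.column_stack([comfort, wealth]))).fit(disp=0)

    print(omitted.params)  # sign of comfort flips to negative
    print(full.params)     # comfort close to the true +0.5

Adding the confounder back into the model restores the expected sign, which is exactly the point of controlling for it.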

Remember that when you interpret a beta parameter as holding everything else constant, you really mean that you are holding all of the variables in your model constant. Any variable not in your model that is correlated with a variable in your model is not held constant.

Omitted variable bias violates a fundamental assumption of regression models (including logistic regression) and leads to biased parameter estimates. Because of this, you should include every variable you believe belongs in the model. Unfortunately, there is no good way to detect an omitted variable from the model alone; you must use your own experience and judgment to work out what might be missing.

As a side note, if your goal is just to estimate the effect of the comfort variable, you don't need to include every variable you might have omitted. You only need to include the variables that are correlated with both the comfort variable and the response variable (the confounders).