Solved – I am running a logistic regression model and get very low predicted probabilities

estimation, goodness of fit, logistic, prediction, residuals

I am running a logistic model for catastrophic health expenditure (CHE) in Argentina. The sample size is 22,500. I followed the methodology of Xu et al. to define CHE and adjusted for 8 socioeconomic variables. They are all significant.
Then I assessed the goodness of fit by looking at the H-L statistic (p-value>0.05) and the test indicates that the model fits the data well.
Then, to assess the discrimination of the fitted model I estimated the area under the ROC curve, which is 78.1%.
So far the model looks good.
Unfortunately when looking at predicted probabilities and residuals I realized that the predicted probabilities are very low. They range between .0001645 and .3187172. Therefore I have very large residuals and leverage. Of course this also translates into a large number of influential observations.

Any suggestions why this could be happening?
I have some ideas but I am not sure they are the right explanation:
Is it possible that the proportion of households with CHE==1 is very low (3%) in the sample and this could be affecting the results?

Can I still use this model as a descriptive model of CHE in Argentina?

Thank you!
Mercedes

Best Answer

An unbalanced data set (as you suggest when you say that households with CHE = 1 make up 3% of your data) will influence the results of a logistic regression. See the answers to this question for instance.
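To see why a rare outcome leads to low predicted probabilities, recall that in a logistic regression fitted by maximum likelihood with an intercept, the fitted probabilities average to the observed event rate, so with about 3% of households at CHE = 1 most predictions have to be small. The sketch below is only an illustration under stated assumptions: a single binary risk factor with a hypothetical odds ratio of 2, and two hypothetical baseline rates (2% and 30%). None of these numbers come from the CHE data.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical effect: a risk factor that doubles the odds of the outcome.
BETA = math.log(2.0)


def predicted_probabilities(baseline_prevalence):
    """Return (p_unexposed, p_exposed) for a one-covariate logistic model."""
    alpha = logit(baseline_prevalence)  # the intercept carries the base rate
    return inv_logit(alpha), inv_logit(alpha + BETA)


def odds_ratio(p_unexposed, p_exposed):
    """Odds ratio implied by two predicted probabilities."""
    return (p_exposed / (1.0 - p_exposed)) / (p_unexposed / (1.0 - p_unexposed))


rare = predicted_probabilities(0.02)    # rare outcome (assumed base rate)
common = predicted_probabilities(0.30)  # common outcome, for comparison

print("rare:  ", rare, "odds ratio:", odds_ratio(*rare))
print("common:", common, "odds ratio:", odds_ratio(*common))
```

In both cases the odds ratio is 2 (up to floating-point error), but with the rare outcome both predicted probabilities stay below 0.05. The covariate effect is the same; only the intercept, and with it the overall level of the predictions, differs.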

In short, the class imbalance in your data set affects only the intercept term. If you are using your model to say something about your 8 socioeconomic variables, you should be fine. If it is for prediction, then you could consider lowering your probability threshold below 0.5.
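As a rough illustration of the threshold point: when the largest predicted probability is about 0.32, a cutoff of 0.5 classifies every household as CHE = 0. The sketch below uses made-up probabilities and outcomes (only the minimum and maximum are taken from the range reported in the question), and the 0.03 cutoff is an assumption chosen to match the 3% prevalence, not a recommended value.

```python
# Hypothetical predicted probabilities and outcomes, for illustration only.
# The smallest and largest values match the range reported in the question.
probs = [0.0001645, 0.004, 0.01, 0.02, 0.035, 0.05, 0.08, 0.15, 0.25, 0.3187172]
outcomes = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]


def sensitivity_specificity(probs, outcomes, threshold):
    """Classify as 1 when p >= threshold; return (sensitivity, specificity)."""
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 0)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


for cutoff in (0.5, 0.03):
    sens, spec = sensitivity_specificity(probs, outcomes, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With the 0.5 cutoff no case is detected (sensitivity 0), while the lower cutoff picks up all three made-up cases at the cost of some false positives. In practice the cutoff would be chosen from the ROC curve or from the relative costs of the two kinds of error.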