Regression – How to Interpret Higher R-squared Value in Logistic Regression with Fewer Variables

I am testing three modeling approaches for malnutrition in children. Theoretically, distal determinants (education, poverty) operate through proximal determinants (water, sanitation) in determining malnutrition rates. The three logistic models, where stunting is a binary indicator for malnutrition, are:

// Proximal determinants only: both binary indicators
stunting ~ water + sanitation

// Distal determinants only: both categorical indicators
stunting ~ i.education + i.poverty

// Both proximal and distal determinants
stunting ~ water + sanitation + i.education + i.poverty

I am surprised to find that the R-squared value of the second model is higher than that of the third model. I calculate it in Stata from the correlation between the predicted and actual values:

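// Run after each logit fit: xb gives the linear prediction (log-odds), not the probability
predict predicted, xb
// Correlation between the linear prediction and the observed outcome
corr predicted stunting
// Square the correlation to obtain the R-squared measure
local rsq = r(rho)^2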

While I expected the strength of the relationship and statistical significance of the more proximal causes to decrease (as they were soaked up by the distal causes), I expected the combined model to have higher explanatory power (as measured by r-squared). Does anyone have any explanation as to why the second model has the most explanatory power? Let me know if I can provide additional information for answering this question.

Best Answer

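Be careful about relying on R^2 alone to assess fit in a non-linear model such as logistic regression. The correlation-based measure in the question is not the quantity that maximum likelihood optimizes, so nothing guarantees that it rises when variables are added. The log-likelihood is a safer basis for comparison: for nested models fitted to the same observations, it cannot decrease when variables are added.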
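A log-likelihood comparison of the three models could be sketched in Stata roughly as follows. The variable names come from the question; the indicator `touse` and the stored-estimate names `proximal`, `distal`, and `combined` are placeholders chosen for this illustration. Restricting every fit to the same complete-case sample is an assumption made here so that the likelihoods are comparable.

```stata
// Keep only observations with no missing values in any model variable
generate byte touse = !missing(stunting, water, sanitation, education, poverty)

// Fit and store each model on the common sample
logit stunting water sanitation if touse
estimates store proximal

logit stunting i.education i.poverty if touse
estimates store distal

logit stunting water sanitation i.education i.poverty if touse
estimates store combined

// Side-by-side table of N, log-likelihood, AIC, and BIC
estimates stats proximal distal combined

// Likelihood-ratio tests of each restricted model against the combined model
lrtest proximal combined
lrtest distal combined
```

If the combined model's log-likelihood is at least as high as that of the distal-only model here, the lower correlation-based R^2 is more likely a property of that measure than a sign of worse fit.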

That said, a decrease in R^2 when variables are added generally means the added variables are interacting with the existing ones in a way that does not provide additional explanation. One cause may be, as you point out, intervening variables in the model. If so, you may need to find an instrumental variable or use a structural model.
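As one possible illustration of the structural-model route, Stata's `gsem` can treat the proximal determinants as outcomes of the distal ones and stunting as an outcome of both. The path structure below is an assumption drawn from the theory described in the question, not a tested specification, and it makes no claim about identification.

```stata
// Assumed paths: distal -> proximal -> stunting, plus direct distal effects on stunting
gsem (water      <- i.education i.poverty, logit) ///
     (sanitation <- i.education i.poverty, logit) ///
     (stunting   <- water sanitation i.education i.poverty, logit)
```

Comparing the education and poverty coefficients in the stunting equation with those from the distal-only model gives a rough sense of how much of their association runs through water and sanitation.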
