Solved – Logistic Regression and how to judge model fit and parameter influence

Tags: logistic, multicollinearity, regression

I want to show a statistically significant impact of about 15 different independent variables on a binary dependent variable (I am not a statistician). Some of my independent variables are term counts in text, so they are likely correlated with the length of the text, which is also one of the independent variables. I selected this set of variables because each seemed to impact the outcome at least individually. I performed matching on the length variable to try to control for this effect (length alone explains a lot of the variance in my outcome/dependent variable).

Now I want to figure out which variables still matter in the presence of the other variables and what their relative importance is.

Question 1: What is the best way to deal with the possible collinearity of my independent variables?

Variant 1: Replace term_counts with residuals after regressing them on length.
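Variant 1 can be sketched as a simple OLS residualization; `term_counts` and `length` below are hypothetical stand-ins for the actual variables, and this assumes a plain NumPy setup:

```python
import numpy as np

def residualize(term_counts, length):
    """Replace a term-count feature with its residual after regressing
    it on text length (ordinary least squares with an intercept)."""
    x = np.column_stack([np.ones_like(length, dtype=float), length])
    beta, *_ = np.linalg.lstsq(x, term_counts.astype(float), rcond=None)
    return term_counts - x @ beta  # residuals are orthogonal to length

# Toy data: counts that grow roughly with length
length = np.array([10, 20, 30, 40, 50])
counts = np.array([2, 5, 5, 9, 11])
resid = residualize(counts, length)
```

By construction the residuals have zero sample correlation with `length`, so the length effect no longer leaks into the term-count feature.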

Variant 2: First, fit a logistic regression model on all variables. Treat all factors whose parameters are significantly different from zero as significant factors (this is a claim I would like to make).
Then fit models where I remove a single variable and show that the fit gets significantly worse, to demonstrate that every single variable matters and by how much (how to do this exactly leads to Question 2).

Question 2: What is a reasonable and standard way of assessing superior fit of the different logistic regression models?
(also see Assessing logistic regression models)

What measure should I use to compare fit? Raw likelihood doesn't really work, since it always improves for more complex models. Classification accuracy seems uninformative, since our empirical estimates for p(Y=1) are only 20–40%.
We were also thinking about plotting p_empirical versus p_model after binning multiple observations to get a clearer picture. I am sure people have done this before, and I would be very happy if you could point me to how to visualize the fit/performance.
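The binned p_empirical-versus-p_model idea is essentially a calibration (reliability) plot. A minimal sketch, with simulated data standing in for real model predictions:

```python
import numpy as np

def binned_calibration(p_model, y, n_bins=10):
    """Sort observations by predicted probability, split them into
    equal-count bins, and compare the mean prediction with the
    empirical event rate in each bin (a reliability-plot sketch)."""
    order = np.argsort(p_model)
    bins = np.array_split(order, n_bins)
    p_hat = np.array([p_model[b].mean() for b in bins])
    p_emp = np.array([y[b].mean() for b in bins])
    return p_hat, p_emp  # plot p_emp against p_hat; points near the
                         # identity line indicate a well-calibrated model

# Simulated, well-calibrated predictions in the 5-60% range
rng = np.random.default_rng(0)
p_model = rng.uniform(0.05, 0.6, size=2000)
y = (rng.uniform(size=2000) < p_model).astype(int)
p_hat, p_emp = binned_calibration(p_model, y)
```

Plotting `p_emp` against `p_hat` (with the identity line for reference) gives exactly the picture described above; large gaps in particular bins show where the model's probabilities are off.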

Question 3: What is the best way to normalize the features so that I can compare the model parameters (i.e., how much more important is one feature than another)?
Is it reasonable to scale everything between zero and one? Should it be "minus mean, divide by std"? I have some ordinal, some binary, and a few continuous independent variables.
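One common convention is to standardize continuous and ordinal columns to mean zero and unit standard deviation while leaving binary 0/1 indicators alone, so each coefficient reads as the effect of a one-standard-deviation change (or of flipping the indicator). A sketch, assuming a plain NumPy feature matrix:

```python
import numpy as np

def standardize(X):
    """Center and scale each non-binary column to mean 0, std 1.
    Binary 0/1 columns are kept on their natural scale so their
    coefficient stays 'the effect of flipping the indicator'."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        if set(np.unique(col)) <= {0.0, 1.0}:
            continue  # leave binary indicators untouched
        X[:, j] = (col - col.mean()) / col.std()
    return X

# Column 0 is binary, column 1 is continuous
X = np.array([[1, 10], [0, 20], [1, 30], [0, 40]], dtype=float)
Z = standardize(X)
```

Scaling to [0, 1] instead would make coefficients "effect of moving from the minimum to the maximum", which is sensitive to outliers; the per-standard-deviation convention is usually more robust for comparing feature influence.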

Thanks!

Best Answer

As far as figuring out which variables matter in the presence of the other variables, you might consider the Wald test or the likelihood-ratio test.
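For the likelihood-ratio test, you fit the model with and without the variable of interest and refer twice the log-likelihood gap to a chi-squared distribution (e.g., statsmodels' fitted `Logit` results expose the log-likelihood as `llf`). A minimal stdlib sketch for dropping a single variable, with hypothetical log-likelihood values:

```python
import math

def likelihood_ratio_test(ll_full, ll_reduced):
    """Likelihood-ratio test for one dropped variable: twice the
    log-likelihood gap, referred to a chi-squared distribution with
    1 degree of freedom (closed-form survival function via erfc)."""
    stat = 2.0 * (ll_full - ll_reduced)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # P(chi2_1 > stat)
    return stat, p_value

# Hypothetical log-likelihoods of the full and one-variable-dropped models
stat, p = likelihood_ratio_test(-120.0, -125.0)  # stat == 10.0
```

A small p-value here means the dropped variable significantly improves the fit, which directly answers the "remove a single variable and show the fit gets worse" part of Variant 2; dropping k > 1 variables at once would need a chi-squared with k degrees of freedom instead.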