Solved – Improving Logistic Regression model’s summary output

generalized linear model, logistic, r, regression

Logistic regression using R's glm function is giving me the following summary (snapshot of the first few variables).

My Data Set:

  • Dimensions: 1252 rows and 224 columns (after using model.matrix). The Data has been standardized.
  • Response variable is binary.
  • Trying to predict if an employee will leave the company, based on employee attributes

[Screenshot of the glm summary output]

My Understanding:

The model does not give a good fit because:

  1. Residual Deviance > Null Deviance.
  2. p.value = 1 - pchisq(3676.5, 817) turns out to be 0.
  3. The first warning, about fitted probabilities of 0 or 1, suggests that some predictor(s) may be separating the classes perfectly, so the model gives perfect predictions for some observations.
  4. The second warning, about rank deficiency, suggests that some predictors are linearly dependent on one another. (A sketch of these checks in R follows this list.)
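
For reference, here is a minimal R sketch of these checks. The data frame (emp), the formula, and the response name (left_company) are hypothetical stand-ins; fit stands for the fitted glm object behind the summary above.

    # Hypothetical fit; 'emp' and 'left_company' stand in for the actual data
    fit <- glm(left_company ~ ., data = emp, family = binomial)

    # 1-2. Compare deviances and compute the goodness-of-fit p-value
    c(null = fit$null.deviance, residual = fit$deviance)
    pchisq(fit$deviance, df = fit$df.residual, lower.tail = FALSE)

    # 3. Symptom of (quasi-)separation: fitted probabilities at or very near 0/1
    range(fitted(fit))

    # 4. Rank deficiency: aliased (linearly dependent) columns get NA coefficients
    which(summary(fit)$aliased)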

My Questions:

  1. How can I improve the model? I would like to see Residual Deviance < Null Deviance. I will invest time in dropping the linearly dependent variables, but is there anything I should do first to test the model itself, before revisiting my data? I am asking because SVM worked quite well on the same data set.
  2. Why do I have such extreme coefficient values?
  3. Many answers to other posts state that 'AIC' is used to compare different logistic models. What is meant by 'different' here? Models trained on different data sets, or models with different sets of attributes (and hence different coefficients)?
  4. The summary parameters (coefficient, std. error, and p-value) for many dummy factors obtained via model.matrix, such as GSS_SEXM, are shown as 'NA'. Why is that?

Best Answer

As these data are based on employee records, you presumably have data on the time to quitting (length of employment), not just the fact of having quit. If so, this would be better modeled with survival analysis. Predicting the length of employment would seem to be of considerable value to the company.

The dependent variable is then the (continuous) length of employment, with those who haven't quit yet treated as "censored" observations. (We all do, eventually, end up leaving employment.)
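
As a minimal sketch of that framing with the survival package (the column names tenure_months, quit, and the predictors are hypothetical; quit = 1 marks someone who left, quit = 0 is censored):

    library(survival)

    # Cox proportional hazards model on (hypothetical) time-to-quitting data
    cox_fit <- coxph(Surv(tenure_months, quit) ~ salary_band + department + age,
                     data = emp)
    summary(cox_fit)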

Whether you model this as logistic or survival, you should carefully limit the number of variables under consideration or use a penalized method like LASSO or elastic net. The rule of thumb to avoid overfitting if you are not using a penalized method is to consider no more than one variable per 15 events. That would be the number who quit or otherwise left employment for survival analysis, or the smaller of those who quit/didn't quit for logistic (which, the more I think on it, seems less and less useful here). And in terms of the number of variables, each categorical variable counts as one less than the total number of categories (that's how many columns it contributes to the model matrix).

To make this concrete, say that 600 out of the 1252 cases represented people who left employment with the company. If you intend to do standard survival analysis, this rule of thumb means that you should enter no more than about 600/15=40 variables (columns of a model matrix) into your analysis, not the full model matrix with 224 columns. If only 300 people in your data set left employment, only 20 variables should be considered in standard survival analysis. The particular variables might best be selected based on your knowledge of the subject matter, or multiple correlated predictors might be combined into single predictors. If you need to evaluate more predictors than warranted by this rule of thumb you should use a penalized method.
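
If you do end up needing a penalized method, something along these lines is a reasonable starting point with glmnet. This is only a sketch, assuming X is your 1252 x 224 model matrix and y the binary response; glmnet also accepts family = "cox" with a Surv() response if you take the survival route.

    library(glmnet)

    # Cross-validated LASSO-penalized logistic regression (alpha = 1);
    # use 0 < alpha < 1 for elastic net
    cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

    # Coefficients at the cross-validation-selected lambda;
    # predictors shrunk to exactly zero are effectively dropped
    coef(cv_fit, s = "lambda.min")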