Solved – I think the logistic model is overfitted even with LASSO? R gives me a perfect separation warning message

Tags: logistic, r, regression, separation

I have a survey with 25 variables (160 observations each). Most are categorical, some are continuous. The variable of interest is binary. If I do:

model <- glm(Y~., family=binomial(link='logit'), data=...)

I get a warning message:

1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

I'm pretty sure there's no perfect separation, so the problem is probably overfitting, right?
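
For reference, here is a minimal toy example (not my data) where perfect separation produces this warning:

x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)          # y switches from 0 to 1 exactly where x crosses 3.5
glm(y ~ x, family = binomial)     # warns: fitted probabilities numerically 0 or 1 occurred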

  1. I tried two things. First, I avoided every variable that gave me that warning message, which left me with 7 variables. AIC-based selection then got it down to three:

    model_noerror <- glm(Y ~ sex + race + shoes, family = binomial(link = 'logit'), data = ...)
    

    Obviously I shouldn't just be dropping variables that cause a separation problem. When I looked it up, people suggested the issue was overfitting, which can be handled with the LASSO.

  2. So I used LASSO:

    library(glmnet)
    set.seed(999)
    cv.lasso  <- cv.glmnet(x, y, family = 'binomial')   # cross-validate the penalty
    penalty   <- cv.lasso$lambda.min                    # lambda with minimum CV error
    fit.lasso <- glmnet(x, y, family = 'binomial', alpha = 1, lambda = penalty)
    coef(fit.lasso)
    

    LASSO told me 10 variables were worth using. But if I include all of them, I again get the "0 or 1 occurred" error message.
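
    For reference, this is how those 10 can be read off the fit (a sketch; coef() on a glmnet fit returns a sparse column matrix):

    beta <- coef(fit.lasso)                  # sparse column matrix of coefficients
    rownames(beta)[which(beta[, 1] != 0)]    # names of the selected terms (plus the intercept)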

What could I do?

Edit 1:

Here is the summary(model) output with the 10 variables LASSO suggested:

summary(model)

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.769e+01  1.717e+05   0.000    1.000
raceB       -3.756e+01  8.725e+03  -0.004    0.997
raceC       -1.809e+01  6.169e+03  -0.003    0.998
genderB     -2.131e+00  1.835e+00  -1.161    0.246
genderC      3.969e+01  3.638e+04   0.001    0.999
sexorD      -2.608e+01  3.580e+04  -0.001    0.999
sexorE      -1.496e+01  7.994e+04   0.000    1.000
collegeB    -1.735e+01  7.994e+04   0.000    1.000
collegeC    -1.615e+01  7.994e+04   0.000    1.000
...

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 94.242  on 119  degrees of freedom
Residual deviance: 24.264  on  85  degrees of freedom
AIC: 94.264

Number of Fisher Scoring iterations: 22

That's not everything, but I think you get the point.

Edit 2:

Here are the singular values (the d vector from the SVD) of my model matrix. Note that the last two are essentially zero, so the matrix is rank-deficient.

x <- model.matrix(Y ~ ., data = data)
svd(x)$d

[1] 1.106585e+02 2.284868e+01 2.004170e+01 1.791028e+01 1.303047e+01 1.134027e+01 1.048113e+01 9.474703e+00 8.521352e+00 8.066918e+00 7.931767e+00
[12] 7.606502e+00 7.435396e+00 6.935058e+00 6.687999e+00 6.382451e+00 6.179303e+00 5.904702e+00 5.533401e+00 5.428218e+00 5.335237e+00 5.237369e+00
[23] 4.874816e+00 4.612060e+00 4.544942e+00 4.475750e+00 4.349221e+00 4.135874e+00 4.022473e+00 3.842192e+00 3.708770e+00 3.656491e+00 3.518284e+00
[34] 3.371328e+00 3.305039e+00 3.087673e+00 2.772643e+00 2.743743e+00 2.678787e+00 2.536824e+00 2.515302e+00 2.347022e+00 2.270346e+00 2.180056e+00
[45] 2.162843e+00 2.118737e+00 2.043311e+00 1.900974e+00 1.823667e+00 1.762206e+00 1.686792e+00 1.615979e+00 1.583439e+00 1.430641e+00 1.408800e+00
[56] 1.257654e+00 1.162283e+00 1.110813e+00 1.089611e+00 1.023927e+00 9.276928e-01 7.876539e-01 7.600454e-01 6.976948e-01 6.817119e-01 6.322815e-01
[67] 5.933869e-01 5.189669e-01 4.339124e-01 3.667706e-01 2.506821e-01 2.274977e-01 1.084797e-14 1.084797e-14

Edit 3:

I've tried running a principal component analysis. The results are not promising: https://i.imgur.com/Dt0uBHU.png. It seems I can only remove a few variables.

new_data <- subset(data, select = -c(Y))   # drop the response
library(dummies)
new_data <- dummy.data.frame(new_data)     # dummy-code the factors

pca.train <- new_data[1:120, ]
pca.test  <- new_data[121:160, ]
# remove constant/zero columns so prcomp(scale. = TRUE) does not error
pca.train <- pca.train[, !apply(pca.train, MARGIN = 2,
                                function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
prin_comp <- prcomp(pca.train, scale. = TRUE)


std_dev    <- prin_comp$sdev
pr_var     <- std_dev^2                # variance of each principal component
prop_varex <- pr_var / sum(pr_var)     # proportion of total variance explained
plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", type = "b")

Without the apply() line I got a "cannot rescale a constant/zero column to unit variance" error, so I added it to remove constant/zero columns before calling prcomp().

Edit 4:

If I run a glm on the first 5 principal components from the PCA, the AUC is 0.8864, which I think is pretty good.
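
A sketch of how an AUC like that can be computed (assuming the pROC package and that Y is aligned with the pca.train rows; this is illustrative, not my exact code):

library(pROC)                                        # assumed package
pcs <- prin_comp$x[, 1:5]                            # scores on the first 5 PCs
fit <- glm(data$Y[1:120] ~ pcs, family = binomial)   # fit on the training rows
auc(roc(data$Y[1:120], fitted(fit)))                 # area under the ROC curve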

But since the overall PCA looks bad, I guess I'm stuck doing univariate analysis, one variable at a time?

Best Answer

Getting fitted values that are 0 or 1 is not itself a problem, nor is it necessarily a sign of over-fitting. Other things being equal, getting fitted probabilities near 0 or 1 is good rather than bad, suggesting that the predictor variables are correlated with the response. The lack of convergence is similar -- it is just a consequence of the 0 or 1 fitted values.

Neither warning is an "error message". These warnings do, however, alert you that you will not be able to use the coefficient standard errors and p-values produced by the summary table:

summary(model)

Looking at the summary table you give, it is evident that most of the coefficients and standard errors are effectively infinite. It is well known that logistic regression does not yield usable z-statistics in this situation. You need to use likelihood ratio tests (LRTs) instead. The variables may well be highly significant by LRT even though the Wald p-values in the summary table are all near 1. See p-value from a binomial model glm for a binomial predictor for an example of this.

To see LRTs for your fit, you could try

anova(model, test="Chi")

but beware that the p-values you see are order dependent. This is a "sequential analysis of deviance table". Each variable is added to the model one at a time, in the same order you included them in the model formula. Each variable is adjusted for the variables above it in the table, so each p-value tests whether that variable adds something useful over the variables already in the model. If you change the order of the variables, then the p-values will change as well.
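
To see this concretely, compare the same model fitted with two variable orders (a sketch using two variable names from the question):

anova(glm(Y ~ race + college, family = binomial, data = data), test = "Chisq")
anova(glm(Y ~ college + race, family = binomial, data = data), test = "Chisq")
# each row is adjusted only for the rows above it, so the
# p-values for race and college differ between the two tables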

It is also evident that you do have over-fitting in the sense that you are including too many predictor variables that are collinear with one another and therefore mutually redundant. You cannot possibly interpret the logistic regression with 25 variables, and it is likely to be pretty useless for prediction as well.

Rather than examining individual p-values, you need to test the overall significance of the regression model. You can do this by comparing the full and null models:

model <- glm(Y~., family=binomial(link='logit'), data=...)
null.model <- glm(Y~1, family=binomial(link='logit'), data=...)
anova(null.model, model, test="Chi")

If the overall model is not significant, then there is nothing to be done. In that case, trying to do any model selection would be purposeless.

If the overall model is significant, then you have the problem of which variables to keep. I don't agree that the LASSO is useful here, because it has treated all the columns of the design matrix as continuous covariates. It has not taken into account the fact that columns are grouped by factor.
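
If a penalized fit is still wanted, one alternative that does respect the grouping is the group LASSO, which keeps all dummy columns of a factor together. A sketch, assuming the grpreg package (package choice and details are assumptions, not part of the original analysis):

library(grpreg)                                               # assumed package
X   <- model.matrix(Y ~ ., data = data)[, -1]                 # dummy-coded design, intercept dropped
grp <- attr(model.matrix(Y ~ ., data = data), "assign")[-1]   # maps each column to its source variable
cvfit <- cv.grpreg(X, data$Y, group = grp,                    # Y assumed coded 0/1
                   penalty = "grLasso", family = "binomial")
coef(cvfit)   # columns from the same factor enter or leave the model together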

There are lots of ways to proceed, but I would be tempted to just try logistic regression with one factor or variable at a time, to see whether any individual variable gives you good prediction.
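
That could look something like the following sketch (it assumes the response is the Y column of data and uses likelihood ratio tests as above):

null <- glm(Y ~ 1, family = binomial, data = data)
for (v in setdiff(names(data), "Y")) {
  fit <- glm(reformulate(v, response = "Y"), family = binomial, data = data)
  p   <- anova(null, fit, test = "Chisq")$`Pr(>Chi)`[2]   # LRT against the null model
  cat(v, ": LRT p =", format.pval(p), "\n")
}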

It might also be useful to examine collinearity of your variables. For example

 table(race, college)

would tell you if you have representatives of all races at all college levels. If race and college are highly correlated, then they might be mutually redundant in your model. Same for other variables such as gender and sexor.
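
If such a cross-tab looks unbalanced, a quick chi-squared test (a sketch; it assumes the cell counts are not too small) quantifies the association:

chisq.test(table(race, college))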

A bit of common sense may be required here, rather than reliance on automatic procedures.