Logistic predict.glm() function error: factor has new levels

Tags: error message, generalized linear model, logistic, r

I fit a logistic regression on three continuous numeric variables plus one categorical factor with levels Y and N.

logit2A <- glm(DisclosedDriver ~ VehDrvr_Dif+POL_SEQ_NUM+PRMTOTAL+SAFE_DRVR_PLEDGE_FLG, data = DF, family = "binomial") 

Fit looks wonderful.

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -2.204e+00  2.253e-01  -9.782  < 2e-16 ***
VehDrvr_Dif            2.918e-01  1.026e-01   2.845 0.004440 ** 
POL_SEQ_NUM           -1.893e-01  5.617e-02  -3.370 0.000751 ***
PRMTOTAL               1.109e-04  5.526e-05   2.006 0.044804 *  
SAFE_DRVR_PLEDGE_FLGY -7.220e-01  1.633e-01  -4.422 9.76e-06 ***

So evidently R took the SAFE_DRVR_PLEDGE_FLG factor, treated 'N' as the reference level (absorbed into the intercept), and reports a coefficient only for 'Y'.
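As a side note, here is a minimal sketch of how R picks that reference level under its default treatment contrasts. The vector `flag` is a made-up toy factor, not the data from the question.

```r
# Hypothetical toy factor, not the asker's data
flag <- factor(c("Y", "N", "Y", "Y", "N"))

# Levels are sorted by default, so "N" comes before "Y"
levels(flag)

# With the default treatment contrasts the first level is the baseline,
# so only a dummy column for "Y" is created
contrasts(flag)

# relevel() would make "Y" the baseline instead
flag_y <- relevel(flag, ref = "Y")
levels(flag_y)
```

This is why the summary shows a row named SAFE_DRVR_PLEDGE_FLGY and nothing for 'N'.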

Now I want to use the fit to calculate the predicted probabilities, and this is where the error appears:

> DF$P_GLM <- predict.glm(logit2A, DF, type="response", se.fit=FALSE)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor SAFE_DRVR_PLEDGE_FLG has new levels 

Umm… no it doesn't, because I just fit the model with the exact same data I'm trying to use for the prediction. What's the problem?

Trying to respond to first comment:
Don't know what you mean. I've got 3500 rows of data… It's a logistic regression on three continuous variables and one categorical. The categorical has two values, Y or N. My glm fit gives the numbers shown above. I just want to plug it all back in with the predict function, and it gives me that error. Here's the categorical variable:

 > DF$SAFE_DRVR_PLEDGE_FLG
 [1] Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y N Y Y N Y Y N Y Y N Y Y Y Y Y Y Y Y Y Y N Y Y N Y Y Y N Y Y Y Y Y N Y Y Y Y Y Y
 [60] Y Y Y Y N Y Y Y Y Y Y Y Y N Y Y Y N N Y N Y Y Y Y Y N Y Y N Y N N Y Y Y N Y Y Y Y N Y Y Y Y Y N Y N Y N Y Y Y Y Y N Y
 [119] N Y Y Y Y Y Y Y Y N Y Y Y Y Y Y N Y Y Y N Y Y Y N Y Y Y N N Y N N N Y N Y Y Y N N Y Y N Y Y Y Y N N Y Y Y Y N N Y N N
 Levels:  N Y

What do you mean by a working example? The fit works. The probability output of the predict function doesn't…

Best Answer

Using my magic crystal ball to see the output of str(DF)...

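Aha! SAFE_DRVR_PLEDGE_FLG has another level, "", and in that same row one of the other variables is missing. (The double space in "Levels:  N Y" is the giveaway: an empty-string level is printed before N.) So glm() drops that row when doing the fit, and the level "" never makes it into the model. But when predict tries to construct the data matrix from DF to get the predictions, it can't, because the level "" exists in the data set but not in the model.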
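If you don't have a crystal ball, something along these lines should surface the same problem. This is only a sketch: `toy` is a hypothetical stand-in for DF, built so that one row has an empty-string flag and a missing predictor.

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible

# Hypothetical stand-in for DF: row 21 has flag "" and a missing x
toy <- data.frame(
  y    = factor(rep(c("A", "B", "B", "A", "A"), length.out = 21)),
  x    = c(rnorm(20), NA),
  flag = factor(c(rep(c("N", "Y"), 10), ""))
)

# The empty string shows up as a level of its own
levels(toy$flag)
table(toy$flag, useNA = "ifany")

fit <- glm(y ~ x + flag, data = toy, family = binomial)

# Rows that were dropped during fitting because of missing values
fit$na.action

# Levels present in the data but unknown to the fitted model
setdiff(levels(toy$flag), fit$xlevels$flag)
```

Comparing `levels()` of the column with the `xlevels` component of the fit points directly at the level that predict will complain about.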

You should probably fix your data set before fitting the model, but you can also get the predictions to work by calling predict without the newdata argument (that is, without passing DF); it then uses the data the model was actually fitted on. However, this returns fewer predictions than there are rows in DF, because the rows that were causing trouble were automatically discarded during fitting.

Example, based on @smillig's:

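set.seed(1)  # make the random data reproducible

# stringsAsFactors=TRUE keeps y and z as factors under R >= 4.0.0, where
# data.frame() no longer converts strings to factors by default
d <- data.frame(w=rnorm(100),
                x=rnorm(100),
                y=sample(LETTERS[1:2], 100, replace=TRUE),
                z=sample(LETTERS[3:4], 100, replace=TRUE),
                stringsAsFactors=TRUE)
# Extra row: missing response and a new, empty-string level of z
d2 <- rbind(d, data.frame(w=1, x=1, y=NA, z="", stringsAsFactors=TRUE))
fm2 <- glm(y ~ w + x + z, data=d2, family=binomial)  # row 101 is dropped (y is NA)

# try() lets the script continue past the expected error
try(predict.glm(fm2, d2, type="response", se.fit=FALSE))
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
# factor 'z' has new level(s) 

predict(fm2, type="response", se.fit=FALSE)
# No error, this works. However, it has length 100 even though d2 has 101 rows.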
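
One possible way to "fix the data set first", sketched on the same kind of toy data as the example above. Treating "" as missing and fitting with `na.action = na.exclude` are my assumptions about what a reasonable cleanup looks like here; whether an empty flag really means "missing" in your data is something only you can decide. The names `fm3` and `p` are just illustrative.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Same shape as the example above: 100 clean rows plus one problem row
d <- data.frame(w = rnorm(100),
                x = rnorm(100),
                y = sample(LETTERS[1:2], 100, replace = TRUE),
                z = sample(LETTERS[3:4], 100, replace = TRUE),
                stringsAsFactors = TRUE)
d2 <- rbind(d, data.frame(w = 1, x = 1, y = NA, z = "",
                          stringsAsFactors = TRUE))

# Assumption: an empty string means "missing", so recode it as NA
# and drop the now-unused level
d2$z[d2$z == ""] <- NA
d2$z <- droplevels(d2$z)

# na.exclude drops incomplete rows for fitting but remembers where they were
fm3 <- glm(y ~ w + x + z, data = d2, family = binomial,
           na.action = na.exclude)

# One value per row of d2; the excluded row comes back as NA
p <- predict(fm3, type = "response")
length(p)
which(is.na(p))
```

Because the result has one entry per row, it can be assigned straight back as a column (for example `d2$p <- p`), which is what the original `DF$P_GLM <- ...` line was trying to do.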