Solved – Missing factor levels after logistic regression glm()

logistic, missing data, r, regression

I would like to perform a logistic regression on a marketing data set (only categorical variables), of the form [outcome, $X_1$,$X_2$,$X_3$,$X_4$,$X_5$,$X_6$]

I split the data set into a training set and a validation set.

My problem: Predictor $X_1$ originally has 3 levels, but the model fitted with glm retains only 2 of them.

When I try to run the model on the validation set (where $X_1$ still has 3 levels), I get an error message stating that the factor $X_1$ has a new level.

How can I prevent the glm function from excluding factor levels? I don't mind if their coefficients are set to zero.

Best Answer

This line in glm() is doing you in:

mf$drop.unused.levels <- TRUE

This sets the drop.unused.levels argument of model.frame(), so any level of $X_1$ that has no rows in the training data is dropped from the fitted model. That is the behaviour you report.
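A minimal reproduction may make the effect easier to see. The data below are made up for illustration (the names dat, train, y and the level labels "a", "b", "c" are assumptions, not the asker's data), and the training subset is chosen so that one level has no rows:

```r
## Hypothetical toy data: binary outcome y and a three-level factor X1
set.seed(1)
dat <- data.frame(
  y  = rbinom(60, 1, 0.5),
  X1 = factor(rep(c("a", "b", "c"), each = 20))
)

## A training subset that happens to contain no rows with X1 == "c"
train <- dat[dat$X1 != "c", ]
levels(train$X1)            # the factor still lists "a" "b" "c"

fit <- glm(y ~ X1, data = train, family = binomial)
fit$xlevels$X1              # only "a" "b": the unused level was dropped

## Predicting on data that contains X1 == "c" now raises the "new levels" error
res <- try(predict(fit, newdata = dat), silent = TRUE)
inherits(res, "try-error")  # TRUE
```

The factor in the training data still knows about all three levels; it is the fitted model that has forgotten the one with no observations.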

The obvious solution is to stop this from happening by adjusting the sampling algorithm you use to produce your training and test sets: instead of randomly sampling the rows of the data, randomly sample within the levels of the factor.
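Sampling within levels can be sketched in base R as follows. This is only one way to do it; the dummy data, the 20% hold-out fraction and the names idx_by_level and test_idx are assumptions chosen for illustration, and it presumes every level has at least two rows:

```r
## Hypothetical dummy data: binary outcome y and a three-level factor X1
set.seed(42)
dat <- data.frame(
  y  = rbinom(100, 1, 0.5),
  X1 = factor(rep(1:3, times = c(20, 30, 50)))
)

## Row indices grouped by level of X1
idx_by_level <- split(seq_len(nrow(dat)), dat$X1)

## Within each level, hold out roughly 20% of the rows (at least one)
test_idx <- unlist(lapply(idx_by_level, function(i) {
  n_test <- max(1, round(0.2 * length(i)))
  i[sample.int(length(i), n_test)]   # sample positions, safe for short vectors
}), use.names = FALSE)

train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

table(train$X1)   # every level is present in the training set
table(test$X1)    # and in the test set
```

Because the draw is made separately inside each level, the training set cannot lose a level as long as that level has more rows than are held out.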

If you don't want to handle the details yourself, try the caret package and its function createFolds():

## install.packages("caret")
library("caret")

X1 <- factor(rep(1:3, times = c(20, 30, 50))) ## dummy data for illustration
f <- createFolds(X1, k = 5)
f

which gives:

> f <- createFolds(X1, k = 5)
> f
$Fold1
 [1]  5  7 10 20 21 24 29 31 34 42 51 52 59 68 75 76 82 83
[19] 85 94

$Fold2
 [1]  4  9 11 18 22 23 30 38 40 44 55 58 62 66 70 72 80 81
[19] 87 92

$Fold3
 [1]  1 12 14 16 27 37 41 48 49 50 53 60 61 63 64 74 79 88
[19] 89 97

$Fold4
 [1]  3 15 17 19 25 28 32 35 36 43 54 57 67 69 71 73 78 86
[19] 98 99

$Fold5
 [1]   2   6   8  13  26  33  39  45  46  47  56  65  77  84
[15]  90  91  93  95  96 100

The values in f are indices into X1 that partition it into k = 5 groups, with the sampling done within the levels of X1. Take one of these folds at random as the test set and use the remaining rows as the training set.

## number of samples per level of X1 in the training rows (fold i held out)
> table(X1[-f[[1]]])

 1  2  3 
16 24 40 
> table(X1[-f[[2]]])

 1  2  3 
16 24 40 
> table(X1[-f[[3]]])

 1  2  3 
16 24 40 
> table(X1[-f[[4]]])

 1  2  3 
16 24 40 
> table(X1[-f[[5]]])

 1  2  3 
16 24 40

Do note that with small sample sizes this algorithm doesn't guarantee that the stratified sampling will always work (i.e. you may not be able to escape the missing levels issue in all cases).
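Given that caveat, it may be worth checking a split before fitting. The helper below is a hypothetical sketch (missing_levels is not part of caret or base R) that lists the levels of a factor with no rows in a given set of training indices:

```r
## Hypothetical helper: levels of factor `f` that have no rows among indices `idx`
missing_levels <- function(f, idx) {
  setdiff(levels(f), as.character(unique(f[idx])))
}

## Dummy factor with a very rare first level (only two rows)
X1 <- factor(rep(1:3, times = c(2, 30, 50)))

train_idx <- 3:82                    # deliberately leaves out both rows of level "1"
missing_levels(X1, train_idx)        # "1"
missing_levels(X1, seq_along(X1))    # character(0): nothing is missing
```

If the result is non-empty, the split can be redrawn before calling glm().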
