This line in glm() is doing you in:
mf$drop.unused.levels <- TRUE
which effectively sets the argument of the same name of model.frame(), and that results in the behaviour you report.
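To see the effect in isolation, here is a small sketch with a made-up data frame `d` (hypothetical, not from the question) in which the factor declares a level that never occurs. It only uses model.frame() and its documented drop.unused.levels argument:

```r
## toy data (hypothetical): level "c" is declared but never observed
d <- data.frame(y = c(1, 2, 3, 4),
                x = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")))

## default behaviour of model.frame(): unused levels are kept
mf_keep <- model.frame(y ~ x, data = d)
levels(mf_keep$x)

## what glm() asks for internally: unused levels are dropped before fitting
mf_drop <- model.frame(y ~ x, data = d, drop.unused.levels = TRUE)
levels(mf_drop$x)
```

With the argument set, a level that is absent from the training rows is simply not part of the fitted model.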
The obvious solution is to not allow this to happen: adjust the sampling algorithm you use to produce your training and test sets. Instead of randomly sampling the rows of the data, randomly sample within the levels of the factor.
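If you do want to handle the details yourself, a base-R sketch might look like the following. The factor `g`, the 20% test fraction, and the seed are all assumptions chosen for illustration:

```r
set.seed(1)  ## arbitrary seed, only to make the sketch reproducible
g <- factor(rep(1:3, times = c(20, 30, 50)))  ## hypothetical grouping factor

## row indices grouped by level of the factor
idx_by_level <- split(seq_along(g), g)

## draw roughly 20% of the rows *within each level* for the test set;
## sample.int() avoids the surprise of sample() on a length-1 vector
test_idx <- unlist(lapply(idx_by_level, function(i) {
  i[sample.int(length(i), size = round(0.2 * length(i)))]
}), use.names = FALSE)
train_idx <- setdiff(seq_along(g), test_idx)

table(g[train_idx])  ## every level should still be present here
table(g[test_idx])
```

Because the test size is rounded per level, very small levels contribute no test rows and stay entirely in the training set.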
If you don't want to handle the details yourself, try the caret package and its function createFolds():
## install.packages("caret")
library("caret")
X1 <- factor(rep(1:3, times = c(20, 30, 50))) ## dummy data for illustration
f <- createFolds(X1, k = 5)
f
which gives:
> f <- createFolds(X1, k = 5)
> f
$Fold1
[1] 5 7 10 20 21 24 29 31 34 42 51 52 59 68 75 76 82 83
[19] 85 94
$Fold2
[1] 4 9 11 18 22 23 30 38 40 44 55 58 62 66 70 72 80 81
[19] 87 92
$Fold3
[1] 1 12 14 16 27 37 41 48 49 50 53 60 61 63 64 74 79 88
[19] 89 97
$Fold4
[1] 3 15 17 19 25 28 32 35 36 43 54 57 67 69 71 73 78 86
[19] 98 99
$Fold5
[1] 2 6 8 13 26 33 39 45 46 47 56 65 77 84
[15] 90 91 93 95 96 100
The values in f are the indices of the elements of X1, partitioning it into k = 5 groups, with sampling from within the levels of X1 as needed. Then take one of these folds at random as the test set.
## number of samples in levels of X1 for each split
> table(X1[-f[[1]]])
 1  2  3
16 24 40
> table(X1[-f[[2]]])
 1  2  3
16 24 40
> table(X1[-f[[3]]])
 1  2  3
16 24 40
> table(X1[-f[[4]]])
 1  2  3
16 24 40
> table(X1[-f[[5]]])
 1  2  3
16 24 40
Do note that this algorithm doesn't guarantee that the stratified sampling will always work for small sample sizes (i.e. you may not be able to escape the missing-levels issue in all cases).
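Given that caveat, it may be worth checking the split before fitting. A minimal sketch, assuming the same dummy X1 as above; the seed and the names test_fold, train_idx, test_idx and missing_in_train are my own:

```r
library("caret")

set.seed(42)  ## arbitrary seed, for reproducibility only
X1 <- factor(rep(1:3, times = c(20, 30, 50)))  ## same dummy data as above
f <- createFolds(X1, k = 5)

test_fold <- sample(length(f), 1)  ## pick one fold at random
test_idx  <- f[[test_fold]]
train_idx <- setdiff(seq_along(X1), test_idx)

## levels that occur in the test set but not in the training set
missing_in_train <- setdiff(unique(as.character(X1[test_idx])),
                            unique(as.character(X1[train_idx])))
if (length(missing_in_train) > 0) {
  warning("levels absent from the training set: ",
          paste(missing_in_train, collapse = ", "))
}
```

If the warning fires, redraw the folds or merge the sparse levels before calling glm().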
Best Answer
This problem often indicates that you have a singular design matrix $X$. You can check this by testing whether the rank of the cross-product $X^\top X$ equals the number of columns of $X$; if the rank is smaller, $X$ is rank deficient.
This can easily be performed in R using
Here is an R-example with some simulated data
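A minimal sketch of such a check, assuming qr() applied to crossprod(X) as the rank computation. The simulated predictors x1, x2, x3, the seed, and the matrix names are illustrative assumptions, with x3 deliberately built as an exact linear combination of the others:

```r
set.seed(123)  ## arbitrary seed
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2  ## exact linear combination, so the design is rank deficient

X_bad  <- cbind(1, x1, x2, x3)  ## intercept plus three predictors
X_good <- cbind(1, x1, x2)      ## same design without the redundant column

## compare the rank of the cross-product with the number of columns
rank_bad  <- qr(crossprod(X_bad))$rank
rank_good <- qr(crossprod(X_good))$rank

rank_bad  < ncol(X_bad)    ## TRUE signals a singular design matrix
rank_good == ncol(X_good)  ## TRUE signals full column rank
```

Dropping (or recoding) the redundant column is usually enough to restore full rank.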