This line in glm() is doing you in:
mf$drop.unused.levels <- TRUE
It effectively sets the drop.unused.levels argument of model.frame(), which results in the behaviour you report.
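A minimal sketch of that effect, assuming the behaviour in question is a factor level that disappears from the fitted model. The data frame d and its columns y and x are made up for illustration and are not from the question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

## dummy data: level "c" of x is rare
d <- data.frame(
  y = rbinom(100, 1, 0.5),
  x = factor(rep(c("a", "b", "c"), times = c(48, 48, 4)))
)

## a training set that happens to contain no "c" rows
train <- d[d$x != "c", ]
levels(train$x)   # the unused level "c" is still recorded on the factor

fit <- glm(y ~ x, data = train, family = binomial)
fit$xlevels$x     # glm() dropped the unused level when building the model frame

## predicting for data that contains "c" now fails; capture the error message
msg <- tryCatch(predict(fit, newdata = d),
                error = function(e) conditionMessage(e))
msg
```

The point is that keeping the level on the factor is not enough: glm() discards any level with no observations in the training rows.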
The obvious solution is to prevent this from happening by adjusting the sampling algorithm you use to split the data into training and test sets. Instead of randomly sampling the rows of the data, randomly sample within the levels of the factor.
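If you do want to handle the details yourself, sampling within the levels could be sketched in base R roughly as follows. The 20% test fraction and the names test_frac, idx_by_level, test_idx and train_idx are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

X1 <- factor(rep(1:3, times = c(20, 30, 50)))  ## dummy data for illustration
test_frac <- 0.2                               ## assumed test-set fraction

## row indices grouped by factor level
idx_by_level <- split(seq_along(X1), X1)

## draw the test rows separately within each level (at least one per level);
## sample.int() avoids the sample() surprise when a level has a single row
test_idx <- unlist(lapply(idx_by_level, function(i) {
  n_test <- max(1, round(test_frac * length(i)))
  i[sample.int(length(i), size = n_test)]
}), use.names = FALSE)

train_idx <- setdiff(seq_along(X1), test_idx)

table(X1[train_idx])  ## every level should be present in the training set
table(X1[test_idx])
```

Note that a level with a single observation would still end up entirely in the test set under this sketch, so very rare levels need separate handling.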
If you don't want to handle the details yourself, try the caret package and its function createFolds():
## install.packages("caret")
library("caret")
X1 <- factor(rep(1:3, times = c(20, 30, 50))) ## dummy data for illustration
f <- createFolds(X1, k = 5)
f
which gives:
> f <- createFolds(X1, k = 5)
> f
$Fold1
[1] 5 7 10 20 21 24 29 31 34 42 51 52 59 68 75 76 82 83
[19] 85 94
$Fold2
[1] 4 9 11 18 22 23 30 38 40 44 55 58 62 66 70 72 80 81
[19] 87 92
$Fold3
[1] 1 12 14 16 27 37 41 48 49 50 53 60 61 63 64 74 79 88
[19] 89 97
$Fold4
[1] 3 15 17 19 25 28 32 35 36 43 54 57 67 69 71 73 78 86
[19] 98 99
$Fold5
[1] 2 6 8 13 26 33 39 45 46 47 56 65 77 84
[15] 90 91 93 95 96 100
The values in f are the indices of the elements of X1, partitioning it into k = 5 groups, with sampling from within the levels of X1 as needed. Then take one of these folds at random as the test set.
## number of samples in levels of X1 for each split
> table(X1[-f[[1]]])
1 2 3
16 24 40
> table(X1[-f[[2]]])
1 2 3
16 24 40
> table(X1[-f[[3]]])
1 2 3
16 24 40
> table(X1[-f[[4]]])
1 2 3
16 24 40
> table(X1[-f[[5]]])
1 2 3
16 24 40
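Taking one fold at random as the test set, as described above, might look like the following sketch. The names test_fold, test_idx and train_idx are illustrative, and the seed is arbitrary.

```r
library("caret")

set.seed(123)  # arbitrary seed, only for reproducibility

X1 <- factor(rep(1:3, times = c(20, 30, 50)))  ## same dummy data as above
f <- createFolds(X1, k = 5)

test_fold <- sample(length(f), 1)              ## pick one of the k folds at random
test_idx  <- f[[test_fold]]                    ## rows held out for testing
train_idx <- setdiff(seq_along(X1), test_idx)  ## everything else is training data

table(X1[train_idx])
table(X1[test_idx])
```

With a data frame, the same indices would be used to subset rows, e.g. dat[train_idx, ] and dat[test_idx, ] for some data frame dat.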
Do note that this algorithm doesn't guarantee that the stratified sampling will always work for small sample sizes (i.e. you may not be able to escape the missing-levels issue in all cases).
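Because of that caveat, it may be worth checking each split before fitting. A small sketch, where has_all_levels() is a hypothetical helper (not part of caret) and X2 is made-up data with one very rare level:

```r
## hypothetical helper: TRUE if every level of x occurs among the rows in idx
has_all_levels <- function(x, idx) all(table(x[idx]) > 0)

## dummy data in which level "c" has a single observation (row 21)
X2 <- factor(rep(c("a", "b", "c"), times = c(10, 10, 1)))

## any test set containing row 21 leaves the training set without level "c"
test_idx  <- c(1, 11, 21)
train_idx <- setdiff(seq_along(X2), test_idx)

has_all_levels(X2, train_idx)  ## FALSE: "c" is missing from the training rows
has_all_levels(X2, test_idx)   ## TRUE
```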
The formalism used to write models in R can be quite handy, in this case with factor variables explicitly noted:
Y ~ age + calendar + factor(teacher) + factor(gender) + factor(prep_course)
You could expand to indicate more specifically that this is a logistic regression, and I suppose to indicate the reference levels of the factor variables (although that probably isn't so important for your presentation).
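To make the logistic regression and the reference levels explicit, something along these lines could be used. This is only a sketch: the data frame d is simulated purely for illustration, and the category labels and the choice of "B" as the reference teacher are assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 200

## simulated stand-in data; the real variables will look different
d <- data.frame(
  Y           = rbinom(n, 1, 0.5),
  age         = round(runif(n, 18, 30)),
  calendar    = sample(2010:2015, n, replace = TRUE),
  teacher     = sample(c("A", "B", "C"), n, replace = TRUE),
  gender      = sample(c("F", "M"), n, replace = TRUE),
  prep_course = sample(c("no", "yes"), n, replace = TRUE)
)

## choose the reference level of a factor explicitly
d$teacher <- relevel(factor(d$teacher), ref = "B")

## family = binomial gives a logistic regression (logit link by default)
fit <- glm(Y ~ age + calendar + factor(teacher) + factor(gender) +
             factor(prep_course),
           family = binomial, data = d)

names(coef(fit))  ## no coefficient for teacher "B": it is the reference level
```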
Best Answer
Using copulas is one way of generating dependent (or rank-correlated) data from multivariate distributions that are not necessarily normal. Here is a simple example of doing this in Matlab: "Simulating Dependent Random Variables Using Copulas". I am not sure whether this can handle categorical variables, though.
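For readers working in R rather than Matlab, the idea can be sketched with a Gaussian copula in base R. This is not a translation of the linked example; the correlation of 0.7 and the marginal distributions are assumptions. Cutting a uniform margin into bins is one possible way to obtain an ordered categorical variable, though whether that is adequate depends on the application.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n   <- 5000
rho <- 0.7   # assumed correlation on the latent normal scale

## correlation matrix of the latent normals
Sigma <- matrix(rho, 3, 3)
diag(Sigma) <- 1

## correlated standard normals: rows of Z have covariance Sigma
Z <- matrix(rnorm(3 * n), ncol = 3) %*% chol(Sigma)

## probability integral transform: each column is uniform on (0, 1),
## with the dependence structure carried over from Z
U <- pnorm(Z)

## apply the quantile function of whatever marginal is wanted
x1 <- qgamma(U[, 1], shape = 2, rate = 1)   # skewed continuous margin
x2 <- qexp(U[, 2], rate = 0.5)              # another non-normal margin

## one way to get an ordered categorical margin with chosen proportions
x3 <- cut(U[, 3], breaks = c(0, 0.5, 0.8, 1),
          labels = c("low", "mid", "high"), ordered_result = TRUE)

cor(x1, x2, method = "spearman")  # rank correlation survives the transforms
prop.table(table(x3))             # roughly 0.5, 0.3, 0.2
```

Rank correlation is the natural summary here because the monotone quantile transforms preserve ranks, whereas the Pearson correlation of x1 and x2 will generally differ from rho.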