Solved – glm returns NA as coefficient for logistic regression

generalized linear model, logistic, missing data, regression, regression coefficients

I am fitting a logistic regression for a binary response variable (0 or 1). There are 15 explanatory variables: 10 are continuous and 5 are categorical with 3 levels each. I checked collinearity among the 10 continuous variables using pairwise correlations, and they look fine. In R, the glm function returns NA as the coefficient for one of the levels of a categorical variable.

How can I fix this problem?

Please help.

Best Answer

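This problem often indicates that you have a singular (rank-deficient) design matrix $X$, meaning that at least one column is an exact linear combination of the others. This can happen through the categorical predictors even when the continuous ones are not strongly correlated. You can check for it by seeing whether the rank of $X$ (which equals the rank of the cross-product $X^\top X$) equals the number of columns of $X$.

This is easy to check in R using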

ncol(X) == qr(X)$rank

Here is an R example with some simulated data:

set.seed(123)  # make the simulated data reproducible
N <- 10
x <- rnorm(N)
z <- sample(c(1, 2, 3), N, replace = TRUE)
y <- sample(c(0, 1), N, replace = TRUE)
data <- data.frame(y = y, x = x, z = as.factor(z))
model <- glm(y ~ x + z, data = data, family = "binomial")
summary(model)

# Get model matrix ...
X <- model.matrix(~x+z,data=data)

# Get rank of model matrix
qr(X)$rank

# Get number of parameters of the model = number of columns of model matrix
ncol(X)

# See if model matrix has full rank (FALSE means at least one coefficient will be NA)
ncol(X) == qr(X)$rank
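
As a minimal sketch of what a rank-deficient design looks like, the following example builds two factors in which one dummy column of the second duplicates a dummy column of the first. The data and the names (f1, f2, fit, na_coefs, dropped) are made up for illustration and are not from the question; the sketch assumes R's default treatment contrasts.

```r
set.seed(1)
n  <- 60
f1 <- factor(rep(c("a", "b", "c"), each = 20))
# Hypothetical second factor: level "z" occurs exactly when f1 == "c",
# so the dummy column f2z is identical to the dummy column f1c.
f2 <- factor(ifelse(f1 == "c", "z", rep(c("x", "y"), length.out = n)))
y  <- rbinom(n, 1, 0.5)
d  <- data.frame(y = y, f1 = f1, f2 = f2)

fit <- glm(y ~ f1 + f2, data = d, family = binomial)

# Coefficients that glm could not estimate are reported as NA
na_coefs <- names(coef(fit))[is.na(coef(fit))]
na_coefs

# The same information from the model matrix: the rank is below the column count
X <- model.matrix(~ f1 + f2, data = d)
q <- qr(X)
c(rank = q$rank, columns = ncol(X))

# Columns that the pivoted QR decomposition places beyond the rank are redundant
dropped <- colnames(X)[q$pivot[-seq_len(q$rank)]]
dropped
```

Once the redundant column is identified, typical remedies are to drop one of the overlapping variables or to merge factor levels; which one is appropriate depends on the data.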

Related Solutions

Solved – Missing factor levels after logistic regression glm()

This line in glm() is doing you in:

mf$drop.unused.levels <- TRUE

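which effectively sets the argument of the same name in model.frame(). As a result, any factor level that has no observations in the data passed to glm() is dropped before the model is fitted, which produces the behaviour you report.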
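
To see the effect in isolation, here is a small sketch with made-up data (the names d, g, train, fit, and res are hypothetical). The training subset contains no rows for level "d", so the fitted model no longer knows about that level and prediction for it fails.

```r
set.seed(1)
d <- data.frame(
  y = rbinom(40, 1, 0.5),
  g = factor(rep(c("a", "b", "c", "d"), times = c(15, 15, 8, 2)))
)

# Training rows exclude level "d"; subsetting keeps "d" as an (empty) factor level
train <- d[d$g != "d", ]
levels(train$g)

fit <- glm(y ~ g, data = train, family = binomial)

# glm() dropped the unused level: only levels seen during fitting are stored
fit$xlevels$g
names(coef(fit))

# Predicting for the unseen level raises an error; capture its message
res <- tryCatch(
  predict(fit, newdata = d[d$g == "d", ]),
  error = function(e) conditionMessage(e)
)
res
```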

The obvious solution is to prevent this from happening by adjusting the split-sampling algorithm you use to produce your training and test sets: instead of randomly sampling the rows of the data, randomly sample within the levels of the factor.
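
If you do want to handle the details yourself, a base-R sketch of sampling within levels could look like the following. The 1-in-5 test fraction and the names idx_by_level, test_idx, and train_idx are assumptions chosen for illustration.

```r
set.seed(42)
X1 <- factor(rep(1:3, times = c(20, 30, 50)))  # dummy factor for illustration

# Row indices grouped by factor level
idx_by_level <- split(seq_along(X1), X1)

# From each level, draw about one fifth of its rows (at least one) for the test set
test_idx <- unlist(lapply(idx_by_level, function(i) {
  n_test <- max(1L, length(i) %/% 5L)
  i[sample.int(length(i), n_test)]
}), use.names = FALSE)

# Everything else is training data
train_idx <- setdiff(seq_along(X1), test_idx)

# Both sets contain every level
table(X1[train_idx])
table(X1[test_idx])
```

Note that a level with only one row ends up in the test set alone in this sketch, so very rare levels still need separate handling.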

If you don't want to handle the details yourself, try the caret package and its function createFolds():

## install.packages("caret")
library("caret")

X1 <- factor(rep(1:3, times = c(20, 30, 50))) ## dummy data for illustration
f <- createFolds(X1, k = 5)
f

which gives:

> f <- createFolds(X1, k = 5)
> f
$Fold1
 [1]  5  7 10 20 21 24 29 31 34 42 51 52 59 68 75 76 82 83
[19] 85 94

$Fold2
 [1]  4  9 11 18 22 23 30 38 40 44 55 58 62 66 70 72 80 81
[19] 87 92

$Fold3
 [1]  1 12 14 16 27 37 41 48 49 50 53 60 61 63 64 74 79 88
[19] 89 97

$Fold4
 [1]  3 15 17 19 25 28 32 35 36 43 54 57 67 69 71 73 78 86
[19] 98 99

$Fold5
 [1]   2   6   8  13  26  33  39  45  46  47  56  65  77  84
[15]  90  91  93  95  96 100

The values in f are the indices of the elements of X1, partitioned into k = 5 groups, with the sampling done within the levels of X1 as needed. Then take one of these folds at random as the test set.

## number of samples in levels of X1 for each split
> table(X1[-f[[1]]])

 1  2  3 
16 24 40 
> table(X1[-f[[2]]])

 1  2  3 
16 24 40 
> table(X1[-f[[3]]])

 1  2  3 
16 24 40 
> table(X1[-f[[4]]])

 1  2  3 
16 24 40 
> table(X1[-f[[5]]])

 1  2  3 
16 24 40

Do note that, for small sample sizes, this algorithm doesn't guarantee that the stratified sampling will always work (i.e., you may not be able to escape the missing-levels issue in all cases).
