GLM – Removing Intercept from GLM for Multiple Factorial Predictors Only Works for First Factor in Model

binomial distributioncategorical datacategorical-encodinggeneralized linear modelintercept

I am running a binomial logistic regression with a logit link function in R. My response is factorial [0/1] and I have two multilevel factorial predictors – let's call them $a$ and $b$ where $a$ has 4 factor levels $(a_1,a_2,a_3,a_4)$ and $b$ has 9 factor levels $(b_1,b_2,\dotsc,b_9)$. Therefore:

mod <- glm(y ~ a + b, family = binomial(logit), data=pretend)
summary(mod)

The model output would then show all the information about the model as well as the coefficients.

There is a factor level for both a and b missing (a1 and b1) from the summary output. I understand that it is constrained in the "intercept" of the model.
I have read that if I want to remove the intercept term and see the estimates for those factor levels I can just add -1 to the model formula, i.e.:

mod2 <- glm(y ~ a + b - 1, family=binomial(logit), data=pretend)
summary(mod2)

In the new model (mod2) the intercept term is then gone and variable a's factor-level a1 is given amongst the list of coefficients. But, variable b's factor-level b1 is still missing and given that there is no intercept term anymore, how can I interpret the odds-ratio for that factor level then?

Could someone please explain to me how to get the coefficient for b1 too and why this is happening?

Best Answer

That trick of getting a parameter for each level of the factor by removing the intercept only works when there is only one factor, as you have seen. You can understand why by counting degrees of freedom: Let factor $a$ have $a$ levels, factor $b$ with $b$ levels. Then factor $a$ have $a-1$degrees of freedom, which means that the indicator matrix with $a$ columns representing with, with a $1$ in each row for the level present at that row, has rank $a-1$. Likewise factor $b$ has $b-1$ degrees of freedom. The intercept has one degree of freedom. So the model formula $ ~ a + b$ (which really is $ ~ a + b + 1$) has $1 + a-1 + b-1 = a+b-1$ degrees of freedom. Removing the intercept (model formula $ ~ a + b - 1$) represents the same model, only the parametrization changed. So it must also have $ a + b - 1 $ degrees of freedom. That $-1$ shows that that there cannot be $a+b$ parameters, so one of the factors still must get one parameter less than number of levels.

That explains what you have seen. But still you can get a coefficient for the missing level of $b$, which should be zero, simply. (depending on the contrasts you are using).

To make this a bit more explicit let us see at an example. I will use R for the matrix algebra. To make design matrices (in R parlance "model matrices") from factors, we need to define contrast functions. I use the default:

> options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly"

First we make two factors for a simple, fully crossed design:

a  <- factor(rep(letters[1:3], 3))
b  <- factor(rep(letters[1:3], each=3))

Then design matrices for each of them:

> X1 <- model.matrix( ~ a-1)
> X2 <- model.matrix( ~b-1)
> X1
  aa ab ac
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.treatment"

> X2
  ba bb bc
1  1  0  0
2  1  0  0
3  1  0  0
4  0  1  0
5  0  1  0
6  0  1  0
7  0  0  1
8  0  0  1
9  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$b
[1] "contr.treatment"

Each of them, separately, is of full rank:

library(MASS)
library(Matrix)  
> Matrix::rankMatrix(X1)
[1] 3
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.998401e-15
> Matrix::rankMatrix(X2)
[1] 3
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.998401e-15

But when combined there is a rank deficit, so they must have one dimension "in common":

rankMatrix(cbind(X1, X2))
[1] 5
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.998401e-15

To identify the common dimension we use the Null() function from package MASS, calculating the null space:

 Null(t(cbind(X1, X2)))
           [,1]
[1,] -0.4082483
[2,] -0.4082483
[3,] -0.4082483
[4,]  0.4082483
[5,]  0.4082483
[6,]  0.4082483

Yes, the common dimension is the constant vector.

Related Solutions

Solved – Can one obtain a p-value for all levels of factor when doing a GLM

You could fit the model with and without the categorical variable and do a likelihood ratio test to determine it's overall significance.

Solved – Level of factor taken as intercept

That's how "treatment contrasts" work. One column of the model matrix is taken by the "first" factor in that simple model. Each statistical system chooses a default contrast strategy, so R's is different that SAS or SPSS. If the model were more complex with multiple factor predictors, then the "Intercept" would apply to the cases who all had the base-level of the various factors. If there were continuous covariates then the intercept would be the predicted "effect" for a hypothetical case with all factors at the base level and all continuous predictors at zero. (Obviously this might not be a physically interpretable scenario.) You could in this instance use a different formula to get the labeling as you expected with:

glm(formula = Survive ~ Group + 0, family = binomial, data = dat2)

This is my attempt to reconstruct the results of that call:

Call: glm(formula = Survive ~ Group + 0, family = binomial, data = dat2) 
Deviance Residuals: Min    1Q   Median       3Q   Max
                 -1.354 -1.177   0.000    1.177 1.354 
Coefficients: Estimate Std. Error z value Pr(>|z|) 
GroupControl    0.4055     0.6455   0.628     0.53 
GroupCtrl Trt   0.0000     0.6325   0.000     1.00 
GroupTreatment -0.4055     0.6455  -0.628     0.53 
(Dispersion parameter for binomial family taken to be 1)

So that shows that the coefficient for the "Ctrl Trt"-Group was zero so that further implies an exactly 50% survival in that group. When you omit the Intercept in a single factor model, each of coefficients refer only to the log-odds for the individual factor levels. The "Treatment"-Group coefficient suggest that of a group of 10 subject that 4 out of 10 survived since exp(-.4055) [1] 0.6666434 is very close to 4/(10-4). And in your "Ctrl"-Group there was 6 out of ten survivors since exp(0.4055) [1] 1.500052 is very close to 6/(10-6). (Remembering that we are modeling odds, not probabilities.)

In general, it's better (as in less confusing to the "uninitiated") to not omit the intercept, but for a single factor model it can be helpful.

I'm actually having difficulty figuring out how you could have produced that particular result (two levels that have values whose absolute values are exactly equal to one-half of the value of the third level). I'm wondering if you have somehow duplicated cases? You should a) describe the data collection and b) post the output of str(dat2).

Best Answer

Related Solutions

Solved – Can one obtain a p-value for all levels of factor when doing a GLM

Solved – Level of factor taken as intercept

Related Question