Test in R whether coefficient estimates of categorical variables are different in linear regression

Tags: r, regression

How do I check in R whether the coefficients of different levels of a categorical variable are statistically the same? The model I have is:

Y = Intercept + X1 + X2 + X3 + X4

X1 and X2 are categorical variables with 3 and 4 levels, respectively; X3 and X4 are continuous. I have learned that in R one can test whether two continuous variables have the same coefficient using the following procedure. For instance, to test whether X3 and X4 have the same coefficient, I could do the following:

Model1 <- lm(y ~ X1 + X2 + I(X3 + X4))
Model2 <- lm(y ~ X1 + X2 + X3 + X4)

Then I can compare the two models with an F-test:

anova(Model1,Model2)
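To make the idea concrete, here is a self-contained sketch with simulated data (the data-generating values are made up for illustration; the categorical predictors are dropped for brevity):

```r
# Simulated illustration: X3 and X4 truly share the same coefficient
set.seed(42)
n  <- 100
X3 <- rnorm(n)
X4 <- rnorm(n)
y  <- 1 + 0.5 * X3 + 0.5 * X4 + rnorm(n)

# Restricted model: forces the coefficients of X3 and X4 to be equal
m_restricted <- lm(y ~ I(X3 + X4))
# Unrestricted model: separate coefficients
m_full <- lm(y ~ X3 + X4)

# F-test of the restriction; a large p-value means the
# equal-coefficient restriction is not rejected
anova(m_restricted, m_full)
```

Since the data were generated with equal coefficients, the p-value here should typically be large.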

Now, how do I check whether the coefficients for the different levels of X1 are the same? The regression output gives 2 coefficient estimates (or should I say intercepts?) for X1, since it has three levels. How can I check whether these estimates are statistically different from each other?

Best Answer

There is nothing too special about categorical variables when we use lm. If X1 has three levels, it is represented by three binary indicator variables whose sum is always one (i.e., exactly one of them equals one for any observation). We then want to test whether all the levels have the same coefficient. Let

set.seed(1)
df <- data.frame(y = rnorm(10), x = factor(sample(1:3, 10, replace = TRUE)))
(mod <- lm(y ~ x - 1, data = df))
#
# Call:
# lm(formula = y ~ x - 1, data = df)
#
# Coefficients:
#       x1        x2        x3  
#  0.64897  -0.30579  -0.02534
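The indicator coding can be inspected directly with model.matrix. A short sketch (regenerating the same df so it is self-contained):

```r
# Reconstruct the example data (same seed as above)
set.seed(1)
df <- data.frame(y = rnorm(10), x = factor(sample(1:3, 10, replace = TRUE)))

# Without an intercept, the design matrix has one indicator column per level
X <- model.matrix(y ~ x - 1, data = df)
head(X)

# Each row contains exactly one 1, so the indicators sum to one
all(rowSums(X) == 1)  # TRUE
```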

Hence, we want to test the null hypothesis H0 that x1, x2, and x3 have equal coefficients.

library(car)
linearHypothesis(mod, c("x1 = x2", "x2 = x3"))
# Linear hypothesis test
#
# Hypothesis:
# x1 - x2 = 0
# x2 - x3 = 0
#
# Model 1: restricted model
# Model 2: y ~ x - 1
#
#   Res.Df    RSS Df Sum of Sq      F Pr(>F)
# 1      9 5.4838                           
# 2      7 3.5987  2    1.8852 1.8335 0.2289

As expected, we cannot reject the null in this example.
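The same F-test can also be reproduced without the car package: under H0 all levels share one coefficient, so the restricted model is simply an intercept-only model, and anova compares the two fits directly (a sketch, regenerating the same data):

```r
set.seed(1)
df <- data.frame(y = rnorm(10), x = factor(sample(1:3, 10, replace = TRUE)))

mod        <- lm(y ~ x - 1, data = df)  # one coefficient per level
restricted <- lm(y ~ 1, data = df)      # all levels forced to be equal

# Same F statistic (1.8335) and p-value (0.2289) as linearHypothesis() above
anova(restricted, mod)
```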


There is also another, somewhat simpler way to see this. Now let

(mod <- lm(y ~ x, data = df))
#
# Call:
# lm(formula = y ~ x, data = df)
#
# Coefficients:
# (Intercept)           x2           x3  
#      0.6490      -0.9548      -0.6743  

so that now the interpretation of the coefficients of x2 and x3 is additive: e.g., when the level of x is 2, how much higher is y than when the level is 1? Hence, if all three levels have the same effect, x2 and x3 will have zero coefficients in this specification. Thus,

linearHypothesis(mod, c("x2 = 0", "x3 = 0"))
# Linear hypothesis test
#
# Hypothesis:
# x2 = 0
# x3 = 0
#
# Model 1: restricted model
# Model 2: y ~ x
#
#   Res.Df    RSS Df Sum of Sq      F Pr(>F)
# 1      9 5.4838                           
# 2      7 3.5987  2    1.8852 1.8335 0.2289

gives, as expected, the same p-value.


On the other hand, if all the levels have the same effect, then x carries no more information than a constant, i.e., the intercept. The first testing option above can therefore be seen as testing whether x is merely as useful as an intercept, and the second, equivalently, as testing whether x adds anything useful over the intercept.
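This equivalence can be checked directly: comparing the intercept-only model against the second specification with anova reproduces the same F-test (a sketch, regenerating the same data):

```r
set.seed(1)
df <- data.frame(y = rnorm(10), x = factor(sample(1:3, 10, replace = TRUE)))

# Intercept-only model vs. the default (with-intercept) factor coding;
# yields the same F statistic and p-value as both tests above
anova(lm(y ~ 1, data = df), lm(y ~ x, data = df))
```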
