Solved – Categorical variable coding to compare all levels to all levels

categorical datacategorical-encodingmodelingrregression

I am trying to determine the best coding system for my categorical variables to use in a regression with categorical and continuous variables. I have been using this page as a resource but none of the coding methods seem to match what I want to do: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm

I have reason to expect that there could be differences among all four levels of a categorical variable and therefore would like to code it to perform all pairwise comparisons (1vs2, 1vs3, 1vs4, 2vs3, 2vs4, 3vs4,). I thought that a Sum (deviation) method would accomplish this as it says it compares each level to the grand mean. But in my model I only see terms for levels 1, 2, 3, and do not have a term for level 4 of the variable. Is this level incorporated into the intercept? If so is there another coding system I should use to get at the comparisons I am looking for?

Best Answer

You want all possible pairwise comparisons of levels, but there are much more pairs than there are degrees of freedom in the factor. Say the factor has five levels, then you need 4 parameters to code it, but there are $\binom{5}{2}$ pairs, that is, 10 pairs. So it is imposible to find a coding with one parameter for each comparison.

The solution is to use whatever coding you wants, and then compute the 10 pairwise contrasts afterwards, after estimating the model, from the model output. In R, for instance, this could be done many ways , either "by hand", or with the use of packages like contrast or multcomp.

Below an R example, done "by hand", for confidence intervals of all pairwise comparisions:

xfac  <-  factor(rep(1:5, each=10))
y     <-  rnorm(50, mean=c(rep(0, 20), rep(1, 30)), sd=2)

mod   <-  lm( y ~ 0 + xfac)

# generating a hypothesis contrasts matrix with 10 rows:
# each row is one contrast:      
cmat  <-  matrix(0, 10, 5)
nam   <-  character(length=10)   
row   <-  0
for (i in 1:4) for (j in (i+1):5)   {
                   row  <-  row+1
                   nam[row]  <-  paste("x[", i, "]-x[", j, "]", sep="")
                   cmat[row, c(i, j)]  <-  c(1, -1)
               }
rownames(cmat)  <-  nam  

# We write a contrast testing function by hand:
my.contrast  <-  function(mod,  cmat)  {
    co  <-  coef(mod)
    CV  <-  vcov(mod)
    se  <-  sqrt( diag( cmat %*% CV %*% t(cmat) ))
    df  <-  mod$df.residual
    contr  <-  cmat %*% co
    ul  <-  qt(0.975, df=df)
    ci  <-  cbind(contr-ul*se, contr+ul*se)
    ci
}

And then using it gives the result:

> my.contrast(mod, cmat)
               [,1]      [,2]
x[1]-x[2] -1.946376 1.7921298
x[1]-x[3] -3.044916 0.6935897
x[1]-x[4] -2.136283 1.6022227
x[1]-x[5] -2.301393 1.4371135
x[2]-x[3] -2.967793 0.7707130
x[2]-x[4] -2.059160 1.6793460
x[2]-x[5] -2.224269 1.5142368
x[3]-x[4] -0.960620 2.7778861
x[3]-x[5] -1.125729 2.6127769
x[4]-x[5] -2.034362 1.7041439

Best Answer

Related Solutions

Categorical Encoding – Coding Categorical Variables for Regression

Related Question