Solved – Coding categorical variables for regression

categorical data, categorical-encoding, multiple regression

I'm not sure of the best way to code my categorical predictor variable for use in a hierarchical regression in order to test my specific hypothesis. This categorical variable has 3 levels representing 3 groups. I want to compare group 1 to group 2, group 1 to group 3, and group 2 to group 3. I know that dummy coding uses k-1 variables, so in my case I would create 2 dummy variables coded with 0s and 1s, choosing one level of the categorical variable as the reference category.

However, I'm not sure this is the best way to make the comparisons I want, because it appears I could only compare each group to the reference category. Am I correct? If group 3 were the reference category, I could compare group 1 to group 3 and group 2 to group 3, but not group 1 to group 2. What alternative coding method should I use to make these comparisons? My regression model will also contain continuous variables. I'm an undergraduate psychology student and statistics is not my strong point, so simple answers would be best for me. I use SPSS. Thank you!

Best Answer

Here is an example using the employee data.sav data set, which comes with a standard SPSS installation. Suppose salary is the dependent variable, job category (jobcat) is the categorical independent variable, and beginning salary (salbegin) is the continuous independent variable. Using GLM, you can perform pairwise comparisons between each pair of job categories. The steps are as follows:

1. With the data set open, go to Analyze > General Linear Model > Univariate.
2. Put the dependent variable and the independent variables into the correct slots. Categorical independent variables go to "Fixed Factor(s)" and continuous ones go to "Covariate(s)." Do not worry about the Random Factors. When it's all set, click the "Model" button.
3. In the Model panel, highlight the two independent variables, change the build term to "Main effects," and then click the arrow button to bring the two variables over. When all set, click "Continue."
4. Now, click the "Options" button.
5. In the Options panel, do the following: a) highlight jobcat; b) bring it over to the right by clicking the arrow button; c) check "Compare Main Effects"; d) specify the adjustment you'd like to make for the multiple pairwise comparisons (I left it as LSD, which does not adjust for multiple tests); e) check "Parameter Estimates" so that you'll also get the regression coefficients. When it's all done, click Continue and then OK to submit the test.

In the output, the regression coefficients appear in the Parameter Estimates table. Scroll down a bit and you'll find the Pairwise Comparisons table, which reports the comparison for each pair of job categories.
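For readers who want to cross-check the idea outside SPSS, the same kind of model can be sketched in R. This is only a minimal sketch on simulated data, not the employee data.sav file: the data frame d, the variables group, x, and y, and the effect sizes are all made up for illustration. It shows that refitting the model with a different reference category recovers the one pairwise comparison that a single dummy coding does not report directly.

```r
# Simulated stand-in data (hypothetical; NOT employee data.sav)
set.seed(1)
n <- 90
d <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = n / 3)),
  x     = rnorm(n, mean = 50, sd = 10)   # continuous covariate
)
# Group shifts of 0, 3, and 6 are arbitrary illustration values
d$y <- 10 + 0.5 * d$x + c(0, 3, 6)[as.integer(d$group)] + rnorm(n, sd = 2)

# Same factor, two different reference categories
d$group_ref3 <- relevel(d$group, ref = "g3")
d$group_ref2 <- relevel(d$group, ref = "g2")

# Reference g3: coefficients are g1 vs g3 and g2 vs g3 (adjusted for x)
fit_ref3 <- lm(y ~ group_ref3 + x, data = d)

# Reference g2: coefficients are g1 vs g2 and g3 vs g2 (adjusted for x)
fit_ref2 <- lm(y ~ group_ref2 + x, data = d)

coef(summary(fit_ref3))
coef(summary(fit_ref2))
```

Both fits are the same model with the same fitted values; only the parameterisation changes, so the g1 vs g2 coefficient in the second fit equals the difference of the two group coefficients in the first.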

Related Solutions

Solved – Multiple regression interaction with categorical IV with 3 levels

You can definitely do that. You can enter your categorical variable as a factor. If you have decided to use R, the following code would work:

new_categ <- factor(categ, labels = 0:2)

Then you can include interactions between the new factor and the other independent variables. You can also find examples close to your problem in Modern Applied Statistics with S-PLUS by Venables and Ripley. Even if you do not want to use R, its regression examples are still useful for working out how to solve your problem.
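As a rough illustration of that interaction step, here is a minimal sketch on simulated data. The variables categ, x, and y and the slope values are assumptions made up for the example; only the factor() call comes from the answer above.

```r
# Simulated stand-in data (hypothetical names and effect sizes)
set.seed(2)
categ  <- rep(1:3, each = 30)
x      <- rnorm(90)
slopes <- c(0.5, 1.0, 2.0)[categ]        # a different slope in each group
y      <- 1 + slopes * x + rnorm(90, sd = 0.5)

new_categ <- factor(categ, labels = 0:2)

# new_categ * x expands to new_categ + x + new_categ:x
fit_int  <- lm(y ~ new_categ * x)
fit_main <- lm(y ~ new_categ + x)

# F test of the two interaction terms taken together
anova(fit_main, fit_int)
summary(fit_int)
```

With a 3-level factor the interaction adds two coefficients (new_categ1:x and new_categ2:x), each describing how the slope of x in that group differs from the slope in the reference group.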

Solved – Categorical variable coding to compare all levels to all levels

You want all possible pairwise comparisons of levels, but there are many more pairs than there are degrees of freedom in the factor. Say the factor has five levels: you then need 4 parameters to code it, but there are $\binom{5}{2}$ pairs, that is, 10 pairs. So it is impossible to find a coding with one parameter for each comparison.
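The counting argument can be written out for a general factor with $k$ levels (the symbols $k$ and $\mu_i$ are introduced here for illustration, with $\mu_i$ the mean of level $i$):

$$
\underbrace{k-1}_{\text{parameters}} \quad\text{versus}\quad \underbrace{\binom{k}{2} = \frac{k(k-1)}{2}}_{\text{pairs}}, \qquad \frac{k(k-1)}{2} = k-1 \iff k = 2 .
$$

So the two counts agree only for a two-level factor. With three levels there are already 3 pairs but only 2 parameters, and the missing comparison is a linear combination of the coded ones:

$$
\mu_1 - \mu_2 = (\mu_1 - \mu_3) - (\mu_2 - \mu_3).
$$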

The solution is to use whatever coding you want, and then compute the 10 pairwise contrasts from the model output after estimating the model. In R, for instance, this can be done in many ways, either "by hand" or with packages such as contrast or multcomp.

Below is an R example, done "by hand", that computes confidence intervals for all pairwise comparisons:

xfac  <-  factor(rep(1:5, each=10))
y     <-  rnorm(50, mean=c(rep(0, 20), rep(1, 30)), sd=2)

mod   <-  lm( y ~ 0 + xfac)

# generating a hypothesis contrasts matrix with 10 rows:
# each row is one contrast:      
cmat  <-  matrix(0, 10, 5)
nam   <-  character(length=10)   
row   <-  0
for (i in 1:4) {
    for (j in (i+1):5) {
        row  <-  row+1
        nam[row]  <-  paste("x[", i, "]-x[", j, "]", sep="")
        cmat[row, c(i, j)]  <-  c(1, -1)
    }
}
rownames(cmat)  <-  nam  

# We write a contrast testing function by hand:
my.contrast  <-  function(mod,  cmat)  {
    co  <-  coef(mod)
    CV  <-  vcov(mod)
    se  <-  sqrt( diag( cmat %*% CV %*% t(cmat) ))
    df  <-  mod$df.residual
    contr  <-  cmat %*% co
    ul  <-  qt(0.975, df=df)
    ci  <-  cbind(contr-ul*se, contr+ul*se)
    ci
}
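In formula terms, the function above computes the following for each row $c$ of cmat, where $\hat\beta$ is coef(mod), $\hat V$ is vcov(mod), and $\nu$ is the residual degrees of freedom (this notation is introduced here for illustration and is not in the original answer):

$$
c^\top \hat\beta \;\pm\; t_{0.975,\,\nu}\,\sqrt{c^\top \hat V\, c}
$$

These are ordinary 95% intervals for each contrast taken one at a time; they are not adjusted for making 10 comparisons.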

And then using it gives the result:

> my.contrast(mod, cmat)
               [,1]      [,2]
x[1]-x[2] -1.946376 1.7921298
x[1]-x[3] -3.044916 0.6935897
x[1]-x[4] -2.136283 1.6022227
x[1]-x[5] -2.301393 1.4371135
x[2]-x[3] -2.967793 0.7707130
x[2]-x[4] -2.059160 1.6793460
x[2]-x[5] -2.224269 1.5142368
x[3]-x[4] -0.960620 2.7778861
x[3]-x[5] -1.125729 2.6127769
x[4]-x[5] -2.034362 1.7041439
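If tests are wanted in addition to intervals, the same ingredients give t statistics and p-values. The following is a sketch only: my.contrast.test is a hypothetical helper written for this illustration, the seed is arbitrary (so the numbers will not match the output above), and Holm is just one possible multiplicity adjustment.

```r
set.seed(3)                                  # arbitrary seed, for reproducibility only
xfac <- factor(rep(1:5, each = 10))
y    <- rnorm(50, mean = c(rep(0, 20), rep(1, 30)), sd = 2)
mod  <- lm(y ~ 0 + xfac)

# One row per pair (i, j): +1 in column i and -1 in column j
idx  <- combn(5, 2)
cmat <- matrix(0, ncol(idx), 5)
for (r in seq_len(ncol(idx))) cmat[r, idx[, r]] <- c(1, -1)
rownames(cmat) <- paste0("x[", idx[1, ], "]-x[", idx[2, ], "]")

# Hypothetical helper: estimate, standard error, t statistic, unadjusted p-value
my.contrast.test <- function(mod, cmat) {
  est  <- drop(cmat %*% coef(mod))
  se   <- sqrt(diag(cmat %*% vcov(mod) %*% t(cmat)))
  tval <- est / se
  p    <- 2 * pt(abs(tval), df = mod$df.residual, lower.tail = FALSE)
  cbind(estimate = est, se = se, t = tval, p = p)
}

res <- my.contrast.test(mod, cmat)
res

# Holm-adjusted p-values for the 10 comparisons
p.adjust(res[, "p"], method = "holm")
```

Because the model is fitted without an intercept, each coefficient is a group mean, so each estimate is simply the difference between two group means.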
