Solved – Recoding a variable with three levels into a dummy variable

categorical-data categorical-encoding

I need to recode the variable school setting (urban, sub-urban and rural settings) into a dummy variable. I know that when creating dummy variables, there is one fewer than the number of categories (so two rather than three) and that urban is the largest group and should be used as the baseline (not sure if baseline is the right word, but hopefully you know what I mean). However, I don't know what to do from here.

Is the variable changed into:
1. Two dummy variables: setting 1 with urban/sub-urban and setting 2 with urban/rural?
2. One variable where urban is given the value 1, and rural and sub-urban are given the value 0?
3. Both?
4. Have I completely missed something?

Hope you can help me!!

Best Answer

Actually, there are different types of dummy variable coding. The simplest method is redundant, since it uses three dummy variables:

          V1  V2  V3
urban      1   0   0
suburban   0   1   0
rural      0   0   1

Although redundant, this is also a valid encoding. A less redundant encoding needs only two dummy variables, with urban as the baseline (0 on both):

          V1   V2
urban      0    0
suburban   1    0
rural      0    1
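
If you want to build these two columns by hand, here is a minimal sketch in base R. The example vector `setting` and the data frame name `dummies` are made up for illustration; substitute your own data.

```r
# Hypothetical example data: one school setting per row.
setting <- c("urban", "suburban", "rural", "urban", "suburban")

# Two indicator columns. Urban is the baseline, so it is 0 in both.
dummies <- data.frame(
  setting  = setting,
  suburban = as.integer(setting == "suburban"),
  rural    = as.integer(setting == "rural")
)

print(dummies)
```

Each row has at most one 1, and urban rows are identified by having 0 in both columns.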

Note that, depending on your statistical software, you may not need to build the dummy variables yourself. In R, if you store your variable as a factor, model.matrix creates the dummy columns for you; you just use the variable in the formula passed to model.matrix and that's it.
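
As a rough illustration of the R route, the sketch below uses a made-up factor called `setting`. Because R orders factor levels alphabetically by default, it assumes you want to set urban as the reference level explicitly with relevel.

```r
# Hypothetical example data.
setting <- factor(c("urban", "suburban", "rural", "urban", "suburban"))

# Default level order is alphabetical (rural, suburban, urban),
# so make urban the reference level explicitly.
setting <- relevel(setting, ref = "urban")

# With an intercept, model.matrix() returns one dummy column
# per non-reference level (the two-variable coding).
X <- model.matrix(~ setting)
print(X)

# Dropping the intercept returns one column per level
# (the redundant three-variable coding).
X_full <- model.matrix(~ setting - 1)
print(X_full)
```

In `X_full` every row sums to 1, which is why those three columns cannot be used together with an intercept.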

However, you will still have to deal with contrast coding (how to get, out of your dummy variables, the comparisons that you are interested in). You will find a thorough description here.
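
To give a feel for what contrast coding looks like in R, here is a small sketch with a made-up factor `setting`. It assumes R's default contrast options are in effect; the switch to contr.sum is only one possible alternative, not a recommendation for your analysis.

```r
# Hypothetical factor with urban listed first so it acts as the baseline.
setting <- factor(c("urban", "suburban", "rural"),
                  levels = c("urban", "suburban", "rural"))

# Default for unordered factors: treatment contrasts.
# Each non-baseline level is compared with the first level (urban).
treatment <- contrasts(setting)
print(treatment)

# One alternative: sum-to-zero contrasts, where the last level
# is coded -1 in every column.
contrasts(setting) <- contr.sum(3)
print(contrasts(setting))
```

The treatment contrast matrix is the same as the two-variable table above: urban is 0 on both columns.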
