Solved – Coding Categorical Variables

categorical data, r, regression

Suppose I am building a linear model in R using standard OLS. I have 10 dummy variables (predictors) that correspond to different regions: 6 of these regions are in California, and the other 4 are in Texas. For example, my Northern California dummy variable is 1 if the observation comes from Northern California and 0 if not. I was thinking of creating a single categorical variable with the values 1:10, with each number corresponding to a different region. Would this affect my analysis negatively in any way? I have a feeling that this would just distort the interpretation of my estimated coefficients. I also have a dummy variable for California and Texas.

Best Answer

Depending on how you code the analysis, that could cause undesirable consequences. If you specify that your categorical variable is a factor, it will work fine as a nominal variable: the lm function will create dummy variables for you. If you store the variable as a numeric vector, lm will fit a single slope for it, which effectively tests a linear contrast: it treats your regions as differing on the outcome variable in the order your coding specifies, and by equally spaced amounts. You probably don't intend to do that. So if region is your categorical variable with values 1:10, code it something like this: lm(y ~ factor(region)) if you want R to create the dummy codes for you. Better yet, just store region as a factor-type object.
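To make the difference concrete, here is a minimal sketch on simulated data. The data frame dat, the outcome y, and the effect sizes are made up for illustration and are not from the question; only lm and factor are the functions discussed above.

```r
# Simulated data; the names dat, y, region and the effect sizes are
# illustrative assumptions, not the asker's actual data.
set.seed(1)
n <- 200
region <- sample(1:10, n, replace = TRUE)

# True region effects are deliberately NOT linear in the region number.
effects <- c(0, 5, -3, 2, 8, -1, 4, -6, 1, 3)
y <- effects[region] + rnorm(n)
dat <- data.frame(y = y, region = region)

# Numeric coding: a single slope, which assumes the regions are ordered
# and equally spaced in their effect on y.
fit_numeric <- lm(y ~ region, data = dat)

# Factor coding: lm builds 9 dummy variables, with region 1 as the
# reference level absorbed into the intercept.
fit_factor <- lm(y ~ factor(region), data = dat)

length(coef(fit_numeric))  # 2: intercept + one slope
length(coef(fit_factor))   # 10: intercept + 9 dummies
```

The numeric fit forces all ten regions onto one straight line, while the factor fit estimates a separate difference from the reference region for each of the other nine. Storing the column as a factor up front (for example, dat$region <- factor(dat$region)) lets you write lm(y ~ region, data = dat) and get the dummy coding automatically.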
