Suppose I am building a linear model in R using standard OLS. I have 10 dummy variables (predictors) that correspond to different regions; 6 of these regions are in California and the other 4 are in Texas. For example, my Northern California dummy variable is 1 if the observation comes from Northern California and 0 if not. I was thinking of creating a single categorical variable with the values 1:10, with each number corresponding to a different region. Would this affect my analysis negatively in any way? I have a feeling that this would just distort the interpretation of my estimated coefficients. I also have a dummy variable for California and Texas.
Solved – Coding Categorical Variables
categorical data, r, regression
Related Solutions
I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):
require(rms)
f <- lrm(Y ~ rcs(age, 4) + race)
anova(f) # 3 d.f. test for age, k-1 for race
# also prints 2 d.f. test of linearity in age
# age fit is restricted cubic spline with 4 default knots
When doing multiple imputation with the Hmisc package's aregImpute function or with the mice package, you would substitute the following for the 2nd line above:
f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)
which would adjust the covariance matrix for multiple imputation [n.impute is recommended to be the percent of observations that have any variable missing].
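The two claims above (a single $k-1$ d.f. test for race, and invariance to coding) can be checked with a rough base-R sketch. Note the assumptions: this uses glm and a likelihood-ratio test rather than lrm and the tests printed by anova in rms, and the data, level names, and effect sizes are all simulated and hypothetical.

```r
# Simulated data; every name and number here is hypothetical.
set.seed(1)
n <- 400
dat <- data.frame(
  age  = runif(n, 20, 80),
  race = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
race_effect <- c(A = 0, B = 0.5, C = -0.5, D = 1)
lp <- -2 + 0.03 * dat$age + race_effect[as.character(dat$race)]
dat$Y <- rbinom(n, 1, plogis(lp))

# "Chunk" test: drop all race dummies at once and compare the nested models
full    <- glm(Y ~ age + race, family = binomial, data = dat)
reduced <- glm(Y ~ age,        family = binomial, data = dat)
chunk   <- anova(reduced, full, test = "Chisq")
chunk   # the Df column shows k - 1 = 3 for the four race levels

# Recode race with a different reference level and repeat the test
dat$race2 <- relevel(dat$race, ref = "C")
full2  <- glm(Y ~ age + race2, family = binomial, data = dat)
chunk2 <- anova(reduced, full2, test = "Chisq")

# The chi-square statistic should agree up to numerical tolerance
all.equal(chunk$Deviance[2], chunk2$Deviance[2], tolerance = 1e-6)
```

The individual dummy coefficients differ between full and full2 because they are contrasts against different reference levels, but the joint test of all race dummies does not depend on which level is the reference.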
Just to answer one part of your question:
Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with k levels is divided into k−1 dummy variables and so on, but I fail to see how this is actually implemented with regard to the actual OLS estimation (formulas above). How would the matrix of coefficients above look, if we are dealing with a categorical variable and dummy coding?
Hopefully this block of code will help. Look at the X matrix:
set.seed(123987)
n <- 6
df <- data.frame(x=runif(n), categorical=factor(letters[1:3]))
df$y <- rnorm(n) + df$x + ifelse(df$categorical == "a", 0,
                          ifelse(df$categorical == "b", 2, 10))
fit <- lm(y ~ x + categorical, data=df)
fit$coefficients # Around -0.1, 2.5, 1.1 and 10.3
X <- matrix(1, nrow=n, ncol=length(fit$coefficients))
X[, 2] <- df$x
X[, 3] <- 1*(df$categorical == "b")
X[, 4] <- 1*(df$categorical == "c")
colnames(X) <- c("constant", "x", "indicator for b", "indicator for c") # Aka dummies
Y <- matrix(df$y, ncol=1)
beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
max(abs(beta_hat - fit$coefficients)) # Very small -- essentially equal
isTRUE(all.equal(beta_hat, unname(fit$coefficients))) # Should be TRUE; unname() is needed because all.equal also compares names
The matrix X has one column of 1s (the constant); a column of df$x (a continuous predictor); a column that is 1 when the categorical variable equals "b", and zero otherwise; and similarly for "c". The value "a" is omitted and serves as the reference level: since we have a constant, an indicator for "a" would be perfectly collinear with the other columns.
Best Answer
Depending on how you code the analysis, that could cause undesirable consequences. If you specify that your categorical variable is a factor, it will work fine as a nominal variable: the lm function will create dummy variables for you. If you store the variable as a numeric vector, lm will instead fit a single slope for it, which amounts to testing a linear contrast in which the regions differ on the outcome variable in the order your coding specifies, and by equally spaced amounts. You probably don't intend to do that. So if region is your categorical variable with values 1:10, code it something like this: lm(y ~ factor(region)) if you want to have dummy codes created for you in R. Better yet, just store region as a factor-type object.
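A minimal sketch of the difference, on simulated data (the region means, sample sizes, and object names below are hypothetical and chosen so that the region codes carry no meaningful order):

```r
set.seed(42)
# Hypothetical data: 10 regions whose true means are not ordered by their code
region_means <- c(5, 1, 9, 3, 7, 2, 8, 4, 10, 6)
dat <- data.frame(region = rep(1:10, each = 20))
dat$y <- region_means[dat$region] + rnorm(nrow(dat))

fit_num <- lm(y ~ region, data = dat)          # region treated as a number
fit_fac <- lm(y ~ factor(region), data = dat)  # region treated as nominal

length(coef(fit_num))  # 2: intercept plus a single slope
length(coef(fit_fac))  # 10: intercept plus 9 dummies (region 1 is the reference)

# The numeric coding is nested in the factor coding, so an F test compares them
anova(fit_num, fit_fac)
```

With the numeric coding, the whole region effect is forced into one slope per unit of the code, so the fit is only sensible if the regions really are ordered and evenly spaced. The factor coding estimates a separate mean shift for each region relative to the reference.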