Solved – Coding Categorical Variables

categorical-data, r, regression

Suppose I am building a linear model in R, using standard OLS. I have 10 dummy variables (predictors) that correspond to different regions. 6 of these regions are in California, and the other 4 are in Texas. For example, my Northern California dummy variable is 1 if the observation comes from Northern California and 0 if not. I was thinking of replacing the dummies with a single categorical variable taking the values 1:10, with each number corresponding to a different region. Would this affect my analysis negatively in any way? I have a feeling that it would distort the interpretation of my estimated coefficients. I also have dummy variables for California and Texas.

Best Answer

How you code the variable matters. If you declare the categorical variable as a factor, it works fine as a nominal variable: the lm function creates the dummy variables for you. If you store it as a numeric vector, lm treats it as a single continuous predictor and fits one slope. That amounts to testing a linear contrast: the regions are assumed to differ on the outcome variable in the order of your codes, and by equally spaced amounts. You probably don't intend that. So if region is your categorical variable with values 1:10, code it something like lm(y ~ factor(region)) if you want the dummy codes created for you in R. Better yet, just store region as a factor-type object.
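The difference between the two codings can be seen in a small simulation. This is only a sketch: the data, the region effects, and the object names (effect, fit_numeric, fit_factor) are made up for illustration and are not from the question.

```r
# Hypothetical data: 10 regions coded 1:10, 20 observations each.
set.seed(1)
region <- rep(1:10, each = 20)
# Assumed region effects, deliberately not linear in the region code.
effect <- c(0, 5, -3, 8, 1, -6, 4, 2, -1, 7)
y <- effect[region] + rnorm(length(region))

# Numeric coding: a single slope, so regions are forced onto an
# equally spaced line in the order of their codes.
fit_numeric <- lm(y ~ region)

# Factor coding: an intercept (region 1) plus 9 dummy coefficients.
fit_factor <- lm(y ~ factor(region))

length(coef(fit_numeric))  # 2
length(coef(fit_factor))   # 10

# With the factor coding, the intercept is the mean of the reference region.
all.equal(unname(coef(fit_factor)[1]), mean(y[region == 1]))
```

With the factor coding each remaining coefficient is the difference between that region's mean and the mean of region 1, which is usually the interpretation people want from region dummies.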

Related Solutions

Solved – Interpretation of logistic regression intercept with one dummy coded categorical variable

I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):

library(rms)
f <- lrm(Y ~ rcs(age, 4) + race)
anova(f)   # 3 d.f. test for age, k-1 for race
# also prints 2 d.f. test of linearity in age
# age fit is restricted cubic spline with 4 default knots

When doing multiple imputation with the Hmisc package aregImpute function or with the mice package, you would substitute the following for the 2nd line above:

f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)

which would adjust the covariance matrix for multiple imputation [n.impute recommended to be the percent of observations that have any variable missing].
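For readers without rms installed, the idea of a single "chunk" test that does not depend on the coding can be sketched in base R. Note the assumptions: this uses glm with a likelihood ratio test and a linear age term, not lrm, rcs, or the test statistics that anova prints for an lrm fit, and the data are simulated with made-up names and effect sizes.

```r
# Simulated data; names and effect sizes are assumptions for illustration.
set.seed(42)
n    <- 400
age  <- runif(n, 20, 80)
race <- factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
lp   <- -2 + 0.03 * age + c(a = 0, b = 0.5, c = -0.5, d = 1)[as.character(race)]
Y    <- rbinom(n, 1, plogis(lp))

full    <- glm(Y ~ age + race, family = binomial)
reduced <- glm(Y ~ age, family = binomial)

# One test for all race coefficients at once: k - 1 = 3 d.f. for k = 4 levels.
chunk <- anova(reduced, full, test = "Chisq")
chunk

# Changing the reference level changes the individual coefficients
# but not the fit, so the chunk test is unaffected.
full2 <- glm(Y ~ age + relevel(race, ref = "c"), family = binomial)
all.equal(deviance(full), deviance(full2))
```

The individual race coefficients and their p-values differ between full and full2, which is why inference about race as a whole is better based on the joint test than on any single dummy.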

Solved – Estimation process in OLS with categorical variables and dummy coding

Just to answer one part of your question:

Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with k levels is divided into k−1 dummy variables and so on, but I fail to see how this is actually implemented with regard to the actual OLS estimation (formulas above). How would the matrix of coefficients above look, if we are dealing with a categorical variable and dummy coding?

Hopefully this block of code will help. Look at the X matrix:

set.seed(123987)
n    <- 6
df   <- data.frame(x=runif(n), categorical=factor(letters[1:3]))
df$y <- rnorm(n) + df$x + ifelse(df$categorical == "a", 0,
                                 ifelse(df$categorical == "b", 2, 10))
fit  <- lm(y ~ x + categorical, data=df)
fit$coefficients  # Around -0.1, 2.5, 1.1 and 10.3

X      <- matrix(1, nrow=n, ncol=length(fit$coefficients))
X[, 2] <- df$x
X[, 3] <- 1*(df$categorical == "b")
X[, 4] <- 1*(df$categorical == "c")
colnames(X) <- c("constant", "x", "indicator for b", "indicator for c")  # Aka dummies
Y <- matrix(df$y, ncol=1)

beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
max(abs(beta_hat - fit$coefficients))          # Very small -- essentially equal
isTRUE(all.equal(beta_hat, unname(fit$coefficients)))  # all.equal also compares names, so drop them first

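The matrix X has one column of 1s (the constant); a column of df$x (a continuous predictor); a column that is 1 when the categorical variable equals "b", and zero otherwise; and similarly for "c". Level "a" is the omitted reference level: adding its indicator alongside the constant would make the columns of X linearly dependent, so the coefficients on "b" and "c" are differences relative to "a".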
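As a cross-check, the hand-built matrix can be compared with the design matrix R constructs for the same formula. This sketch rebuilds the toy data so it runs on its own, and it assumes R's default treatment contrasts are in effect; the names X_auto and X_manual are introduced here for illustration.

```r
# Same toy setup as in the answer, rebuilt so the snippet is self-contained.
set.seed(123987)
n  <- 6
df <- data.frame(x = runif(n), categorical = factor(letters[1:3]))

# Design matrix R builds from the formula (default treatment contrasts assumed).
X_auto <- model.matrix(~ x + categorical, data = df)
colnames(X_auto)  # "(Intercept)" "x" "categoricalb" "categoricalc"

# Hand-built version: constant, x, indicator for "b", indicator for "c".
X_manual <- cbind(1, df$x, df$categorical == "b", df$categorical == "c")

# Same numbers; only the names and extra attributes differ.
all.equal(unname(X_manual), unname(X_auto), check.attributes = FALSE)  # TRUE
```

So lm does nothing special for factors at the estimation stage: it expands them into indicator columns and then solves the usual least squares problem on that matrix.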

