Solved – Estimation process in OLS with categorical variables and dummy coding

categorical data, least squares

In my question (Cox model on bank customers) regarding the estimation process in regression with categorical variables, @Scortchi wrote the following:

Any coefficient in a multiple regression model represents the relationship between a predictor & the response holding constant all the other predictors in the model. So not only the point estimate, p-value, &c. of a coefficient change when you add in other terms, but also its interpretation.

This I understand in terms of interpreting the coefficients: holding every predictor constant but one, and varying that one predictor, gives us its relationship with the response. He kindly gave me some links to other posts, and one in particular (Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?) piqued my interest. There, @gung explains how one controls for a second variable when regressing on a first variable (if I have formulated this correctly).

Now, I understand Ordinary Least Squares in the following way:
$y=X\beta$, where $X$ is the $n\times m$ design matrix (the matrix of predictor values). This equation generally has no exact solution ($y$ is not in the column space of $X$), so we must project $y$ onto the column space of $X$ to get $\hat{y}$, the vector of OLS fitted values (illustrated below for $m=2$):

[Figure: projection of $y$ onto $\text{Col}(X)$]

This gives us $\hat{\beta}=(X^TX)^{-1}X^Ty$.
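
To make sure I have this right, here is a minimal R sketch (my own toy data, just one continuous predictor plus an intercept) checking that this formula matches lm() and that $\hat{y}$ really is the projection of $y$ onto the column space of $X$:

set.seed(1)
n <- 10
x <- runif(n)
y <- 1 + 2*x + rnorm(n)

X        <- cbind(constant = 1, x = x)           # n x m design matrix, here m = 2
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (X'X)^{-1} X'y
H        <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix: projects onto Col(X)
y_hat    <- H %*% y                              # projection of y onto Col(X)

fit <- lm(y ~ x)
max(abs(beta_hat - coef(fit)))    # essentially zero
max(abs(y_hat - fitted(fit)))     # essentially zero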

Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with $k$ levels is encoded as $k-1$ dummy variables and so on, but I fail to see how this is actually implemented in the OLS estimation (formulas above). How would the design matrix $X$ above look if we are dealing with a categorical variable and dummy coding?

In @gung's answer (linked above) he shows how controlling for a second variable means we are fitting a plane instead of a line. Now, is this plane he mentions the (imagined) tilted plane, rising upwards and to the right, that connects the red, blue and green marks in his second plot (below)?

[Figure: @gung's second plot]

(Note: this plot is from @gung's answer here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?)

How does this (imagined) tilted plane relate to the OLS estimation formula $\hat{\beta}=(X^TX)^{-1}X^Ty$?
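
If I am thinking about it correctly, the plane is traced out by the fitted values $\hat{y}=X\hat{\beta}$ once there are two predictors; a small R sketch of what I mean (again my own toy data, not @gung's):

set.seed(2)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)

X        <- cbind(1, x1, x2)                     # two predictors plus a constant
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # same formula as before
# The fitted surface is the plane b0 + b1*x1 + b2*x2, evaluated here on a grid:
grid         <- expand.grid(x1 = seq(0, 1, by = 0.25), x2 = seq(0, 1, by = 0.25))
plane_height <- beta_hat[1] + beta_hat[2]*grid$x1 + beta_hat[3]*grid$x2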

Also, I asked @gung the following in a comment:

… Regarding your first plot, we have three OLS lines, each giving us a value for $\beta_1$. What happens if these three values of $\beta_1$ are different?…

I imagine such a plane, if I understand anything at all, would be "bent" (and some extra dimension would come into play). His answer is:

…Briefly, when you have 2 X variables, you are fitting a plane instead of a line. If the appropriate slope of the y~x2 relationship changes as x1 increases, that means there is an interaction b/t x1&x2.
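
In R terms, I take that to mean something like the following (continuing with the toy x1, x2 and y from above; the model names are mine, not @gung's):

fit_plane <- lm(y ~ x1 + x2)   # additive model: one plane, the slope on x2 is the same everywhere
fit_inter <- lm(y ~ x1 * x2)   # adds the x1:x2 interaction, letting the slope on x2 change with x1
coef(fit_inter)                # the extra "x1:x2" coefficient is the interaction term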

Best Answer

Just to answer one part of your question:

Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with $k$ levels is encoded as $k-1$ dummy variables and so on, but I fail to see how this is actually implemented in the OLS estimation (formulas above). How would the design matrix $X$ above look if we are dealing with a categorical variable and dummy coding?

Hopefully this block of code will help. Look at the X matrix:

set.seed(123987)
n    <- 6
df   <- data.frame(x=runif(n), categorical=factor(letters[1:3]))
df$y <- rnorm(n) + df$x + ifelse(df$categorical == "a", 0,
                                 ifelse(df$categorical == "b", 2, 10))
fit  <- lm(y ~ x + categorical, data=df)
fit$coefficients  # Around -0.1, 2.5, 1.1 and 10.3

X      <- matrix(1, nrow=n, ncol=length(fit$coefficients))
X[, 2] <- df$x
X[, 3] <- 1*(df$categorical == "b")
X[, 4] <- 1*(df$categorical == "c")
colnames(X) <- c("constant", "x", "indicator for b", "indicator for c")  # Aka dummies
Y <- matrix(df$y, ncol=1)

beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
max(abs(beta_hat - fit$coefficients))          # Very small -- essentially equal
isTRUE(all.equal(beta_hat, fit$coefficients))  # FALSE, but only because fit$coefficients carries names and beta_hat does not

The matrix X has one column of 1s (the constant); a column of df$x (a continuous predictor); a column that is 1 when the categorical variable equals "b" and zero otherwise; and similarly for "c". The level "a" gets no column of its own: since the model already has a constant, "a" is the reference level and is absorbed into the intercept.
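
For comparison, model.matrix() returns the design matrix that lm() built internally; it should match the hand-built X above up to column names:

model.matrix(fit)                  # intercept, x, categoricalb, categoricalc
max(abs(model.matrix(fit) - X))    # 0 -- identical to the hand-built X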
