Just to answer one part of your question:
Now I am having difficulties understanding how this is done with
categorical variables. I have read about dummy coding, and that a
categorical variable with k levels is divided into k−1 dummy variables
and so on, but I fail to see how this is actually implemented with
regard to the actual OLS estimation (formulas above). How would the
matrix of coefficients above look, if we are dealing with a
categorical variable and dummy coding?
Hopefully this block of code will help. Look at the X matrix:
set.seed(123987)
n <- 6
df <- data.frame(x=runif(n), categorical=factor(letters[1:3]))
df$y <- rnorm(n) + df$x + ifelse(df$categorical == "a", 0,
ifelse(df$categorical == "b", 2, 10))
fit <- lm(y ~ x + categorical, data=df)
fit$coefficients # Around -0.1, 2.5, 1.1 and 10.3
X <- matrix(1, nrow=n, ncol=length(fit$coefficients))
X[, 2] <- df$x
X[, 3] <- 1*(df$categorical == "b")
X[, 4] <- 1*(df$categorical == "c")
colnames(X) <- c("constant", "x", "indicator for b", "indicator for c") # Aka dummies
Y <- matrix(df$y, ncol=1)
beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y) # OLS via the normal equations: (X'X)^(-1) X'Y
max(abs(beta_hat - fit$coefficients)) # Very small -- essentially equal
isTRUE(all.equal(beta_hat, fit$coefficients)) # FALSE, but only because beta_hat lacks the coefficient names
The matrix X has one column of 1s (the constant); a column of df$x (a continuous predictor); a column that is 1 when the categorical variable equals "b" and zero otherwise; and a similar column for "c". The level "a" gets no column of its own: since the constant is in the model, "a" acts as the reference level whose effect is absorbed by the intercept, and the coefficients on the "b" and "c" dummies measure differences from "a".
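As a cross-check, R builds this same design matrix internally when fitting the model; model.matrix(fit) returns it, with column names like categoricalb and categoricalc instead of "indicator for b" and "indicator for c":
model.matrix(fit)                 # R's own design matrix for the fit above
max(abs(model.matrix(fit) - X))   # 0 -- same values as the hand-built X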
Best Answer
Your interpretation of the continuous predictors you have entered in the regression model seems to be somewhat mistaken. A more appropriate way to understand a coefficient is as "the expected increase/decrease in the dependent variable for a one-unit change in the independent variable". It appears that you have confused this with the interpretation of the R² of the overall regression model. The interpretation of dummy variables follows the same principle: you can think of the coefficient as the expected increase/decrease in the dependent variable for a change from 0 to 1 in the independent variable.
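To make that contrast concrete, here is a minimal sketch with made-up data (names and numbers are mine, not from your model): the slope is a per-unit effect, while R² describes the fit of the whole model.
set.seed(42)
x <- runif(100)
y <- 3 + 2 * x + rnorm(100)
fit_cont <- lm(y ~ x)
coef(fit_cont)["x"]          # close to 2: expected change in y per one-unit increase in x
summary(fit_cont)$r.squared  # proportion of variance explained by the model as a whole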
Imagine you have dummy coded a variable representing gender and, for the sake of this example, let Male = 0 and Female = 1. Let's say the dependent variable is the time (in seconds) to complete a 100 m race. An unstandardized regression coefficient of +1.5 would mean that when the independent variable equals 1 (female), the expected time to run 100 m is 1.5 seconds longer than for males (the condition male = 0). Note that this refers to unstandardized regression coefficients; the discussion would not differ much for standardized regression coefficients.
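A small simulated sketch of this (the numbers are invented purely for illustration):
set.seed(1)
n <- 200
gender <- factor(rep(c("Male", "Female"), each = n/2), levels = c("Male", "Female"))
# Simulated 100 m times: about 13 s for males, 1.5 s slower for females, plus noise
time_100m <- 13 + 1.5 * (gender == "Female") + rnorm(n, sd = 0.5)
coef(lm(time_100m ~ gender))   # "genderFemale" should land near +1.5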
In the context of a multiple regression, the interpretation of a dummy independent variable is no different from what I just described; the only caveat is that the regression coefficient should be interpreted under the assumption that the remaining independent variables in the model are held constant (controlled for).
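Extending the same sketch with a hypothetical continuous covariate (say, weekly training hours) shows that the dummy coefficient keeps its interpretation, now holding the covariate fixed:
set.seed(2)
n <- 200
gender <- factor(rep(c("Male", "Female"), each = n/2), levels = c("Male", "Female"))
training <- rnorm(n, mean = 5, sd = 1)   # hypothetical covariate: weekly training hours
time_100m <- 13 + 1.5 * (gender == "Female") - 0.3 * training + rnorm(n, sd = 0.5)
coef(lm(time_100m ~ gender + training))
# "genderFemale" is still roughly +1.5: the expected female-male difference in time,
# holding training constant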