Solved – Estimation process in OLS with categorical variables and dummy coding

categorical data, least squares

In my question (Cox model on bank customers) regarding the estimation process in regression with categorical variables, @Scortchi wrote the following:

Any coefficient in a multiple regression model represents the relationship between a predictor & the response holding constant all the other predictors in the model. So not only the point estimate, p-value, &c. of a coefficient change when you add in other terms, but also its interpretation.

This I understand in terms of interpreting the coefficients: holding every predictor constant but one, and varying that one predictor, gives us its relationship with the response. He kindly gave me some links to other posts, and one in particular (Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?) piqued my interest. There, @gung explains how one controls for a second variable when regressing on a first variable (if I have formulated this correctly).

Now, I understand Ordinary Least Squares in the following way:
$y=X\beta$, where $X$ is the $n\times m$ design matrix (the matrix of predictor values). This equation generally has no exact solution ($y$ is not in the column space of $X$), so we must project $y$ onto the column space of $X$ to get $\hat{y}$, the vector of OLS fitted values (illustrated below for $m=2$):

[Figure: projection of $y$ onto $\text{Col}(X)$]

This gives us $\hat{\beta}=(X^TX)^{-1}X^Ty$.
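
To make sure I have this right, here is a minimal R sketch (my own toy data, just one continuous predictor plus an intercept) checking that this formula matches lm() and that $\hat{y}$ really is the projection of $y$ onto the column space of $X$:

set.seed(1)
n <- 10
x <- runif(n)
y <- 1 + 2*x + rnorm(n)

X        <- cbind(constant = 1, x = x)           # n x m design matrix, here m = 2
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (X'X)^{-1} X'y
H        <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix: projects onto Col(X)
y_hat    <- H %*% y                              # projection of y onto Col(X)

fit <- lm(y ~ x)
max(abs(beta_hat - coef(fit)))    # essentially zero
max(abs(y_hat - fitted(fit)))     # essentially zero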

Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with $k$ levels is encoded as $k-1$ dummy variables and so on, but I fail to see how this is actually implemented in the OLS estimation (formulas above). How would the design matrix $X$ above look if we are dealing with a categorical variable and dummy coding?

In @gung's answer (linked above) he shows how controlling for a second variable means we are fitting a plane instead of a line. Now, is this plane he mentions the (imagined) tilted plane, rising upwards and to the right, that connects the red, blue and green marks in his second plot (below)?

[Figure: @gung's second plot]

(Note: this plot is from @gung's answer here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?)

How does this (imagined) tilted plane relate to the OLS estimation formula $\hat{\beta}=(X^TX)^{-1}X^Ty$?
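
If I am thinking about it correctly, the plane is traced out by the fitted values $\hat{y}=X\hat{\beta}$ once there are two predictors; a small R sketch of what I mean (again my own toy data, not @gung's):

set.seed(2)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)

X        <- cbind(1, x1, x2)                     # two predictors plus a constant
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # same formula as before
# The fitted surface is the plane b0 + b1*x1 + b2*x2, evaluated here on a grid:
grid         <- expand.grid(x1 = seq(0, 1, by = 0.25), x2 = seq(0, 1, by = 0.25))
plane_height <- beta_hat[1] + beta_hat[2]*grid$x1 + beta_hat[3]*grid$x2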

Also, I asked @gung the following in a comment:

… Regarding your first plot, we have three OLS lines, each giving us a value for $\beta_1$. What happens if these three values of $\beta_1$ are different?…

I imagine such a plane, if I understand anything at all, would be "bent" (and some extra dimension would come into play). His answer is:

…Briefly, when you have 2 X variables, you are fitting a plane instead of a line. If the appropriate slope of the y~x2 relationship changes as x1 increases, that means there is an interaction b/t x1&x2.
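
In R terms, I take that to mean something like the following (continuing with the toy x1, x2 and y from above; the model names are mine, not @gung's):

fit_plane <- lm(y ~ x1 + x2)   # additive model: one plane, the slope on x2 is the same everywhere
fit_inter <- lm(y ~ x1 * x2)   # adds the x1:x2 interaction, letting the slope on x2 change with x1
coef(fit_inter)                # the extra "x1:x2" coefficient is the interaction term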

Best Answer

Just to answer one part of your question:

Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with $k$ levels is encoded as $k-1$ dummy variables and so on, but I fail to see how this is actually implemented in the OLS estimation (formulas above). How would the design matrix $X$ above look if we are dealing with a categorical variable and dummy coding?

Hopefully this block of code will help. Look at the X matrix:

set.seed(123987)
n    <- 6
df   <- data.frame(x=runif(n), categorical=factor(letters[1:3]))
df$y <- rnorm(n) + df$x + ifelse(df$categorical == "a", 0,
                                 ifelse(df$categorical == "b", 2, 10))
fit  <- lm(y ~ x + categorical, data=df)
fit$coefficients  # Around -0.1, 2.5, 1.1 and 10.3

X      <- matrix(1, nrow=n, ncol=length(fit$coefficients))
X[, 2] <- df$x
X[, 3] <- 1*(df$categorical == "b")
X[, 4] <- 1*(df$categorical == "c")
colnames(X) <- c("constant", "x", "indicator for b", "indicator for c")  # Aka dummies
Y <- matrix(df$y, ncol=1)

beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
max(abs(beta_hat - fit$coefficients))          # Very small -- essentially equal
isTRUE(all.equal(beta_hat, fit$coefficients))  # FALSE, but only because fit$coefficients carries names and beta_hat does not

The matrix X has one column of 1s (the constant); a column of df$x (a continuous predictor); a column that is 1 when the categorical variable equals "b" and zero otherwise; and similarly for "c". The level "a" gets no column of its own: since the model already has a constant, "a" is the reference level and is absorbed into the intercept.
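
For comparison, model.matrix() returns the design matrix that lm() built internally; it should match the hand-built X above up to column names:

model.matrix(fit)                  # intercept, x, categoricalb, categoricalc
max(abs(model.matrix(fit) - X))    # 0 -- identical to the hand-built X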
