Solved – Control for multiple confounding categorical variables using regression

Tags: categorical data, controlling-for-a-variable, multiple regression, regression

Assume I have an x matrix and a y matrix (input and outcome) as follows:

x: facility gender ... | y: length_of_stay
         A      M  ... |                3
         B      F  ... |               12 
         A      F  ... |                4 
         C      M  ... |                3 
         C      F  ... |                6 
         A      M  ... |                9

y has only one column. x has many columns (facility, gender, and others not shown here). facility and gender are categorical. x and y can have any number of rows, but always the same number. Assume that I intend to analyze the columns other than facility and gender, and that facility and gender are confounding variables.

How do I adjust length_of_stay so that it is controlled for facility and gender?

I believe that the answer is to do multiple regression between facility, gender, and length of stay. However, I'm fuzzy on the details of creating the dummy variables (how do I avoid the dummy variable trap when I have an intercept at the origin, and how do I avoid the dummy variable trap when I don't have an intercept at the origin?). I'm also fuzzy on how to use the betas after completing multiple regression to adjust length_of_stay.

Best Answer

  1. How many levels are there of facility and gender?
  2. What program are you using? If you use R and convert facility and gender to factors with as.factor(), the basic lm() function will create the dummy codes for you. The reference category is the first level alphabetically (A for facility, F for gender).
  3. The code is as simple as: lm(length_of_stay ~ facility + gender, data=df), where df is the data frame that contains both the x columns and y.
  4. The coefficients can be interpreted as follows. For facility, each coefficient is the difference in expected length_of_stay between that facility and the reference facility, after controlling for gender. For gender (assuming we are sticking to the gender binary), the coefficient is the difference between male and female after controlling for facility.
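
To make the points above concrete, here is a minimal R sketch using only the six rows shown in the question. It prints the dummy columns R builds with and without an intercept, fits the model, and then shows one possible way to "adjust" length_of_stay. The adjustment step (residuals plus the overall mean) and the column name los_adjusted are my own assumptions for illustration; the answer above does not prescribe them.

```r
# Toy data: the six rows from the question (other x columns omitted).
df <- data.frame(
  facility = factor(c("A", "B", "A", "C", "C", "A")),
  gender = factor(c("M", "F", "F", "M", "F", "M")),
  length_of_stay = c(3, 12, 4, 3, 6, 9)
)

# With an intercept, R drops the reference level of every factor,
# which is what avoids the dummy variable trap.
X_with <- model.matrix(~ facility + gender, data = df)
print(colnames(X_with))
# columns: (Intercept), facilityB, facilityC, genderM

# Without an intercept (~ 0 + ...), the first factor keeps all of its
# levels and only the later factors drop their reference level.
X_without <- model.matrix(~ 0 + facility + gender, data = df)
print(colnames(X_without))
# columns: facilityA, facilityB, facilityC, genderM

# Both codings describe the same model, so the fitted values agree.
fit <- lm(length_of_stay ~ facility + gender, data = df)
fit_no_int <- lm(length_of_stay ~ 0 + facility + gender, data = df)
print(coef(fit))

# Assumed adjustment (hypothetical, not from the answer): keep the part of
# length_of_stay that facility and gender do not explain, then add back
# the overall mean so the values stay on the original scale.
df$los_adjusted <- residuals(fit) + mean(df$length_of_stay)
print(df)
```

In this additive model the adjusted values have the same mean in every facility and for each gender, which is the sense in which they are "controlled". Treat this as a sketch rather than a recommendation: it is often simpler to keep facility and gender as covariates in the model that also contains the variables of interest.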