Assume I have an `x` and a `y` matrix (input and outcome) as follows:
x: facility gender ... | y: length_of_stay
A M ... | 3
B F ... | 12
A F ... | 4
C M ... | 3
C F ... | 6
A M ... | 9
`y` has only one dimension. `x` has many dimensions (`facility`, `gender`, and others not of interest). `facility` and `gender` are categorical. `x` and `y` are of arbitrary length, but the same length. Assume that I intend to do analysis on dimensions other than `facility` and `gender`, and assume that `facility` and `gender` are confounding variables.
How do I adjust `length_of_stay` so that it is controlled for `facility` and `gender`?
I believe that the answer is to do multiple regression between `facility`, `gender`, and `length of stay`. However, I'm fuzzy on the details of creating the dummy variables (how do I avoid the dummy variable trap when I have an intercept at the origin, and how do I avoid the dummy variable trap when I don't have an intercept at the origin?). I'm also fuzzy on how to use the betas after completing multiple regression to adjust `length_of_stay`.
Best Answer
In R, if you store the categorical variables as factors with `as.factor()`, then the basic `lm()` function will create dummy codes for you, with the reference category being the first alphabetically (in this case, `A`).

The model would be `lm(length_of_stay ~ facility + gender, data=df)`, where `df` is the data frame that includes both the `x` and `y` matrices.

For `facility`, each coefficient represents the difference between the reference category and one of the other categories after controlling for gender. For gender (making the assumption that we are sticking to the gender binary), the coefficient would represent the difference between male and female after controlling for facility.
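
As a rough sketch of how this might look in practice, the snippet below rebuilds the six rows from the question, fits the model, and inspects the dummy columns that `lm()` generates. The names `fit`, `no_intercept_cols`, and `los_adjusted` are made up for illustration. The last step (residuals plus the grand mean) is one common convention for producing an "adjusted" outcome and is an assumption on my part, not something stated in the answer above.

```r
# Toy data: the six rows shown in the question (other columns of x omitted).
df <- data.frame(
  facility = as.factor(c("A", "B", "A", "C", "C", "A")),
  gender = as.factor(c("M", "F", "F", "M", "F", "M")),
  length_of_stay = c(3, 12, 4, 3, 6, 9)
)

# Fit the regression; lm() expands each factor into dummy columns itself.
fit <- lm(length_of_stay ~ facility + gender, data = df)

# With an intercept, each factor gets k - 1 dummies (the first level is the
# reference), which is how the dummy variable trap is avoided.
colnames(model.matrix(fit))
# "(Intercept)" "facilityB" "facilityC" "genderM"

# Without an intercept (~ 0 + ...), the first factor keeps all of its levels
# and the remaining factors still drop their reference level.
no_intercept_cols <- colnames(model.matrix(~ 0 + facility + gender, data = df))
no_intercept_cols
# "facilityA" "facilityB" "facilityC" "genderM"

# The betas: differences from the reference levels (facility A, gender F).
coef(fit)

# Assumed adjustment: strip out the fitted facility/gender effects and add
# back the overall mean so the values stay on the original scale.
df$los_adjusted <- residuals(fit) + mean(df$length_of_stay)
df
```

With only six rows the estimates themselves are not meaningful; the point is the mechanics. If the other dimensions of `x` are the real interest, an alternative is to keep `facility` and `gender` in the same model as those variables rather than adjusting `length_of_stay` beforehand.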