Solved – Interpretation of logistic regression intercept with one dummy coded categorical variable

categorical-data, logistic

Feel free to critique my overall approach instead of answering my question directly.

I want to look at bivariate relationships among a binary outcome and multiple predictor variables before conducting multiple regression. EDIT: [The data are multiply imputed] and the categorical predictors have been dummy coded. Therefore, for each nominal categorical variable in the original data, there are k-1 dummy variables, with the omitted category serving as the reference group.

It seemed to me that there was no meaningful "bivariate" relationship between the outcome and a single [dummy] variable from a collection of related variables (this would change the comparison category to "everything else," which could not be the case in a multiple regression). For this reason, I did not use bivariate correlations. Instead, I did a series of "bivariate" logistic regressions including in a single regression all k-1 dummies of a variable. So, like: outcome = race_black race_other (race_white omitted). In any case, the reference category cannot be examined directly since it was omitted in the multiple imputation process and has missing values.

I figured if any coefficient of a binary dummy were significant, I would include the whole variable (group of k-1 dummies) in the omnibus regression.

After laboriously compiling a table of the results of these "bivariate" regressions without including the intercept, it occurred to me that (MAYBE?) the intercept is the "dummy" for the omitted category. If it is significant, is it in fact indicating that the omitted group is different from the mean on the outcome? If that's true, should I include the variable in my omnibus regression even if none of the explicit category variables are significant predictors?

I'm worried that I am thinking wrongly about the comparison groups and meanings of significant coefficients.

Best Answer

I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):

require(rms)
f <- lrm(Y ~ rcs(age, 4) + race)
anova(f)   # 3 d.f. test for age, k-1 for race
# also prints 2 d.f. test of linearity in age
# age fit is restricted cubic spline with 4 default knots

When doing multiple imputation with the Hmisc package aregImpute function or with the mice package, you would substitute the following for the 2nd line above:

f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)

which would adjust the covariance matrix for multiple imputation (the recommended value of n.impute is the percentage of observations that have any variable missing).
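
On the intercept question itself: with dummy coding, the intercept in a model containing only the race dummies is the log odds of the outcome in the omitted (reference) category, and its Wald test asks whether that log odds is zero (a probability of 0.5), not whether the reference group differs from the overall mean. So a significant intercept says nothing about whether race belongs in the model. Here is a minimal sketch with simulated data, using base R glm rather than lrm so that it runs without extra packages. The data, the effect sizes, and the names race, y, fit, fit2, null and lr are all made up for illustration:

```r
# Simulated data; category labels and probabilities are illustrative assumptions.
set.seed(1)
n <- 600
race <- factor(sample(c("white", "black", "other"), n, replace = TRUE),
               levels = c("white", "black", "other"))  # "white" is the reference
p <- c(white = 0.30, black = 0.45, other = 0.35)[as.character(race)]
y <- rbinom(n, 1, p)

fit  <- glm(y ~ race, family = binomial)  # k - 1 = 2 dummies plus intercept
null <- glm(y ~ 1,    family = binomial)  # intercept only

# The intercept equals the observed log odds in the reference category.
b0 <- unname(coef(fit)["(Intercept)"])
ref_logodds <- qlogis(mean(y[race == "white"]))
print(all.equal(b0, ref_logodds, tolerance = 1e-6))

# Chunk test: likelihood ratio chi-square for race as a whole, k - 1 = 2 d.f.
lr <- anova(null, fit, test = "Chisq")
print(lr)

# Switching the reference category changes the coefficients,
# but the fit (and therefore the chunk test) is unchanged.
fit2 <- glm(y ~ relevel(race, ref = "black"), family = binomial)
print(all.equal(deviance(fit), deviance(fit2)))
```

The 2 d.f. test in lr comes out the same whichever category is the reference, because both fits have the same deviance; only the individual coefficients and their p-values move around. That is why the decision about race should rest on the chunk test and not on the intercept or on any single dummy.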