Multinomial Logistic Regression – How to Code Outcome Variables with ‘I Don’t Know’ Options

categorical-encoding, dependent variable, logistic, regression

I have a categorical outcome variable, "Type of intervention", with 3 levels: "Type A", "Type B", and "cannot decide". The "cannot decide" option is meaningful, i.e. membership in one or more of my between-subjects conditions could result in more "cannot decide" answers.

Based on my limited statistics knowledge, I should use a multinomial logistic model, since the outcome variable is categorical and has more than 2 levels.
My question is about the type of coding needed for the outcome: usually, multinomial regressions use dummy coding, with "0" row values for the reference category. However, I don't think I have a meaningful reference category here; "cannot decide" does not sound like one.
I am essentially comparing choices of "Type A" with choices of "Type B". How should I code the outcome variable in this case?
Thank you very much in advance.

Best Answer

(This all assumes that there is no order to your three categories. I suspect that is the case, though I could believe that they can be put in order, even if I struggle to say why. If there is an order, you would want ordinal logistic regression, not multinomial.)

There is no special meaning to the reference category.

In a logistic regression with two categories ("regular" logistic regression), we label one category as $1$ and the other as $0$. There is no special meaning to this. If we want to distinguish dogs from cats, dogs can be coded as $1$ or cats can be coded as $1$, and little about the analysis changes: the coefficients flip sign and nothing else moves. However you do it, the logistic regression estimates the probability of being a dog and the probability of being a cat.
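A small simulation can make the point about the two-category case concrete. This is only a sketch on made-up data (the variable names and the simulated dog/cat labels are assumptions for illustration): the same outcome is coded both ways, and the two fits are compared.

```r
# Simulated data (hypothetical): one predictor and a dog/cat label
set.seed(1)
n <- 200
x <- rnorm(n)
animal <- ifelse(runif(n) < plogis(0.5 + 1.2 * x), "dog", "cat")

# Two codings of the same outcome
dog_is_1 <- as.integer(animal == "dog")
cat_is_1 <- as.integer(animal == "cat")

fit_dog <- glm(dog_is_1 ~ x, family = binomial)
fit_cat <- glm(cat_is_1 ~ x, family = binomial)

# The coefficients are the same in magnitude with opposite signs
coef(fit_dog)
coef(fit_cat)

# Fitted P(dog) from the first model plus fitted P(cat) from the second
# should be 1 for every observation, up to floating-point error
p_dog <- fitted(fit_dog)
p_cat <- fitted(fit_cat)
max(abs(p_dog + p_cat - 1))
```

Whichever label gets the $1$, the estimated probability for each animal is the same; only the sign convention of the coefficients differs.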

Ditto for multinomial logistic regression.

If you have dogs, cats, horses, and crocodiles, you can consider any of them to be the "zero" category, and the rest of the analysis follows. If "dog" is the "zero" category, the multinomial logistic regression will explicitly give the probability of being a cat, a horse, and a crocodile. Whatever is needed to get the sum of the probabilities to $1$ is the probability of being a dog.
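Written out, this is the usual baseline-category form of the model. The notation is an assumption for illustration and does not come from the question: $K$ categories labeled $0, \dots, K-1$ with category $0$ as the reference, a predictor vector $x$, and one coefficient vector $\beta_k$ per non-reference category.

$$
P(Y = k \mid x) = \frac{\exp(x^\top \beta_k)}{1 + \sum_{j=1}^{K-1} \exp(x^\top \beta_j)}, \quad k = 1, \dots, K-1,
\qquad
P(Y = 0 \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(x^\top \beta_j)}.
$$

The $K$ probabilities sum to $1$ by construction. Switching the reference to some other category $r$ replaces each $\beta_k$ by $\beta_k - \beta_r$, which leaves every probability unchanged.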

In your model, as long as you know how the coding works, it does not matter which of the three categories you use as your reference category. If A is the reference, the model will explicitly give probabilities of B and UNDECIDED. If UNDECIDED is the reference category, the model will explicitly give probabilities of A and B, and the rest of the probability is the probability of being UNDECIDED.
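The claim that the reference category does not matter can be checked directly. The sketch below uses simulated data (the outcome is generated at random, so the numbers mean nothing in themselves) and fits the same model twice with nnet::multinom, once with "Type A" and once with "Cannot Decide" as the reference level.

```r
library(nnet)

set.seed(2021)
n <- 300
x <- rnorm(n)
y <- factor(sample(c("Type A", "Type B", "Cannot Decide"), n, replace = TRUE))

# Same outcome, two different reference (first) levels
y_ref_a <- relevel(y, ref = "Type A")
y_ref_u <- relevel(y, ref = "Cannot Decide")

fit_a <- multinom(y_ref_a ~ x, trace = FALSE)
fit_u <- multinom(y_ref_u ~ x, trace = FALSE)

# The coefficient tables differ because each row is a contrast with the reference
coef(fit_a)
coef(fit_u)

# The fitted probabilities agree, up to optimizer tolerance
p_a <- fitted(fit_a)
p_u <- fitted(fit_u)[, colnames(p_a)]
max(abs(p_a - p_u))

# The "Type B" versus "Type A" contrast can be recovered from either fit
b_vs_a_direct  <- coef(fit_a)["Type B", ]
b_vs_a_derived <- coef(fit_u)["Type B", ] - coef(fit_u)["Type A", ]
rbind(b_vs_a_direct, b_vs_a_derived)
```

If the comparison of interest is A versus B, making one of them the reference simply puts that contrast in the coefficient table directly; with "Cannot Decide" as the reference it is the difference of two rows.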

In fact, depending on the software, you might have a function that gives explicit probabilities of each category. If you're an R user, you might find it helpful to model your work on the following simulation.

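library(nnet)
set.seed(2021)
N <- 12
# Simulate an unordered three-level outcome and one continuous predictor
y <- factor(sample(c("Type A", "Type B", "Cannot Decide"), N, replace = TRUE))
x <- rnorm(N)
# Fit the multinomial logistic regression; the first factor level is the reference
L <- nnet::multinom(y ~ x)
fitted(L)  # one column of fitted probabilities per category

# Check that the fitted probabilities for each observation
# add up to one
rowsums <- rowSums(fitted(L))
print(rowsums)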
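For predictor values that were not in the data, the predict method for multinom fits accepts type = "probs" and returns one probability per category. A minimal sketch, again on simulated data (the data frame d and the new x values are made up for illustration):

```r
library(nnet)

set.seed(2021)
d <- data.frame(
  y = factor(sample(c("Type A", "Type B", "Cannot Decide"), 60, replace = TRUE)),
  x = rnorm(60)
)
fit <- multinom(y ~ x, data = d, trace = FALSE)

# Probabilities of all three categories at three new predictor values
new_x <- data.frame(x = c(-1, 0, 1))
probs <- predict(fit, newdata = new_x, type = "probs")
probs
rowSums(probs)  # each row sums to one

# The single most probable category for each new observation
predict(fit, newdata = new_x, type = "class")
```

The reference category gets its own column in the output, so there is no need to compute it by subtraction.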

Related Solutions

Solved – Interpretation of logistic regression intercept with one dummy coded categorical variable

I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):

require(rms)
f <- lrm(Y ~ rcs(age, 4) + race)
anova(f)   # 3 d.f. test for age, k-1 for race
# also prints 2 d.f. test of linearity in age
# age fit is restricted cubic spline with 4 default knots

When doing multiple imputation with the Hmisc package aregImpute function or with the mice package, you would substitute the following for the 2nd line above:

f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)

which would adjust the covariance matrix for multiple imputation [n.impute recommended to be the percent of observations that have any variable missing].
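The invariance of the chunk test to coding can be illustrated without rms. The sketch below is an assumption-laden stand-in, not the procedure above: it uses base R glm on simulated data, a linear age term instead of a restricted cubic spline, and a likelihood-ratio test rather than the Wald test that anova on an rms fit reports. The helper chunk_test is hypothetical.

```r
set.seed(42)
n <- 400
age  <- runif(n, 20, 80)
race <- factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
Y    <- rbinom(n, 1, plogis(-2 + 0.03 * age + 0.5 * (race == "c")))

# Hypothetical helper: overall (k - 1) d.f. test for a factor, adjusting for age
chunk_test <- function(race_var) {
  reduced <- glm(Y ~ age, family = binomial)
  full    <- glm(Y ~ age + race_var, family = binomial)
  anova(reduced, full, test = "Chisq")
}

t1 <- chunk_test(race)                      # reference level "a"
t2 <- chunk_test(relevel(race, ref = "d"))  # reference level "d"

# Same chi-square statistic and same degrees of freedom under either coding
c(chisq = t1$Deviance[2], df = t1$Df[2])
c(chisq = t2$Deviance[2], df = t2$Df[2])
```

The individual dummy coefficients and their p-values do change with the reference level, which is why the overall test is the safer basis for judging the importance of the variable.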

Categorical Encoding – Coding Categorical Variables for Regression

Here is an example using the employee data.sav data set, which comes with the standard SPSS installation. Suppose salary is the dependent variable, job category, jobcat, is the categorical independent variable, and beginning salary, salbegin, is the continuous independent variable. Using GLM, you can perform pairwise comparisons between each pair of job categories. The steps are as follows:

1. With the data set open, go to Analyze > General Linear Model > Univariate.
2. Put the dependent variable and independent variables into the correct slots. Categorical independent variables go to "Fixed Factor(s)" and continuous ones go to "Covariate(s)." Do not worry about the Random Factors. When it's all set, click the "Model" button.
3. In the Model panel, highlight the two independent variables, change the build term to "Main effects," and click the arrow button to bring the two variables over. When all set, click "Continue."
4. Now, click the "Options" button.
5. In the Options panel, do the following: 1) Highlight jobcat, 2) bring it over to the right by clicking the arrow button, 3) check "Compare Main Effects", 4) specify the adjustment you'd like to make for the multiple pairwise comparisons (I left it as LSD, which does not adjust for multiple tests), 5) check "Parameter Estimates" so that you'll also get the regression coefficients. When it's all done, click Continue and then OK to submit the test.

The output includes the regression coefficient table; scroll down a bit and you'll find the pairwise comparisons table.
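For readers working in R rather than the menus above, the same kind of analysis can be sketched with lm. Everything here is an assumption for illustration: the data are simulated stand-ins that borrow the variable names salary, jobcat, and salbegin, not the real employee data, and the job category labels are made up.

```r
# Simulated stand-in for the employee data (hypothetical values)
set.seed(7)
n <- 150
d <- data.frame(
  jobcat   = factor(sample(c("Clerical", "Custodial", "Manager"), n, replace = TRUE)),
  salbegin = round(rnorm(n, 17000, 3000))
)
shift <- c(Clerical = 0, Custodial = 2000, Manager = 12000)
d$salary <- 5000 + 1.5 * d$salbegin +
  unname(shift[as.character(d$jobcat)]) + rnorm(n, 0, 4000)

# Main-effects model: categorical factor plus continuous covariate
fit <- lm(salary ~ jobcat + salbegin, data = d)
coef_tab <- summary(fit)$coefficients
coef_tab  # regression coefficients, reference level "Clerical"

# Refit with another reference level to obtain the remaining pairwise comparison
d2 <- d
d2$jobcat <- relevel(d2$jobcat, ref = "Custodial")
fit2 <- lm(salary ~ jobcat + salbegin, data = d2)

# Unadjusted p-values for the three pairwise comparisons (no multiplicity correction)
p_unadj <- c(
  "Custodial - Clerical" = coef_tab["jobcatCustodial", "Pr(>|t|)"],
  "Manager - Clerical"   = coef_tab["jobcatManager", "Pr(>|t|)"],
  "Manager - Custodial"  = summary(fit2)$coefficients["jobcatManager", "Pr(>|t|)"]
)
p_unadj

# One possible adjustment for multiple comparisons
p_bonf <- p.adjust(p_unadj, method = "bonferroni")
p_bonf
```

Because the model has no interaction term, each pairwise difference in adjusted means is just a difference between two jobcat coefficients, so changing the reference level is enough to read off every pair.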
