Multinomial Logistic Regression – How to Code Outcome Variables with ‘I Don’t Know’ Options

categorical-encoding, dependent variable, logistic, regression

I have a categorical outcome variable, "Type of intervention", with three levels: "Type A", "Type B", and "cannot decide". The "cannot decide" option is meaningful, i.e. membership in one or more of my between-subjects conditions could plausibly lead to more "cannot decide" answers.

Based on my limited statistics knowledge, I should use a multinomial logistic model, since the outcome variable is categorical and has more than two levels.
My question concerns the coding needed for the outcome: multinomial regressions usually use dummy coding, with "0" row values for the reference category. However, I don't think I have a meaningful reference category here; "cannot decide" does not sound like one.
I am essentially comparing choices of "Type A" with choices of "Type B". How should I code the outcome variable in this case?
Thank you very much in advance.

Best Answer

(This all assumes that your three categories have no inherent order. I suspect that is the case, though I could believe that they can be put in order, even if I struggle to say why. If there is an order, you would want ordinal logistic regression, not multinomial.)

There is no special meaning to the reference category.

In a logistic regression with two categories ("regular" logistic regression), we label one category as $1$ and the other as $0$. There is no special meaning to this. If we want to distinguish dogs from cats, dogs can be coded as $1$ or cats can be coded as $1$, and little about the analysis changes: the coefficients simply flip sign. However you do it, the logistic regression estimates the probability of being a dog and the probability of being a cat.

Ditto for multinomial logistic regression.

If you have dogs, cats, horses, and crocodiles, you can consider any of them to be the "zero" category, and the rest of the analysis follows. If "dog" is the "zero" category, the multinomial logistic regression will explicitly give the probability of being a cat, a horse, and a crocodile. Whatever is needed to get the sum of the probabilities to $1$ is the probability of being a dog.

In your model, as long as you know how the coding works, it does not matter which of the three categories you use as your reference category. If A is the reference, the model will explicitly give probabilities of B and UNDECIDED. If UNDECIDED is the reference category, the model will explicitly give probabilities of A and B, and the rest of the probability is the probability of being UNDECIDED.
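Written out for your three categories with UNDECIDED as the reference, the model looks roughly like this (a sketch in standard multinomial-logit notation; the symbols $x$, $\beta_A$, and $\beta_B$ are my own labels, not anything from your data):

$$
P(A \mid x) = \frac{e^{x^\top \beta_A}}{1 + e^{x^\top \beta_A} + e^{x^\top \beta_B}}, \qquad
P(B \mid x) = \frac{e^{x^\top \beta_B}}{1 + e^{x^\top \beta_A} + e^{x^\top \beta_B}}, \qquad
P(U \mid x) = \frac{1}{1 + e^{x^\top \beta_A} + e^{x^\top \beta_B}}
$$

Switching the reference to A only re-parameterizes the same model: the coefficient vectors become $\beta_B - \beta_A$ for B and $-\beta_A$ for UNDECIDED, and the three probabilities are unchanged. In particular, the A-versus-B comparison you care about is $\log\frac{P(A \mid x)}{P(B \mid x)} = x^\top(\beta_A - \beta_B)$, which you can recover under any choice of reference.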

In fact, depending on the software, you might have a function that gives explicit probabilities of each category. If you're an R user, you might find it helpful to model your work on the following simulation.

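library(nnet)

set.seed(2021)
N <- 12
# Simulate an unordered three-level outcome and one numeric predictor
y <- factor(sample(c("Type A", "Type B", "Cannot Decide"), N, replace = TRUE))
x <- rnorm(N)
# Fit the model; the first factor level ("Cannot Decide") is the reference
L <- multinom(y ~ x)
# One row per observation, one column of fitted probabilities per category
fitted(L)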

# Check that the fitted probabilities for each observation
# add up to one
#
rowsums <- rowSums(fitted(L))
print(rowsums)
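
If you want to convince yourself that the reference level changes only the coefficients and not the fitted probabilities, here is a minimal sketch that refits the same kind of simulated data under two different references using relevel(). The larger N and the names y_refA, y_refU, fit_A, fit_U, and max_diff are my own choices for illustration, and the two fits should agree only up to the optimizer's tolerance.

```r
library(nnet)

set.seed(2021)
N <- 200  # assumed sample size, larger than the toy example so the fits are stable
y <- factor(sample(c("Type A", "Type B", "Cannot Decide"), N, replace = TRUE))
x <- rnorm(N)

# Same outcome, two different reference levels
y_refA <- relevel(y, ref = "Type A")
y_refU <- relevel(y, ref = "Cannot Decide")

fit_A <- multinom(y_refA ~ x, trace = FALSE)
fit_U <- multinom(y_refU ~ x, trace = FALSE)

# Coefficients differ: each row is a log-odds contrast against the reference level
coef(fit_A)
coef(fit_U)

# Fitted probabilities should match once the columns are put in the same order
p_A <- fitted(fit_A)
p_U <- fitted(fit_U)[, colnames(p_A)]
max_diff <- max(abs(p_A - p_U))
max_diff
```

The rows of coef(fit_A) are the "Cannot Decide" and "Type B" contrasts against "Type A", so the "Type B" row is the direct A-versus-B comparison.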