Solved – multiple choice data simple logistic regression or multinomial logistic regression

logistic, multinomial-distribution, r, survey

I have a survey question where the respondent can check one choice, or two at most. The question looks like this:

What is the most important characteristic when you buy chocolate?

  • Sweetness (characteristic 1)
  • Packaging (characteristic 2)
  • Price (characteristic 3)
  • … (characteristic 4)
  • … (characteristic 5)
  • … (characteristic 6)

In my data, I have 6 columns (one per characteristic), each with the value 1 if the respondent checked that characteristic and 0 otherwise.

I would like to fit a regression where the outcome is the answer to this question and the predictors are socio-demographic variables (some numerical, such as age, and others categorical, such as gender).

Now I wonder how to proceed. Should I fit a simple logistic regression 6 times, with the same predictors but a different outcome each time?

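  • glm(caracteristic1 ~ age + gender, family = binomial)
  • glm(caracteristic2 ~ age + gender, family = binomial)
  • glm(caracteristic3 ~ age + gender, family = binomial)
  • glm(caracteristic4 ~ age + gender, family = binomial)
  • glm(caracteristic5 ~ age + gender, family = binomial)
  • glm(caracteristic6 ~ age + gender, family = binomial)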

Or is it possible to use another method, such as multinomial logistic regression? If so, which R packages would you recommend?

Best Answer

If the respondent were allowed to select only one of the characteristics, then multinomial logistic regression would be an acceptable model choice. However, since you allow the respondent to select up to two characteristics, you cannot use the multinomial model directly. This is because the multinomial model assumes that each response follows a multinomial distribution with a single trial: out of, say, $n$ categories, each respondent falls into exactly one.
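To make the constraint concrete, write $y_{ij}$ for the 0/1 indicator that respondent $i$ checked characteristic $j$ (this notation is introduced here for illustration and is not part of the original question). The single-trial multinomial model requires each row of indicators to sum to exactly one, whereas this survey allows a row sum of one or two:

$$
\sum_{j=1}^{n} y_{ij} = 1 \quad \text{(multinomial, one trial per respondent)},
\qquad
\sum_{j=1}^{n} y_{ij} \in \{1, 2\} \quad \text{(this survey, } n = 6\text{)}.
$$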

You can model the categories separately, with one binary logistic regression per characteristic, but then you will not take into account the correlation between categories.
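A minimal sketch of the separate-models approach in base R is shown here. It assumes a data frame with the six 0/1 indicator columns plus `age` and `gender`; the data are simulated and the object names (`survey`, `picks`, `fits`, `coef_table`) are hypothetical stand-ins, not the asker's actual data.

```r
# Simulated stand-in for the survey data; all names and values are hypothetical.
set.seed(1)
n <- 200
survey <- data.frame(
  age    = round(runif(n, 18, 70)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Each respondent ticks one or two of the six characteristics.
outcomes <- paste0("caracteristic", 1:6)
picks <- matrix(0L, nrow = n, ncol = 6, dimnames = list(NULL, outcomes))
for (i in seq_len(n)) {
  k <- sample(1:2, 1)            # number of boxes ticked (1 or 2)
  picks[i, sample(6, k)] <- 1L   # which characteristics were ticked
}
survey <- cbind(survey, picks)

# One binary logistic regression per characteristic, same predictors each time.
fits <- lapply(outcomes, function(y) {
  glm(reformulate(c("age", "gender"), response = y),
      family = binomial, data = survey)
})
names(fits) <- outcomes

# Coefficient table: one row per characteristic, one column per term.
coef_table <- t(sapply(fits, coef))
print(round(coef_table, 3))
```

Each row of `coef_table` comes from an independent fit, so nothing in this sketch accounts for the association between characteristics.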

If you do wish to consider correlation between categories, then you should consider multivariate-response regression models for categorical responses. I can point you to two (theoretical) resources:
1. "Marginal Models for Correlated Categorical Responses", Sec. 3.5 of Fahrmeir and Tutz's book "Multivariate Statistical Modelling Based on Generalized Linear Models" (try searching for this book online).
2. Glonek and McCullagh (1995), "Multivariate Logistic Models", which gives the theory behind the extension of the binary logistic model to multivariate responses. They also provide a computational scheme for finding maximum-likelihood estimates.
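
As a rough illustration of what such a model parameterises, consider just two of the binary responses, $Y_1$ and $Y_2$, with joint cell probabilities $\pi_{ab} = P(Y_1 = a, Y_2 = b)$. In the bivariate case, the logistic parameterisation uses the two marginal logits plus a log odds ratio for the association. The symbols below are introduced here for illustration; this is a sketch of the two-response case only, not a restatement of the general construction in the paper:

$$
\eta_1 = \log\frac{P(Y_1 = 1)}{P(Y_1 = 0)}, \qquad
\eta_2 = \log\frac{P(Y_2 = 1)}{P(Y_2 = 0)}, \qquad
\eta_{12} = \log\frac{\pi_{11}\,\pi_{00}}{\pi_{10}\,\pi_{01}}.
$$

Each of $\eta_1$, $\eta_2$ and $\eta_{12}$ can be given its own linear predictor in the covariates. Fitting separate logistic regressions models $\eta_1$ and $\eta_2$ only and leaves the association term $\eta_{12}$ unmodelled.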

The bad news is that I don't know of any R package that does any of the above for you, so you will either have to search harder or write the relevant functions yourself.
