Using GLM: Gaussian, Poisson vs Gamma

gamma distribution, generalized linear model, poisson distribution, regression

I am trying to perform a GLM analysis in R for an outcome that is:

  1. Bounded by 0 – 10
  2. In steps of 1

(Numerical Rating Scale for Pain: 0 – 10)

I have a set of demographic factors (age, sex, etc.) that I want to use as predictors in the GLM.

I understand that Gaussian might not be the best option (the outcome is bounded at 0), but I am not sure whether I should choose Gamma (the outcome is not continuous) or Poisson (the outcome is not a count).

The data are very much skewed.

thanks

s

Best Answer

This answer elaborates on some discussion in comments on the answer from Nick Cox.

Your situation might be handled by a multi-category extension of binomial regression: ordinal regression. You model the probability of moving from one category to the next in a way that takes advantage of the ordering among the outcome categories.

This UCLA web page illustrates ordinal logistic regression, based on a "proportional odds" (PO) assumption for moving up the scale. I don't know whether that assumption will hold for your data, but the page does show how to evaluate it.
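As a minimal sketch of what such a fit might look like in R: the data below are simulated stand-ins (not the asker's data), and the names age10, sex, score, and nrs are made up for illustration. It uses polr() from the MASS package, which fits a proportional odds model, and then does an informal check of the PO assumption by refitting separate binary logistic models at several cutpoints and comparing their slopes.

```r
library(MASS)  # provides polr(); MASS is a recommended package shipped with R

set.seed(42)
n <- 200

# Simulated stand-in data: two demographic predictors (hypothetical names)
d <- data.frame(
  age10 = runif(n, 2, 8),                                 # age in decades
  sex   = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Hypothetical latent pain level, cut into an integer 0-10 score that clumps at 0
latent  <- 0.3 * (d$age10 - 5) + 0.5 * (d$sex == "M") + rlogis(n)
d$score <- findInterval(latent, seq(-0.5, 4, by = 0.5))   # integers 0..10
d$nrs   <- factor(d$score, ordered = TRUE)                # only observed levels are kept

# Proportional odds model; polr() uses logit P(Y <= j) = zeta_j - (linear predictor),
# so a positive coefficient means a shift toward higher scores
fit <- polr(nrs ~ age10 + sex, data = d, Hess = TRUE)
print(summary(fit))

# Informal PO check: separate binary logistic models of P(score >= k).
# Under proportional odds the slopes should be similar across cutpoints k.
slopes <- sapply(1:4, function(k) {
  coef(glm(as.integer(score >= k) ~ age10 + sex, data = d, family = binomial))[-1]
})
colnames(slopes) <- paste0("score>=", 1:4)
print(round(slopes, 2))
```

Large, systematic drift in a slope across cutpoints would suggest the PO assumption is strained for that predictor; small fluctuations are expected from sampling noise alone.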

Also, as Frank Harrell points out in Section 13.3.3 of his Regression Modeling Strategies book, a PO model can sometimes work well even if the assumption isn't met. In this answer to a question on highly skewed data that take only a few values with clumping at one end, he says:

"When the dependent variable Y has a beautiful distribution I still recommend it be modeled using a Y-transformation-invariant semiparametric ordinal regression model such as the proportional odds model. With your Y, the need for a semiparametric model is even greater. Semiparametric models handle arbitrary clumping of Y values, bimodality, floor effects, ceiling effects, and outliers. Such models are also very efficient."

The orm() function in Harrell's rms package allows for ordinal regression with link functions other than the logit, and Section 13.4 of his book shows how to implement a "continuation ratio" method that sometimes works better than a PO model. That provides you some flexibility in how to proceed.
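A minimal sketch of an orm() call, assuming the rms package is installed and again using simulated stand-in data with made-up variable names (age10, sex, score). The call below relies on the default logistic family, which corresponds to the proportional odds model; other links are selected through the family argument described in ?orm, which is not exercised here.

```r
library(rms)  # Harrell's package; assumed installed, e.g. via install.packages("rms")

set.seed(42)
n <- 200

# Simulated stand-in data (hypothetical names, not the asker's data)
d <- data.frame(
  age10 = runif(n, 2, 8),                                 # age in decades
  sex   = factor(sample(c("F", "M"), n, replace = TRUE))
)
latent  <- 0.3 * (d$age10 - 5) + 0.5 * (d$sex == "M") + rlogis(n)
d$score <- findInterval(latent, seq(-0.5, 4, by = 0.5))   # integers 0..10

# Ordinal regression with the default logistic link (proportional odds);
# orm() estimates one intercept per observed cutpoint plus the slopes
fit_orm <- orm(score ~ age10 + sex, data = d)
print(fit_orm)
```

Because the outcome is treated as ordinal, any monotone transformation of the score would give the same slope estimates, which is the transformation invariance mentioned in the quotation above.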

With a PO model you can often model, without overfitting, almost as many parameters as you can with linear regression. Section 4.4 of Harrell's book and course notes provides an estimate of the effective sample size that takes the distribution of cases among categories into account. Your sample size of about 200 would be reduced to an effective sample size of about 180 on that basis, so at roughly 15 effective observations per coefficient you should be able to estimate about 12 regression coefficients.
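The arithmetic can be sketched as follows. This assumes the effective sample size for an ordinal outcome is n - (1/n^2) * sum(n_i^3), where n_i is the number of cases in category i, and a rule of thumb of about 15 effective observations per coefficient. The category counts are made up for illustration (chosen to total 200 with heavy clumping at 0); they are not the asker's data.

```r
# Hypothetical counts for scores 0..10 (made up; they sum to 200 and clump at 0)
counts <- c(90, 30, 20, 15, 12, 10, 8, 6, 4, 3, 2)
n <- sum(counts)

# Effective sample size for an ordinal outcome: n - (1/n^2) * sum(n_i^3)
n_eff <- n - sum(counts^3) / n^2

# Rule of thumb assumed here: about 15 effective observations per coefficient
max_coefs <- floor(n_eff / 15)

cat("n =", n, " effective n =", round(n_eff, 1),
    " coefficients ~", max_coefs, "\n")
```

The more evenly the cases are spread across categories, the closer the effective sample size gets to n; a single dominant category pulls it down.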
