Solved – GLM with continuous data piled up at zero

generalized linear model, ordered-logit, regression-strategies, zero inflation

I am trying to estimate how much catastrophic illnesses such as TB and AIDS affect spending on hospitalization. The dependent variable is "per hospitalization cost" and the independent variables are various individual characteristics, almost all of them dummies (gender, head-of-household status, poverty status and, of course, a dummy for whether you have the illness), plus age, age squared and a number of interaction terms.

As is to be expected, there is a significant amount — and I mean a lot — of data piled up at zero (i.e., no expenditure on hospitalization in the 12 month reference period). What would be the best way to deal with data such as these?

As of now I decided to convert the cost into ln(1+cost) so as to include all observations and then run a linear model. Am I on the right track?

Best Answer

As discussed elsewhere on the site, ordinal regression (e.g., proportional odds, proportional hazards, probit) is a flexible and robust approach. Discontinuities are allowed in the distribution of $Y$, including extreme clumping. Nothing is assumed about the distribution of $Y$ for a single $X$. Zero-inflated models make far more assumptions than semi-parametric models. For a full case study, see Chapter 15 of my course handouts at http://hbiostat.org/rms.

One great advantage of ordinal models for continuous $Y$ is that you don't need to know how to transform $Y$ before the analysis.

Related Solutions

Solved – Ordered Probit and categorical variables

Results from an ordered logit/probit regression are always unintuitive, but categorical explanatory variables are as meaningful as continuous ones. I'd even say that they are easier to interpret.

For a concrete example, you could look at Dobson, An Introduction to Generalized Linear Models, 2002, 2nd ed., Chapter 8. In her "car preferences" example, the dependent variable is the importance of air conditioning and power steering (three levels: "no or little importance", "important", "very important") and the two explanatory variables are gender (male or female, coded as 1 and 0) and age (18-23, 24-40, >40, coded as age2440 = 1 or 0, and agegt40 = 1 or 0).

Fitting an ordered probit model you get (I've used R, MASS library, polr() function):

Coefficients:
   male age2440 agegt40 
-0.3467  0.6817  1.3288 

Intercepts:
  NoImp|Imp Imp|VeryImp 
    0.01844     0.97594

Then you can compute the probabilities for women (male = 0) over 40 (age2440 = 0, agegt40 = 1):

NoImp     Imp VeryImp 
0.095   0.267   0.638

and for men over 40 (male = 1):

NoImp     Imp VeryImp 
0.168   0.330   0.502

Their difference (women minus men) is the gender partial effect:

 NoImp     Imp VeryImp 
-0.073  -0.063   0.136

I think that it's meaningful ;-)
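
As a cross-check, the probabilities above can be reproduced by hand from the reported coefficients and intercepts. The following is a minimal Python sketch, not the original R code. It assumes the parameterisation that polr() documents, $P(Y \le j) = \Phi(\zeta_j - \eta)$, where $\zeta_j$ are the intercepts and $\eta$ is the linear predictor. The names COEF, CUTPOINTS and category_probs are made up for illustration.

```python
from statistics import NormalDist

# Coefficients and intercepts (cutpoints) copied from the polr() output above.
COEF = {"male": -0.3467, "age2440": 0.6817, "agegt40": 1.3288}
CUTPOINTS = [0.01844, 0.97594]  # NoImp|Imp, Imp|VeryImp
LEVELS = ["NoImp", "Imp", "VeryImp"]


def category_probs(male, age2440, agegt40):
    """Return P(Y = level) for each level of the ordered probit fit."""
    eta = (COEF["male"] * male
           + COEF["age2440"] * age2440
           + COEF["agegt40"] * agegt40)
    cdf = NormalDist().cdf
    # Assumed parameterisation: P(Y <= j) = Phi(zeta_j - eta).
    cumulative = [cdf(cut - eta) for cut in CUTPOINTS] + [1.0]
    # Category probabilities are differences of adjacent cumulative ones.
    probs = [cumulative[0]]
    for lower, upper in zip(cumulative, cumulative[1:]):
        probs.append(upper - lower)
    return dict(zip(LEVELS, probs))


women_over_40 = category_probs(male=0, age2440=0, agegt40=1)
men_over_40 = category_probs(male=1, age2440=0, agegt40=1)
# Same sign convention as the table above: women minus men.
difference = {k: women_over_40[k] - men_over_40[k] for k in LEVELS}

for label, row in [("women", women_over_40),
                   ("men", men_over_40),
                   ("diff", difference)]:
    print(label, {k: f"{v:.3f}" for k, v in row.items()})
```

Because the cutpoints are fixed and only $\eta$ changes with the covariates, a single coefficient shifts all three category probabilities at once, which is why the effect is easier to read on the probability scale than from the coefficient itself.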

Solved – Fitting a glm to a zero inflated positive continuous response

The Q-Q plot fit (which compares to normal quantiles) is almost irrelevant. In particular it doesn't tell you that you have a bad model.

While the Tweedie can model data with zeros in it, it tends not to work so well for data that are predominantly zero, and I suspect your data probably don't fit the distribution well (it's just that you can't really tell that's the case from the Q-Q plot).

I'd be inclined to move toward a model that explicitly deals with the zeros, a zero-inflated model or a hurdle model.
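
To illustrate the idea behind a hurdle (two-part) model, here is a minimal Python sketch on made-up toy data, not the data from the question. It assumes a single binary covariate. In that saturated case both parts have closed-form estimates: the probability of a positive value is the group proportion, and the mean of the positive values is the group mean of the positives. A real analysis would fit a binary regression and a positive-outcome regression with proper software. The names records and two_part_fit are hypothetical.

```python
from statistics import mean

# Hypothetical toy data: (group indicator, non-negative continuous response).
records = [
    (0, 0.0), (0, 0.0), (0, 0.0), (0, 120.0), (0, 0.0), (0, 80.0),
    (1, 0.0), (1, 900.0), (1, 450.0), (1, 0.0), (1, 1500.0), (1, 650.0),
]


def two_part_fit(records):
    """Estimate both parts of a two-part model separately for each group."""
    fit = {}
    for group in sorted({x for x, _ in records}):
        values = [y for x, y in records if x == group]
        positives = [y for y in values if y > 0]
        # Part 1: probability of clearing the hurdle (a non-zero response).
        p_positive = len(positives) / len(values)
        # Part 2: mean response among the positive observations only.
        mean_if_positive = mean(positives) if positives else 0.0
        fit[group] = {
            "p_positive": p_positive,
            "mean_if_positive": mean_if_positive,
            # Overall mean: E[Y] = P(Y > 0) * E[Y | Y > 0].
            "expected": p_positive * mean_if_positive,
        }
    return fit


fit = two_part_fit(records)
for group, parts in fit.items():
    print(group, {k: round(v, 3) for k, v in parts.items()})
```

Keeping the two parts separate shows whether a covariate acts mainly on the chance of a non-zero value, on the size of the positive values, or on both.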
