GLM with continuous data piled up at zero

generalized linear model, ordered-logit, regression-strategies, zero inflation

I am trying to fit a model to estimate how much catastrophic illnesses such as TB, AIDS, etc. affect spending on hospitalization. The dependent variable is per-hospitalization cost, and the independent variables are various individual markers, almost all of them dummies (gender, head-of-household status, poverty status, and of course a dummy for whether you have the illness), plus age, age squared, and a bunch of interaction terms.

As is to be expected, a very large share of the data is piled up at zero (i.e., no expenditure on hospitalization in the 12-month reference period). What would be the best way to deal with data such as these?

For now, I have decided to transform the cost to ln(1 + cost) so as to include all observations, and then fit a linear model. Am I on the right track?

Best Answer

As discussed elsewhere on the site, ordinal regression (e.g., proportional odds, proportional hazards, probit) is a flexible and robust approach. Discontinuities are allowed in the distribution of $Y$, including extreme clumping. Nothing is assumed about the distribution of $Y$ at any single value of $X$. Zero-inflated models make far more assumptions than semi-parametric models. For a full case study, see Chapter 15 of my course handouts at http://hbiostat.org/rms .
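
For concreteness, the proportional odds member of this family is commonly written as below. This is the standard textbook form rather than something spelled out in the answer, and the symbols $y_j$, $\alpha_j$, $\beta$ and $k$ are notation introduced here for illustration only.

$$
\Pr(Y \ge y_j \mid X) = \frac{1}{1 + \exp\{-(\alpha_j + X\beta)\}}, \qquad j = 2, \dots, k,
$$

where $y_1 < y_2 < \dots < y_k$ are the distinct observed values of $Y$. With hospitalization costs, $y_1 = 0$, so the first intercept $\alpha_2$ governs $\Pr(Y > 0 \mid X)$. Because there is one free intercept per distinct value above the smallest, a large clump at zero is handled by the intercepts and no particular shape is imposed on the distribution of $Y$.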

One great advantage of ordinal models for continuous $Y$ is that you don't need to know how to transform $Y$ before the analysis.
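
The reason behind this can be sketched in a few lines: an ordinal model uses only the ordering of the $Y$ values, and any strictly increasing transformation leaves that ordering unchanged. The snippet below is a minimal illustration using made-up cost values (not data from the question) and a hypothetical helper `ranks`; it checks that the raw costs, ln(1 + cost), and the square root of cost all share the same ranks, ties at zero included.

```python
import math


def ranks(values):
    """Return 1-based midranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = midrank
        i = j + 1
    return result


# Made-up hospitalization costs with a clump at zero (illustrative only).
costs = [0.0, 0.0, 0.0, 120.0, 0.0, 4500.0, 35.0, 0.0, 980.0]

log_costs = [math.log1p(c) for c in costs]   # ln(1 + cost)
sqrt_costs = [math.sqrt(c) for c in costs]   # another monotone transform

# The ordering information an ordinal model relies on is identical in all three.
same_ranks = ranks(costs) == ranks(log_costs) == ranks(sqrt_costs)
print(same_ranks)
```

Since the ranks agree, an ordinal fit would use the same information whichever of these scales the cost is recorded on, which is why the choice of transformation does not have to be settled beforehand.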