Solved – Fitting a glm to a zero inflated positive continuous response

I'm trying to fit R glm's to data sets where the response is zero inflated positive continuous. This is an example data set

df <- structure(list(y=c(0,0,0,0,0,0,0,0,4.09417142857143e-05,0,0,0,0,2.18718787234043e-05,0,0,0,0,0,0,0,0,0,0,0,0,0,4.60192985143918e-05,0,1.16500081901182e-05,0,1.13497040795196e-05,0,2.01959645093946e-05,0,0,2.24704801120448e-05,3.50195529889299e-05,2.26178388643899e-06,0,0,0,0,0,0,1.24748435005701e-05,0,0,0,1.14008923269261e-06,0,0,0,0,0,0,0,1.95445414576182e-05,0,0,0,0,0,0,0,0,1.72293987591653e-05,1.11122316638726e-05,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),behavior = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,4),time = c(272,148,132,854,234,340,90,282,840,230,118,620,130,1410,1272,684,6624,1220,250,4556,4500,2254,662,334,336,1364,346,4308,372,7448,1242,9658,926,6706,494,1996,3570,13550,62134,6876,3108,4722,794,2394,21502,21048,5442,2748,920,25596,1954,7896,996,12108,2046,23034,3216,7574,9882,5560,5610,4524,6846,3450,11630,3528,3546,9544,4820,4852,13508,3726,3744,1252,1254,2514,3786,2534,14080,3882,6510,10520,72928,10052)),.Names = c("y","behavior","time"),row.names = c(NA, 84),class = "data.frame")

set behavior to be a factor

df$behavior = as.factor(df$behavior)

The response is df$y, df$behavior and df$time are the fixed effects (categorical and continuous, respectively).

A glimpse at the distribution of the response:
enter image description here

I tried using the R tweedie package with glm2 (and base glm) with this code:

tweedie.fit <- tweedie.profile(y ~ behavior + time, do.plot=TRUE, data = df)
model.fit <- glm2(y ~ behavior + time, family=tweedie(var.power=out$xi.max, link.power=1-out$xi.max), data = df)

which either does not converge (in case of other data sets) or shows poor qq-plot fit: enter image description here

Is there any hope to reliably fit a glm to such data? If so what am I doing wrong?

I also tried using the R gamlss package with a log normal family where I add .Machine$double.eps to df$y:

df$y <- df$y + .Machine$double.eps
model.fit <- gamlss(y ~ behavior + time, data = df, family = LOGNO())

but it doesn't seem to do much better:
enter image description here

Help would be greatly appreciated.

Best Answer

The Q-Q plot fit (which compares to normal quantiles) is almost irrelevant. In particular it doesn't tell you that you have a bad model.

While the Tweedie can model data with zeros in it, it tends not to work so well for data that are predominantly zero, and I suspect you probably don't fit the distribution well (it's just that you can't really tell that's the case from the QQ plot)

I'd be inclined to move toward a model that explicitly deals with the zeros, a zero-inflated model or a hurdle model.

Best Answer

Related Solutions

Solved – Approaches to regression with zero inflated response

Solved – Regression Trees / Boosted Regression Trees for Tweedie Distribution in R

Related Question