Generalized Linear Model – Pros and Cons of Tweedie GLM for Non-Negative Data with Zeros

generalized linear model, tweedie-distribution, zero inflation

I analyze technical measurement data with the aim of developing a forecasting model.

The data are given as a non-negative hourly time series. They look quite wild and contain many zeros. I expect these zeros to come from quantities that are not truly zero but too small to register. It is OK to treat them as zero.

[Figure: plot of the data]

Simply transforming these data (e.g. with Box-Cox) does not seem right when there are exact zeros.
So I first thought about a classical GLM. But these do not allow for a point mass at zero together with a continuous distribution above zero.

Then I stumbled upon Tweedie GLMs, e.g. here.
This must be quite a standard problem when predicting measurements. What are the pros and cons of working with a Tweedie model?

PS: The number of zeros decreases with the years … I have to ask the data collector why this could be …

Best Answer

You're right to think that a Box-Cox transformation won't deal with the zeros issue (nor indeed would any other transformation).

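A Tweedie model might be suitable, and is sometimes used for data like these*, but the probability of a zero is tied to $p$ (the power in the variance function, $\operatorname{Var}(Y) = \phi\mu^p$), so it cannot be chosen independently of the rest of the distribution. A point mass at zero exists only for $1 < p < 2$.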

\* Another issue to consider: your data are observed over time, so you must also consider the possibility of time-dependence (such as autocorrelation).
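To make the link between the zero probability and $p$ concrete, here is a minimal sketch. It assumes the usual compound Poisson-gamma representation of the Tweedie for $1 < p < 2$, with mean $\mu$ and dispersion $\phi$, under which $P(Y = 0) = \exp\!\left(-\mu^{2-p} / (\phi(2-p))\right)$. The function names and the parameter values are made up for illustration and have nothing to do with your data.

```python
import math
import random


def tweedie_zero_prob(mu, phi, p):
    """P(Y = 0) for a Tweedie variable with mean mu, dispersion phi, power 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("a point mass at zero exists only for 1 < p < 2")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))  # Poisson rate of the compound sum
    return math.exp(-lam)


def _poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_tweedie(mu, phi, p, n, seed=0):
    """Draw n values as a Poisson number of gamma-distributed jumps."""
    rng = random.Random(seed)
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    shape = (2.0 - p) / (p - 1.0)              # gamma shape of each jump
    scale = phi * (p - 1.0) * mu ** (p - 1.0)  # gamma scale of each jump
    return [
        sum(rng.gammavariate(shape, scale) for _ in range(_poisson(lam, rng)))
        for _ in range(n)
    ]


# Illustrative values only: mu = 1, phi = 1, p = 1.5 gives P(Y = 0) = exp(-2).
draws = simulate_tweedie(mu=1.0, phi=1.0, p=1.5, n=20000)
zero_frac = sum(y == 0 for y in draws) / len(draws)
print(tweedie_zero_prob(1.0, 1.0, 1.5), zero_frac)
```

Once $\mu$, $\phi$ and $p$ are fixed to match the positive part of the data, the proportion of zeros follows from them. There is no separate parameter left to adjust it.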

A more common solution to the zeros would be a zero-inflated or hurdle model, such as a zero-inflated gamma. There are numerous questions on site on "zero-inflated"/"0-inflated" models and hurdle models.
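As a rough illustration of the two-part idea, the sketch below fits an intercept-only version: the proportion of zeros is estimated on its own, and a gamma distribution is matched to the positive values by the method of moments (a simplification; a real fit would normally use maximum likelihood and covariates). The function name `fit_hurdle_gamma` and the toy values are hypothetical.

```python
from statistics import fmean, pvariance


def fit_hurdle_gamma(ys):
    """Intercept-only two-part fit: P(zero) plus a moment-matched gamma for y > 0."""
    positives = [y for y in ys if y > 0]
    if len(positives) < 2:
        raise ValueError("need at least two positive observations")
    p_zero = 1.0 - len(positives) / len(ys)  # zero part, estimated separately
    m, v = fmean(positives), pvariance(positives)
    shape = m * m / v  # gamma shape from the method of moments
    scale = v / m      # gamma scale from the method of moments
    return {
        "p_zero": p_zero,
        "shape": shape,
        "scale": scale,
        "mean": (1.0 - p_zero) * shape * scale,  # overall expected value
    }


toy = [0, 0, 0, 1.2, 0.4, 2.5, 0, 0.9, 3.1, 0]  # made-up values, not the asker's data
fit = fit_hurdle_gamma(toy)
print(fit)
```

Here the zero probability is a free parameter of its own, unlike in the Tweedie where it is determined by the mean, the dispersion and $p$.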

However, if your thought was correct and the zeros are only values "too small to register", that would indicate censoring (the true value lies below some detection limit) rather than genuine zeros.
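For reference, a censored model treats each recorded zero as "somewhere below the limit" rather than as an exact value. As a sketch, assume a hypothetical detection limit $c$ and a latent measurement with density $f$ and distribution function $F$, both depending on parameters $\theta$ (none of these symbols come from the question). The likelihood would then take the form

$$
L(\theta) \;=\; \prod_{i:\, y_i > c} f(y_i;\theta) \;\times\; \prod_{i:\, y_i = 0} F(c;\theta).
$$

Under this model no positive values below $c$ should ever be recorded.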

Looking at the plot though, I have some doubts that it's an adequate explanation for what we see:

[Figure: the data plot annotated with horizontal grey lines and three points circled in red]

Between the lower two grey lines there are only three points, but a large number of points lie either exactly on the line or very close to it. That big gap would be consistent with your thought, but those three points (circled in red) do not seem consistent with it: if those points can register, why not others?

However, such a banding feature can sometimes be seen in Tweedie distributions as well; the tricky part would be whether it's even possible to get the right mix of parameters to match both the proportion of zeros and the banding at lower values.

[Figure: example Tweedie distributions, each showing a spike at zero alongside the continuous part]

(Beware interpreting those plots; the spikes at zero are not density but probability, and strictly speaking should not be represented on the same plot as the continuous part. You can draw a cdf but it's less clear what's going on.)

However, even more seriously perhaps, the Tweedie definitely cannot reproduce the clumping behaviour at the top end of the plot (for that matter neither can any of the other models I've mentioned).
