Solved – Tweedie p parameter Interpretation

frequency-severityinterpretationtweedie-distribution

From Wikipedia (http://en.wikipedia.org/wiki/Tweedie_distribution) we know that

The Tweedie distributions include a number of familiar distributions as well as some unusual ones, each being specified by the domain of the index parameter. We have the
normal distribution, p = 0,
Poisson distribution, p = 1,
compound Poisson–gamma distribution, 1 < p < 2,
gamma distribution, p = 2,
positive stable distributions, 2 < p < 3,
inverse Gaussian distribution, p = 3,
positive stable distributions, p > 3, and
extreme stable distributions, p = ∞.

For 0 < p < 1 no Tweedie model exists.

My question is can we determine significance from the optimal $p$ if it is between 1-2. For instance, if we say that Poisson controls the frequency in which our $Y$ appears and Gamma controls the size of our $Y$ if it appears and if the log-likelihood of the model has an optimal value for $p$ of 1.1 then is it fair to say that the data is driven more by frequency than size and vice versa?

so for example, if you run this code:

set.seed(1)
x <- rep(1:100,times=10)
freq <- sapply(x,function(arg) rpois(1,.25*arg))
size <- sapply(x,function(arg) rgamma(1,100*arg))
y <- freq*size
df <- data.frame(cbind(x,y))

tweediep <- tweedie.profile(y~x, link.power = 0, data = df, 
    do.smooth = FALSE, do.plot = TRUE)
tweediep$xi.max

I get the result 1.3. Does that mean that this Y is more "driven by" frequency than severity? Does it mean something else altogether? Is there nothing I can discern from this estimated value of the parameter $p$ ?

Edit: This is not a complete answer so I'm posting it as an edit and clarification rather than an answer to my own question.

I have attempted to anecdotally answer my own question via a limited experiment. Specifically, I want to create dummy data and a "frequency driven" and "severity driven" model from this data and observe the results to better understand the $p$ parameter.

set.seed(1)

#creating random variables
x1<-rep(1:5,20)
x2<-(runif(100))*10
x3<-rpois(100,5)
x4<-rweibull(100,2,5)

#creating parameters based off a linear combination of variables plus a noise term - all exponentiated
fvars<-exp((x1+2*x2+.5*x3-x4+3*runif(1))/10)/5
svars<-exp((x1+2*x2+.5*x3-x4+3*runif(1))/5)
summary(fvars)
summary(svars)

#freq is the number of occurances, sev is the size of the occurances
freq<-rpois(100,fvars)
sev<-rgamma(100,scale=svars,shape=1)

summary(freq)
summary(sev)

results1<-freq*sev
summary(results1)

#load library if needed
library(tweedie)

#store all the data in a single data frame
df1<-data.frame(cbind(x1,x2,x3,x4,results1))

This is my "base model". I now find the optimal parameter $p$.

tweediep<-tweedie.profile(formula = results1~x1+x2+x3+x4,link.power = 0,do.smooth = TRUE,do.plot = TRUE)
tweediep$xi.max

I get a result of 1.616327 . I went ahead and found the results model coefficients. Strictly for the purposes of answering my question, I don't believe actually fitting the glm is necessary but perhaps the diagnostics of the glm will lead to more insight.

baseGLM<-glm(results1~x1+x2+x3+x4,data=df1,family=tweedie(var.power = tweediep$xi.max,link.power = 0))
summary(baseGLM)

It's possible that the model isn't "fit well enough" for this type of analysis to say anything meaningful, but to continue this line of reasoning, let's try fitting a "frequency driven model".

#freq driven

freq<-rpois(100,fvars*3)
sev<-rgamma(100,scale=svars/4,shape=1)

summary(freq)
summary(sev)

results2<-freq*sev
summary(results2)

df1<-data.frame(cbind(df1,results2))

tweediep2<-tweedie.profile(formula = results2~x1+x2+x3+x4,link.power = 0,do.smooth = TRUE,do.plot = TRUE)
tweediep2$xi.max

here our $p$ parameter value is 1.718367 which is higher than our base value. Again, we fit the model.

GLMfreqDriven<-glm(results2~x1+x2+x3+x4,data=df1,family=tweedie(var.power = tweediep2$xi.max,link.power = 0))
summary(GLMfreqDriven)

Finally, let's make our model more severity driven.

#severity driven

freq<-rpois(100,fvars*.4)
sev<-rgamma(100,scale=svars*5,shape=1)

summary(freq)
summary(sev)

results3<-freq*sev
summary(results3)

df1<-data.frame(cbind(df1,results3))

tweediep3<-tweedie.profile(formula = results3~x1+x2+x3+x4,link.power = 0,do.smooth = TRUE,do.plot = TRUE)
tweediep3$xi.max

Here our $p$ parameter is 1.591937.

This is the opposite of my initial hypothesis. I originally thought that the more frequency driven a model is, the closer the p parameter would be to 1 (since 1 is a true Poisson model), and the more severity driven the model is, closer the p parameter is to 2. In my, admittedly limited, experiment the opposite was true. The Poisson distribution does allow zeros as a response variable. Perhaps the p parameter has nothing to do with "frequency driven" or "severity driven", but is strictly a function of the number of zeros in the data (for p parameter values between 1 and 2).

One possible limitation I thought of with respect to my experiment design is that would be interesting to see how the results change if I force the results1, results2, and results3 variables to all have the same mean. The data here was fit arbitrarily. Several of instances where I am dividing or multiplying by a constant, I am doing so to get the data in such a way that the algorithm converges without error. I would be more confident if the results (1.61, 1.72, 1.59) were more spread out.

In conclusion, I can see possible explanations for results of my limited experiment, but would really be appreciative if someone could further validate and explain how the $p$ parameter is affected by the data. Any discussion as to the differences between creating separate frequency and severity models versus creating compound Poisson-gamma models may also prove to be very insightful.

Thank you

Best Answer

The Generalized Linear Models with Examples in R book by Peter Dunn and Gordon Smyth contains an illuminating discussion of Tweedie distributions.

If I maybe so blunt, and summarize your excellent question:

What is the relation of $p$ and the underlying Poisson-Gamma model for a Tweedie distribution with $1 < p < 2$?

As you already note in your question, the Tweedie distribution with $1 < p < 2$ can be understood as a Poisson-Gamma model. To make it more concrete what this means, let's assume that

$$ N \sim \text{Pois}(\lambda^{*}) $$ and $$ z_i \sim \text{Gam}(\mu^{*}, \phi^{*}) $$ the observed $y$ is $$ y = \sum_{i = 1}^{N}{z_i}. $$ Dunn & Smyth give an example for this model, where $N$ is the number of insurance claims and $z_i$ is the average cost for each claim. In that case the model would describe the total insurance payout.

The relation of $p$ to the parameters of the Poisson and Gamma distribution is

\begin{equation} \begin{aligned} \lambda^{*} &= \frac{\mu^{2-p}}{\phi (2-p)} \\ \mu^{*} &= (2 - p)\phi\mu^{p - 1} \\ \phi^{*} &= (2 - p)(p - 1) \phi^2\mu^{2(p - 1)}, \end{aligned} \end{equation} where $\mu$ and $\phi$ are the mean and overdispersion parameters from the generalized linear model definition.

Related Question