Solved – Transformation to fit gamma distribution for glm

data transformation, gamma distribution, generalized linear model, r

The data simulated below has a maximum value of 4 and is interestingly skewed. The maximum of 4 is a limitation imposed by the instrument used, and the data is semi-discrete, i.e., there is a reasonably large but finite number of values it can take between -4 and 4. Because of the shape of the data, I thought about transforming it so it would approximate a gamma distribution:

Edit to update for comments:
It is limited to this range in this instance because it is a signal detection measure (d prime, http://en.wikipedia.org/wiki/D%27), and the accuracy we have for this particular measure limits us to +-4. It is skewed like this because one population does not often get false positives and will generally get more hits, while the other populations often do get false positives and fewer hits.
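To make the +-4 cap concrete, here is a quick back-of-the-envelope check, assuming the same clamping of extreme proportions used in Edit 3 below (a perfect hit rate over 16 trials clamped to 31/32, a zero false-alarm rate over 32 trials clamped to 1/64):

#Largest d' obtainable under that clamping convention
qnorm(31/32) - qnorm(1/64) #approximately 4, matching the +-4 limit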

set.seed(69)
g1<-rnorm(700,0,1); g2<-rnorm(100,-0.5,1.5); g3<-rnorm(100,-1,2.5)
gt<-data.frame(score=c(g1, g2, g3), fac1=factor(rep(c("a", "b", "c"), c(700, 100, 100))), fac2=ordered(rep(c(0,1,2), c(3,13,4))))
gt$score<-with(gt, ifelse(fac2 == 0, score, score-rnorm(1, 0.5, 2)))
gt$score<-with(gt, ifelse(fac2 == 2, score-rnorm(1, 0.5, 2), score))
gt$score<-round(with(gt, ifelse(score>0, score*-1, score)), 1)+4
gt$score<-with(gt, ifelse(score < -4, -4, score))
gt$cov1<-with(gt, score + rnorm(900, sd=40))/40
hist(gt$score)
gt$score2<-with(gt, 4-score+0.0000001) #Gamma distribution can't have 0s (and is positive skewed???)
hist(gt$score2)

glm1<-glm(score2~cov1+fac1*fac2, family="Gamma", data=gt)

This is quite new territory for me.
1. Is this a reasonable thing to do?
2. Are there other distributions I might try and compare (exponential perhaps)?

Update:
After some comments below, I investigated beta regression using the betareg package in R. It gave me skewed residuals:

library(betareg)
gt$scorer<-with(gt, (score - (-4))/(4 - (-4))) #rescale from [-4, 4] to [0, 1]
gt$scorer<-with(gt, (scorer*(length(scorer)-1)+0.5)/length(scorer)) #squeeze away exact 0s and 1s
b1 <- betareg(scorer ~ cov1 + fac1 * fac2, data=gt)
plot(density(resid(b1))) #Strange residuals, even straight lm looks better

So I had a look at a quasibinomial regression and it gave me smaller and better looking residuals:

glm2 <- glm(scorer~cov1 + fac1 * fac2, data=gt, family="quasibinomial")
plot(density(resid(glm2))) #Better residuals

Are the residuals good enough to go on in this case?
Or is it a serious issue that d', while derived from true/false responses, is not itself a binary variable?

Edit 3: d' clarification
Below is an example of my d' scores, with roughly the right distributional qualities and raw hit and false-positive scores similar to mine.

hitrate<-sample(0:16, 100, replace=T, prob=c(rep(0.02,11), 0.025, 0.05, 0.1, 0.2, 0.3, 0.2))/16
hitrate<-ifelse(hitrate==1, 31/32,hitrate); hitrate<-ifelse(hitrate==0, 1/32,hitrate)
farate<-sample(0:32,100, replace=T, prob=c(0.7,0.1,0.05,0.05,0.05,0.02,rep(0.001, 27)))/32
farate<-ifelse(farate==0, 1/64,farate); farate<-ifelse(farate==1, 63/64,farate)

dprime<-round(qnorm(hitrate) - qnorm(farate),1)
plot(density(dprime))

Best Answer

A gamma distribution definitely doesn't make sense for your data. The gamma has support on the positive half of the real line only, so it is unbounded above, and it is always skewed to the right. The example data you provide in your code would be horrible data to try to fit a gamma to.
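As a quick sanity check on the simulated gt data frame from the question, the raw scores are incompatible with a gamma in two ways: they extend below zero and they are skewed to the left, while a gamma variate is strictly positive and skewed to the right. A minimal sketch:

range(gt$score) #the lower end falls below zero, so a gamma cannot be fit to the raw scores
mean((gt$score - mean(gt$score))^3)/sd(gt$score)^3 #sample skewness; negative here, i.e. left-skewed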

It would definitely be nicer to know more about the data-generating process, but one thing that comes to mind is that you could scale and shift the data so they are constrained between 0 and 1 and then attempt to model them with a beta distribution. Once again, it would be better to know more about your data, but the beta is one of the few well-known parametric distributions that is bounded both below and above.
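A minimal sketch of that idea, assuming the theoretical bounds are -4 and 4: rescale the simulated scores to (0, 1) (mirroring the rescaling in the question's update) and fit a plain two-parameter beta by maximum likelihood with MASS::fitdistr (the starting values are just a rough guess):

library(MASS)
p <- (gt$score + 4)/8 #shift/scale from [-4, 4] to [0, 1]
p <- (p*(length(p) - 1) + 0.5)/length(p) #squeeze away exact 0s and 1s
fitdistr(p, "beta", start=list(shape1=2, shape2=2))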

However, it seems you want to do some sort of regression. Have you tried fitting the regression assuming a normal error term and examining the residuals? A lot of people assume that the data themselves need to be normally distributed for a linear regression to work, but the assumption is really placed on the error term; depending on the values your covariates take, the marginal distribution of the response can be skewed even when the errors are well behaved.
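For example, a minimal sketch of that check on the simulated data from the question, using the same model formula as the gamma fit:

lm1 <- lm(score ~ cov1 + fac1*fac2, data=gt)
plot(density(resid(lm1))) #look at the distribution of the residuals, not the raw response
qqnorm(resid(lm1)); qqline(resid(lm1)) #normal Q-Q plot of the residuals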