Solved – Transformation to fit gamma distribution for glm

data transformation, gamma distribution, generalized linear model, r

The data simulated below has a maximum value of 4 and is interestingly skewed. The maximum of 4 is a limitation imposed by the instrument used, and the data is semi-discrete, i.e., there is a reasonably large but finite number of values it can take between -4 and 4. Because of the shape of the data, I thought about transforming it so it would approximate a gamma distribution:

Edit to update for comments:
It is limited to this range in this instance because it is a signal detection measure (d prime, http://en.wikipedia.org/wiki/D%27), and the accuracy we have for this particular measure limits us to +-4. It is skewed like this because one population does not often get false positives and will generally get more hits, while the other populations often do get false positives and fewer hits.
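To make the +-4 cap concrete, here is a quick back-of-the-envelope check, assuming the same clamping of extreme proportions used in Edit 3 below (a perfect hit rate over 16 trials clamped to 31/32, a zero false-alarm rate over 32 trials clamped to 1/64):

#Largest d' obtainable under that clamping convention
qnorm(31/32) - qnorm(1/64) #approximately 4, matching the +-4 limit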

set.seed(69)
g1<-rnorm(700,0,1); g2<-rnorm(100,-0.5,1.5); g3<-rnorm(100,-1,2.5)
gt<-data.frame(score=c(g1, g2, g3), fac1=factor(rep(c("a", "b", "c"), c(700, 100, 100))), fac2=ordered(rep(c(0,1,2), c(3,13,4))))
gt$score<-with(gt, ifelse(fac2 == 0, score, score-rnorm(1, 0.5, 2)))
gt$score<-with(gt, ifelse(fac2 == 2, score-rnorm(1, 0.5, 2), score))
gt$score<-round(with(gt, ifelse(score>0, score*-1, score)), 1)+4
gt$score<-with(gt, ifelse(score < -4, -4, score))
gt$cov1<-with(gt, score + rnorm(900, sd=40))/40
hist(gt$score)
gt$score2<-with(gt, 4-score+0.0000001) #Gamma distribution can't have 0s (and is positive skewed???)
hist(gt$score2)

glm1<-glm(score2~cov1+fac1*fac2, family="Gamma", data=gt)

This is quite new territory for me.
1. Is this a reasonable thing to do?
2. Are there other distributions I might try and compare (exponential perhaps)?

Update:
After some comments below, I investigated beta regression using the betareg package in R. It gave me skewed residuals:

library(betareg)
gt$scorer<-with(gt, (score - (-4))/(4 - (-4))) #rescale from [-4, 4] to [0, 1]
gt$scorer<-with(gt, (scorer*(length(scorer)-1)+0.5)/length(scorer)) #squeeze away exact 0s and 1s
b1 <- betareg(scorer ~ cov1 + fac1 * fac2, data=gt)
plot(density(resid(b1))) #Strange residuals, even straight lm looks better

So I had a look at a quasibinomial regression and it gave me smaller and better looking residuals:

glm2 <- glm(scorer~cov1 + fac1 * fac2, data=gt, family="quasibinomial")
plot(density(resid(glm2))) #Better residuals

Are the residuals good enough to go on in this case?
Or is it a serious issue that d', while derived from true/false responses, is not itself a binary variable?

Edit 3: d' clarification
Below is an example of my d' scores, with roughly the right distributional qualities and raw hit and false-positive scores similar to mine.

hitrate<-sample(0:16, 100, replace=T, prob=c(rep(0.02,11), 0.025, 0.05, 0.1, 0.2, 0.3, 0.2))/16
hitrate<-ifelse(hitrate==1, 31/32,hitrate); hitrate<-ifelse(hitrate==0, 1/32,hitrate)
farate<-sample(0:32,100, replace=T, prob=c(0.7,0.1,0.05,0.05,0.05,0.02,rep(0.001, 27)))/32
farate<-ifelse(farate==0, 1/64,farate); farate<-ifelse(farate==1, 63/64,farate)

dprime<-round(qnorm(hitrate) - qnorm(farate),1)
plot(density(dprime))

Best Answer

A gamma distribution definitely doesn't make sense for your data. The gamma has support on the positive half of the real line only, so it is unbounded above, and it is always skewed to the right. The example data you provide in your code would be horrible data to try to fit a gamma to.
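As a quick sanity check on the simulated gt data frame from the question, the raw scores are incompatible with a gamma in two ways: they extend below zero and they are skewed to the left, while a gamma variate is strictly positive and skewed to the right. A minimal sketch:

range(gt$score) #the lower end falls below zero, so a gamma cannot be fit to the raw scores
mean((gt$score - mean(gt$score))^3)/sd(gt$score)^3 #sample skewness; negative here, i.e. left-skewed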

It would definitely be nicer to know more about the data-generating process, but one thing that comes to mind is that you could scale and shift the data so they are constrained between 0 and 1 and then attempt to model them with a beta distribution. Once again, it would be better to know more about your data, but the beta is one of the few well-known parametric distributions that is bounded both below and above.
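A minimal sketch of that idea, assuming the theoretical bounds are -4 and 4: rescale the simulated scores to (0, 1) (mirroring the rescaling in the question's update) and fit a plain two-parameter beta by maximum likelihood with MASS::fitdistr (the starting values are just a rough guess):

library(MASS)
p <- (gt$score + 4)/8 #shift/scale from [-4, 4] to [0, 1]
p <- (p*(length(p) - 1) + 0.5)/length(p) #squeeze away exact 0s and 1s
fitdistr(p, "beta", start=list(shape1=2, shape2=2))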

However, it seems you want to do some sort of regression. Have you tried fitting the regression assuming a normal error term and examining the residuals? A lot of people assume that the data themselves need to be normally distributed for a linear regression to work, but the assumption is really placed on the error term; depending on the values your covariates take, the marginal distribution of the response can be skewed even when the errors are well behaved.
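For example, a minimal sketch of that check on the simulated data from the question, using the same model formula as the gamma fit:

lm1 <- lm(score ~ cov1 + fac1*fac2, data=gt)
plot(density(resid(lm1))) #look at the distribution of the residuals, not the raw response
qqnorm(resid(lm1)); qqline(resid(lm1)) #normal Q-Q plot of the residuals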