Proportion data – beta distribution v. GLM with binomial distribution and logit link

beta distribution, generalized linear model, proportion

I have a fisheries dataset for which I have calculated a value for each grid cell on a map. The value is the proportion of the total fishing sets that fall in that cell for each month/year. So, I have values between 0 and 1, but not including 0 and 1 (the range is actually very skewed: 0.0005347594 to 0.1933216169). I am interested in whether the proportion of fishing sets is higher close to a specific location over time.

I have read that there are two ways to do this – either a GLM with a binomial family and logit link, or a beta regression.

I have tried both of these methods in R:

Binomial GLM:

m1 <- glm(PercentTotalSets ~ factor(SetYear) + DayLength + DistTZCF + DistNWHI, 
          family = binomial(link='logit'), data = Totals_CellId) 

Beta:

BetaGLM <- betareg(PercentTotalSets ~ factor(SetYear) + DayLength + DistTZCF + DistNWHI, 
                   data = Totals_CellId ) 

With the binomial GLM, I get very different results than I would if I ran a GLM with a gamma distribution (e.g., DistNWHI is not significant with a p-value of .9 whereas before it was significant). With the beta regression, I get very similar results to a GLM with a gamma distribution (e.g., DistNWHI is significant with similar p-value).

I think that the beta regression is the correct method, because I do not have 0s or 1s and I need to set bounds, but I am not sure if this is correct.

I'd appreciate any and all advice.

Best Answer

With count data of that form, I'd actually fit a multinomial model (at least to start with*), because the numerators of several proportions all contribute to the same denominator: each fishing set (each '+1' in a count) could have gone into any of the $k$ cells.

(e.g. see here)

You'll need the denominator you divided by; the model is still for the proportion, but the variability depends on the denominator you used to obtain the proportion.
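To illustrate why the denominator matters, here is a minimal sketch on simulated data. It uses a binomial GLM as a simpler stand-in for the multinomial model, and the column names `Sets`, `TotalSets`, `Dist` and `Prop` are hypothetical, not taken from the original dataset. Supplying the denominator (as `weights`, or through `cbind(successes, failures)`) gives the same fit either way, while omitting it treats each proportion as if it came from a single trial.

```r
# Simulated stand-in data; all names here are hypothetical.
set.seed(1)
n_obs     <- 200
Dist      <- runif(n_obs, 0, 10)                      # distance covariate
TotalSets <- sample(50:500, n_obs, replace = TRUE)    # denominator per row
Sets      <- rbinom(n_obs, size = TotalSets,
                    prob = plogis(-3 - 0.2 * Dist))   # numerator per row
dat <- data.frame(Sets, TotalSets, Dist, Prop = Sets / TotalSets)

# Proportion as response, denominator passed as prior weights
fit_w <- glm(Prop ~ Dist, family = binomial(link = "logit"),
             weights = TotalSets, data = dat)

# Equivalent specification: successes and failures
fit_c <- glm(cbind(Sets, TotalSets - Sets) ~ Dist,
             family = binomial(link = "logit"), data = dat)

# Proportion only, no denominator (R warns about non-integer successes)
fit_nw <- suppressWarnings(
  glm(Prop ~ Dist, family = binomial(link = "logit"), data = dat)
)

# Helper to extract standard errors
se <- function(fit) sqrt(diag(vcov(fit)))
rbind(weighted = se(fit_w), cbind_form = se(fit_c), unweighted = se(fit_nw))
```

Without the denominator the standard errors are much larger, which may be part of why a covariate such as `DistNWHI` can look non-significant in an unweighted binomial GLM.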

* A particular concern is that you'll have dependence over both space and time (e.g. adjacent locations and adjacent times will tend to be more related than more distant locations or times, at least if there's unmodelled variation that would be accounted for by such effects).

Once you have fitted a multinomial model, you would want to assess whether you have both the variance and the correlation modelled reasonably well -- you might need mixed models (GLMM) and possibly also to account for potential remaining overdispersion in addition.
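One rough first check for remaining overdispersion is the Pearson statistic divided by the residual degrees of freedom. The sketch below uses simulated counts and a Poisson GLM purely for illustration; the names `Sets` and `Dist` are hypothetical, and the check does not address spatial or temporal correlation, which would need a mixed model or similar.

```r
# Simulated overdispersed counts; names are hypothetical.
set.seed(2)
n_obs <- 300
Dist  <- runif(n_obs, 0, 10)
mu    <- exp(2 - 0.1 * Dist)
Sets  <- rnbinom(n_obs, mu = mu, size = 2)   # variance = mu + mu^2 / 2 > mu

fit_pois <- glm(Sets ~ Dist, family = poisson)

# Pearson dispersion estimate: values well above 1 suggest overdispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
dispersion

# A quasi-Poisson fit keeps the same coefficients but scales the
# standard errors by the square root of the estimated dispersion
fit_quasi <- glm(Sets ~ Dist, family = quasipoisson)
se_pois   <- sqrt(diag(vcov(fit_pois)))
se_quasi  <- sqrt(diag(vcov(fit_quasi)))
rbind(poisson = se_pois, quasipoisson = se_quasi)
```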

You will find a number of discussions of multinomial models here on CV.


Another possibility is to model the counts as Poisson, using offsets, factors, or continuous predictors to account for the variation in totals that you mentioned as the reason you scaled to proportions.
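A minimal sketch of the Poisson approach, on simulated data with hypothetical names (`Cell`, `Month`, `Dist`, `Sets`, `TotalSets`): counts are generated as multinomial within each month, then fitted as Poisson with either a month factor or an offset of the log monthly total. With a month factor, the fitted counts reproduce each month's total, which is what makes the Poisson fit a stand-in for the multinomial one.

```r
# Simulated data: each month's sets are split over cells multinomially.
set.seed(3)
n_months    <- 24
n_cells     <- 30
cell_dist   <- runif(n_cells, 0, 10)                    # distance per cell
month_total <- sample(200:2000, n_months, replace = TRUE)
prob        <- exp(-0.15 * cell_dist)
prob        <- prob / sum(prob)                         # cell probabilities

# n_cells x n_months matrix of counts, one multinomial draw per month
counts <- sapply(month_total,
                 function(n) rmultinom(1, size = n, prob = prob))

dat <- data.frame(
  Cell  = rep(seq_len(n_cells), times = n_months),
  Month = rep(seq_len(n_months), each = n_cells),
  Dist  = rep(cell_dist, times = n_months),
  Sets  = as.vector(counts)
)
dat$TotalSets <- month_total[dat$Month]

# Option 1: a month factor absorbs the monthly totals
fit_fac <- glm(Sets ~ Dist + factor(Month), family = poisson, data = dat)

# Option 2: the log of the monthly total enters as an offset
fit_off <- glm(Sets ~ Dist + offset(log(TotalSets)),
               family = poisson, data = dat)

c(factor_model = unname(coef(fit_fac)["Dist"]),
  offset_model = unname(coef(fit_off)["Dist"]))
```

In real data the two options need not agree this closely, and spatial or temporal dependence would still have to be checked separately.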