Solved – Proportion data – beta distribution v. GLM with binomial distribution and logit link

Tags: beta distribution, generalized linear model, proportion

I have a fisheries dataset for which I have calculated a value for each grid cell on a map. The value is the proportion of the total fishing sets that fall in that cell for each month/year. The values therefore lie strictly between 0 and 1 (the observed range is very skewed: 0.0005347594 to 0.1933216169). I am interested in whether the proportion of fishing sets is higher close to a specific location over time.

I have read that there are two ways to do this – either a GLM with a binomial family and logit link, or a beta regression.

I have tried both of these methods in R:

Binomial GLM:

m1 <- glm(PercentTotalSets ~ factor(SetYear) + DayLength + DistTZCF + DistNWHI, 
          family = binomial(link='logit'), data = Totals_CellId)

Beta:

library(betareg)  # provides betareg()

BetaGLM <- betareg(PercentTotalSets ~ factor(SetYear) + DayLength + DistTZCF + DistNWHI, 
                   data = Totals_CellId)

With the binomial GLM, I get very different results from those of a GLM with a gamma distribution (e.g., DistNWHI is not significant, with a p-value of 0.9, whereas under the gamma GLM it was significant). With the beta regression, I get results very similar to those of the gamma GLM (e.g., DistNWHI is significant, with a similar p-value).

I think that the beta regression is the correct method, because I do not have 0s or 1s and I need to set bounds, but I am not sure if this is correct.

I'd appreciate any and all advice.

Best Answer

With count data of that form, I'd actually fit a multinomial model (at least to start with*), because the numerators of several proportions all contribute to the same denominator: each '+1' count could have gone into any of $k$ cells ('sets').

(e.g. see here)

You'll need the denominator you divided by; the model is still for the proportion, but the variability depends on the denominator you used to obtain the proportion.

* A particular concern is that you'll have dependence over both space and time: adjacent locations and adjacent times will tend to be more related than more distant locations or times, at least if there's unmodelled variation that would be accounted for by such effects.

Once you have fitted a multinomial model, you would want to assess whether you have both the variance and the correlation modelled reasonably well -- you might need mixed models (GLMM) and possibly also to account for potential remaining overdispersion in addition.

You will find a number of discussions of multinomial models here on CV.

Another possibility is to model the counts as Poisson, by allowing for offsets, factors or continuous predictors related to the variation you mentioned as the reason you scaled to proportions.
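As a rough illustration of the Poisson-with-offset idea, here is a minimal sketch on simulated data. The data frame `sim` and its columns (`dist_km`, `total_sets`, `cell_sets`) are hypothetical stand-ins, not the asker's variables, and the sketch ignores the spatial and temporal dependence mentioned above.

```r
# Simulated stand-in data; every name and value here is hypothetical.
set.seed(1)
n_cells <- 200
sim <- data.frame(
  dist_km    = runif(n_cells, 0, 500),                     # distance to the location of interest
  total_sets = sample(200:2000, n_cells, replace = TRUE)   # total sets in that month/year
)

# Assumed truth: the share of sets falling in a cell declines with distance.
true_share <- exp(-4 - 0.004 * sim$dist_km)
sim$cell_sets <- rpois(n_cells, lambda = sim$total_sets * true_share)

# The count is the response. log(total_sets) enters as an offset (coefficient
# fixed at 1), so the other coefficients describe the log of the proportion.
pois_fit <- glm(cell_sets ~ dist_km + offset(log(total_sets)),
                family = poisson(link = "log"), data = sim)
summary(pois_fit)
```

Because the denominator stays in the model through the offset, cells backed by many sets carry more information than cells backed by few, which is lost when only the ratio is modelled.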

Related Solutions

Solved – Significance in beta regression and glm binomial

The binomial is for modeling Bernoulli variables (i.e., binary) or binomial variables (i.e., the number of successes from a certain number of independent trials). So it should not be applied directly to computed rates (successes divided by trials). Instead, glm() expects either a two-column matrix of successes and failures, or the proportions together with the number of trials supplied as weights. Consequently, your glm() call above yields the warning:

Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

The beta regression model, on the other hand, is intended for situations where you only have a direct rate that does not correspond to success rates from a known number of independent trials. It uses a different likelihood and hence can lead to different results. Specifically, it has an additional precision parameter which is related to the variance of the observations.
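To make the role of the precision parameter concrete, the mean/precision parameterization that betareg reports (mean $\mu$, precision $\phi$, the "(phi)" coefficient in its summary) can be written out as follows. The symbols $y$, $\mu$ and $\phi$ are notation introduced here for illustration.

$$
f(y;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\big((1-\mu)\phi\big)}\; y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \qquad 0<y<1,
$$

$$
\mathrm{E}(y)=\mu, \qquad \mathrm{Var}(y)=\frac{\mu(1-\mu)}{1+\phi}.
$$

For a given mean, a larger $\phi$ means a smaller variance. By comparison, a binomial proportion based on $n$ trials has variance $\mu(1-\mu)/n$, with $n$ known rather than estimated.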

Thus, if your proportions above come from a known number of independent trials, then supply this information and use a binomial GLM. Otherwise you can consider beta regression.
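A minimal sketch of how the trial information can be supplied, using simulated data (the data frame `d` and its columns are hypothetical). The two calls below are equivalent ways of specifying the same binomial model in base R.

```r
# Simulated stand-in data; names and values are hypothetical.
set.seed(2)
n <- 100
d <- data.frame(x = rnorm(n),
                trials = sample(20:60, n, replace = TRUE))
d$successes <- rbinom(n, size = d$trials, prob = plogis(-1 + 0.5 * d$x))
d$prop <- d$successes / d$trials

# Form 1: two-column response of successes and failures.
fit_counts <- glm(cbind(successes, trials - successes) ~ x,
                  family = binomial, data = d)

# Form 2: proportion as response, number of trials as prior weights.
fit_weights <- glm(prop ~ x, family = binomial, weights = trials, data = d)

# Both forms should give the same coefficient estimates.
all.equal(coef(fit_counts), coef(fit_weights))
```

Neither call should trigger the "non-integer #successes" warning, because the number of successes implied by each observation is a whole number.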

Additional remark: As your Y above supplies proportions directly, the binomial likelihood does not fit. Specifically, the variance of the observations will be overestimated. If you use a quasi-binomial with an additional dispersion parameter, the model still won't be really appropriate, but its results will be much closer to those of the beta regression.

R> summary(betareg(Y ~ X))

Call:
betareg(formula = Y ~ X)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-1.7480 -0.8042 -0.1105  0.8864  1.8896 

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.29444    0.08715   3.378 0.000729 ***
X            0.27270    0.09068   3.007 0.002637 ** 

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)   
(phi)    41.06      15.92   2.579   0.0099 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 15.15 on 3 Df
Pseudo R-squared: 0.4149
Number of iterations: 34 (BFGS) + 2 (Fisher scoring) 

R> summary(glm(Y ~ X, family = quasibinomial))

Call:
glm(formula = Y ~ X, family = quasibinomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.25696  -0.11263  -0.01107   0.13491   0.25792  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.29284    0.09523   3.075   0.0106 *
X            0.27078    0.09910   2.732   0.0195 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.02836306)

    Null deviance: 0.52867  on 12  degrees of freedom
Residual deviance: 0.31489  on 11  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3

Solved – GLM with logit link and Gaussian family to predict a continuous DV between 0 and 1

You seem to want to use a fractional logit, i.e. a quasi-likelihood model for a proportion. The key here is that it is a quasi-likelihood model, so the family refers to the variance function and nothing else. In quasi-likelihood that variance is a nuisance parameter, which does not have to be correctly specified in your model if your dataset is large enough. So I would stick with the usual family for a fractional logit model, and use the binomial family.
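A minimal sketch of a fractional logit fit on simulated data follows. The variables `x` and `y` are hypothetical. The quasibinomial family is used here only because it has the same mean and variance functions as binomial without the non-integer warning, and the robust (sandwich) standard errors are computed by hand as an assumption about how one might guard against a misspecified variance, not as something prescribed by the answer.

```r
# Simulated continuous proportions in (0, 1); all names are hypothetical.
set.seed(3)
n <- 500
x <- rnorm(n)
mu <- plogis(-2 + 0.4 * x)
y <- rbeta(n, shape1 = mu * 20, shape2 = (1 - mu) * 20)

# Fractional logit: logit link with the binomial variance function.
frac_fit <- glm(y ~ x, family = quasibinomial(link = "logit"))

# Hand-computed HC0 sandwich covariance (valid for the canonical logit link):
# bread = (X' W X)^{-1} with W = mu(1 - mu); meat = X' diag((y - mu)^2) X.
X <- model.matrix(frac_fit)
mu_hat <- fitted(frac_fit)
bread <- solve(crossprod(X, X * (mu_hat * (1 - mu_hat))))
meat <- crossprod(X * (y - mu_hat))
robust_vcov <- bread %*% meat %*% bread
robust_se <- sqrt(diag(robust_vcov))

cbind(estimate = coef(frac_fit), robust_se = robust_se)
```

The point estimates are identical to those from family = binomial. Only the standard errors differ, which is where the variance assumption matters.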
