Beta Regression – Calculating Different Pseudo-$R^2$ for a Betareg Model

beta-regression, pseudo-r-squared

Sorry if this is a bit long.
I've been trying to fit models predicting the percentage of area infested in a field (response between 0 and 100%, 61 fields in total), with four explanatory variables: two factors (planting and monitor) and two covariates (area2peri and Tmin7_bef).

I used betareg() with link = "cauchit" (after comparing the log-likelihood and AIC for the same predictors under different link functions) and tried different combinations of predictors for both the mean and dispersion sub-models. I would like some help understanding how the different pseudo-$R^2$ measures can be used for these models. I am referring to the table on the UCLA website.

There are a few things that are not clear to me, and the betareg documentation did not resolve them.

summary(betareg(...)) reports a pseudo-$R^2$ that is the "squared correlation of linear predictor and link-transformed response". How is that calculated, and to which type of pseudo-$R^2$ does it correspond (Efron's, McFadden's, Cox & Snell, etc.)?

I tried calculating the different types myself, as shown in the betareg documentation (p. 20): I fitted a null (intercept-only) model and the full model and extracted the log-likelihood of each. However, there were several issues:

The formula suggested for McFadden's $R^2$ in the betareg documentation is the inverse of the one presented on the UCLA website: the null model's log-likelihood is the numerator in betareg and the denominator at UCLA. What am I missing?

Here are the results for the different $R^2$, as well as the full model summary:

    Call:
    betareg(formula = A2to5 ~ planting + area2peri_m + monitor + Tmin7_bef | planting + monitor, data = na.omit(pre_n0), 
        link = "cauchit")
    
    Standardized weighted residuals 2:
        Min      1Q  Median      3Q     Max 
    -2.1745 -0.6122  0.1443  0.9129  1.6225 
    
    Coefficients (mean model with cauchit link):
                  Estimate Std. Error z value Pr(>|z|)  
    (Intercept)  -0.766418   0.714323  -1.073   0.2833  
    plantinglate -1.013075   0.409167  -2.476   0.0133 *
    area2peri_m  -0.013529   0.006384  -2.119   0.0341 *
    monitor1      0.712917   0.277825   2.566   0.0103 *
    Tmin7_bef     0.111958   0.049065   2.282   0.0225 *
    
    Phi coefficients (precision model with log link):
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)    0.9635     0.2577   3.738 0.000185 ***
    plantinglate   0.4337     0.3331   1.302 0.192929    
    monitor1      -0.2295     0.3145  -0.730 0.465674    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

    Type of estimator: ML (maximum likelihood)
    Log-likelihood: 12.55 on 8 Df
    Pseudo R-squared: 0.07703
    Number of iterations: 31 (BFGS) + 1 (Fisher scoring)


    Mfull <- as.vector(logLik(betareg_cauchit)) # 12.54867
    Mintercept <- as.vector(logLik(betareg_cauchit_intonly)) # 4.207168
    n <- betareg_cauchit_intonly$n # 61

    ## McFadden's pseudo-R-squared (explained portion of variance), according to the betareg documentation
    1 - (Mintercept/Mfull) # 0.6647

    ## McFadden's pseudo-R-squared, according to UCLA
    1 - (Mfull/Mintercept) # -1.9827

    ## adjusted McFadden's pseudo-R-squared
    1 - (Mfull - 8)/Mintercept # -0.08117

    ## Cox & Snell (improvement over null model)
    1 - ((Mintercept/Mfull)^(2/n)) # 0.0352

    max_cox_snell <- 1 - Mintercept^(2/n) # -0.0482

    ## Cragg & Uhler's (improvement over null model)
    (1 - ((Mintercept/Mfull)^(2/n)))/max_cox_snell # -0.73

Which brings me to:

In all of my calculation attempts (hopefully not erroneous) I did not reproduce the summary's pseudo-$R^2$ (relating to the first question). But what is the meaning of a negative value? Is my fit that bad?
Finally, I understand that there is no consensus on which type to use, or on whether one should report a pseudo-$R^2$ at all, but how can I really judge whether my model is able to explain something in the world?

Best Answer

The pseudo-R-squared reported by betareg is the squared correlation of the linear predictor and the link-transformed response (default link: logit). For the mj_vd model from example("MockJurors", package = "betareg") that you cite, this can be replicated via:
```
summary(mj_vd) ## reports: Pseudo R-squared: 0.03885
cor(qlogis(MockJurors$confidence), predict(mj_vd, type = "link"))^2 ## [1] 0.03885128
```
This pseudo-R-squared is not explicitly mentioned on the UCLA website, but it is of type 3 (square of correlation).
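For a model fitted with a different link, such as the cauchit link in the question, the same recipe should apply with the matching link function in place of qlogis(). A minimal base-R sketch, assuming hypothetical vectors `y` (responses strictly inside (0, 1)) and `eta` (linear predictor values, which in practice would come from predict(model, type = "link")); the helper name `pseudo_r2_link` is made up for illustration:

```r
# Squared correlation between the link-transformed response and the
# linear predictor. `pseudo_r2_link` is a hypothetical helper name.
pseudo_r2_link <- function(y, eta, link = "logit") {
  linkfun <- make.link(link)$linkfun   # logit for "logit", qcauchy for "cauchit"
  cor(linkfun(y), eta)^2
}

# Toy data (assumed, not from the question): made-up linear predictor
# values standing in for predict(model, type = "link").
set.seed(1)
eta <- rnorm(50)
y <- plogis(eta + rnorm(50))           # noisy responses, strictly in (0, 1)

r2_logit   <- pseudo_r2_link(y, eta, link = "logit")
r2_cauchit <- pseudo_r2_link(y, eta, link = "cauchit")
```

Note that a response recorded as a percentage would first have to be rescaled to the (0, 1) interval before the link function is applied.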
The McFadden pseudo-R-squared provided on the UCLA website is for discrete data, where the likelihood contributions are actually probabilities between 0 and 1 (and thus the corresponding log-likelihoods are negative). This is not the case for beta regression. For that case betareg documents a measure that Smithson & Verkuilen (2006) also call "proportional reduction of error (PRE)" and that uses the inverse ratio of the log-likelihoods. See also: "calculating a pseudo R2 value when deviance is negative".
The negative value for the McFadden pseudo-R-squared results from using the UCLA ratio of log-likelihoods (full over null). That form only works if both log-likelihoods are negative (e.g., as in logistic regression), but here both are positive.
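To make the sign issue concrete, here is a small sketch that plugs in the two log-likelihoods reported in the question, next to a pair of made-up negative log-likelihoods of the kind a discrete model would give. The variable names are illustrative only.

```r
# Log-likelihoods reported in the question (continuous response, both positive)
ll_full <- 12.54867
ll_null <- 4.207168
n <- 61

mcfadden_ucla <- 1 - ll_full / ll_null   # UCLA form: negative here
pre_inverse   <- 1 - ll_null / ll_full   # inverse ratio (PRE form): in (0, 1) here

# Illustrative (made-up) negative log-likelihoods, as in a discrete model
ll_full_d <- -90
ll_null_d <- -130
mcfadden_discrete <- 1 - ll_full_d / ll_null_d   # UCLA form now lies in (0, 1)

round(c(ucla = mcfadden_ucla, pre = pre_inverse, discrete = mcfadden_discrete), 4)
# ucla -1.9827, pre 0.6647, discrete 0.3077

# Aside: Cox & Snell is defined through the likelihood ratio, so it depends
# only on the difference of the log-likelihoods, not on their sign.
cox_snell_lr <- 1 - exp(-2 * (ll_full - ll_null) / n)
```

The last line differs from the Cox & Snell attempt in the question, which raised the ratio of the log-likelihoods (rather than of the likelihoods) to the power 2/n.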
I generally find the pseudo-R-squared to be of rather limited use in beta regression. Whether the model fit is useful (for you) depends on many aspects, e.g., whether you mostly want a model for the mean or you are really interested in a probabilistic fit. A comprehensive answer to this is beyond the scope of this question, though. Alternative means of model comparison could include information criteria (AIC or BIC), scores for the mean predictions like the (root) mean-squared error (MSE, RMSE), probabilistic scoring rules (CRPS, log-score), or graphical model assessments (quantile-quantile plots, PIT histograms, etc.).
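As a rough sketch of the information-criterion route, AIC and BIC can be computed directly from the log-likelihoods quoted in the question. The helper `ic` is a made-up name, and the parameter count of 2 for the intercept-only model (one mean intercept plus a constant precision) is an assumption, since that model's summary is not shown.

```r
# Information criteria from a log-likelihood, parameter count k and sample size n.
# `ic` is a hypothetical helper name.
ic <- function(loglik, k, n) {
  c(AIC = 2 * k - 2 * loglik,
    BIC = log(n) * k - 2 * loglik)
}

n <- 61
full <- ic(loglik = 12.54867, k = 8, n = n)   # 8 Df as printed in the summary
null <- ic(loglik = 4.207168, k = 2, n = n)   # k = 2 is an assumption

rbind(full = full, null = null)               # smaller values are preferred
```

With fitted model objects at hand, calling AIC() on them is the more usual route; the sketch only shows what goes into the numbers.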

Related Solutions

Solved – How to calculate pseudo-$R^2$ from R’s logistic regression

Don't forget the rms package, by Frank Harrell. You'll find everything you need for fitting and validating GLMs.

Here is a toy example (with only one predictor):

    set.seed(101)
    n <- 200
    x <- rnorm(n)
    a <- 1
    b <- -2
    p <- exp(a+b*x)/(1+exp(a+b*x))
    y <- factor(ifelse(runif(n)<p, 1, 0), levels=0:1)
    mod1 <- glm(y ~ x, family=binomial)
    summary(mod1)

This yields:

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)   0.8959     0.1969    4.55 5.36e-06 ***
    x            -1.8720     0.2807   -6.67 2.56e-11 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 258.98  on 199  degrees of freedom
    Residual deviance: 181.02  on 198  degrees of freedom
    AIC: 185.02

Now, using the lrm function,

    library(rms)
    mod1b <- lrm(y ~ x)

You soon get a lot of model fit indices, including Nagelkerke $R^2$, with print(mod1b):

    Logistic Regression Model

    lrm(formula = y ~ x)

                          Model Likelihood     Discrimination    Rank Discrim.    
                             Ratio Test            Indexes          Indexes       

    Obs           200    LR chi2      77.96    R2       0.445    C       0.852    
     0             70    d.f.             1    g        2.054    Dxy     0.705    
     1            130    Pr(> chi2) <0.0001    gr       7.801    gamma   0.705    
    max |deriv| 2e-08                          gp       0.319    tau-a   0.322    
                                               Brier    0.150                     


              Coef    S.E.   Wald Z Pr(>|Z|)
    Intercept  0.8959 0.1969  4.55  <0.0001 
    x         -1.8720 0.2807 -6.67  <0.0001

Here, $R^2=0.445$ and it is computed as $\left(1-\exp(-\text{LR}/n)\right)/\left(1-\exp(-(-2L_0)/n)\right)$, where LR is the $\chi^2$ statistic (comparing the two nested models you described) and $L_0$ is the log-likelihood of the null model; the denominator is just the maximum value the numerator can attain. For a perfect model, we would expect $\text{LR}=-2L_0$, that is $R^2=1$.

By hand,

    mod0 <- update(mod1, .~.-x)
    lr.stat <- lrtest(mod0, mod1)
    (1-exp(-as.numeric(lr.stat$stats[1])/n))/(1-exp(2*as.numeric(logLik(mod0)/n)))
    ## [1] 0.4445742
    mod1b$stats["R2"]
    ##        R2 
    ## 0.4445742
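The same quantity can also be obtained without rms, from the deviances that glm() stores. This is a sketch assuming ungrouped 0/1 responses, for which the saturated log-likelihood is 0, so the null deviance equals $-2L_0$ and the difference of deviances equals LR. The data-generating code repeats the toy example, written with plogis() and an integer response.

```r
# Recreate the toy data and the logistic fit
set.seed(101)
n <- 200
x <- rnorm(n)
p <- plogis(1 - 2 * x)                 # same as exp(a+b*x)/(1+exp(a+b*x)) with a = 1, b = -2
y <- as.integer(runif(n) < p)
mod1 <- glm(y ~ x, family = binomial)

# For ungrouped binary data: deviance = -2 * logLik
lr     <- mod1$null.deviance - mod1$deviance   # LR chi-squared statistic
r2_cs  <- 1 - exp(-lr / n)                     # Cox & Snell R^2
r2_max <- 1 - exp(-mod1$null.deviance / n)     # its maximum attainable value
r2_nagelkerke <- r2_cs / r2_max
r2_nagelkerke
```

The result is expected to agree with the R2 entry reported by lrm() for the same data.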

Ewout W. Steyerberg discussed the use of $R^2$ with GLM, in his book Clinical Prediction Models (Springer, 2009, § 4.2.2 pp. 58-60). Basically, the relationship between the LR statistic and Nagelkerke's $R^2$ is approximately linear (it will be more linear with low incidence). Now, as discussed on the earlier thread I linked to in my comment, you can use other measures like the $c$ statistic which is equivalent to the AUC statistic (there's also a nice illustration in the above reference, see Figure 4.6).

Solved – post-hoc test for betareg model R

Post-hoc testing for beta regressions works in the same way that it does for other maximum likelihood (regression) models. In R there are various packages with object-oriented implementations of such procedures, e.g., lmtest, car, multcomp among others.

For testing individual hypotheses it is probably easiest to use linearHypothesis() from car. For example, for equality of the second and third of the site effects in the mean regression:

    linearHypothesis(mymodel, "sitesite2 = sitesite3")
    ## Linear hypothesis test
    ## 
    ## Hypothesis:
    ## sitesite2 - sitesite3 = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: prot ~ fert + site + cut | site
    ## 
    ##   Res.Df Df  Chisq Pr(>Chisq)
    ## 1      2                     
    ## 2      1  1 2.2491     0.1337

And for the dispersion (phi) submodel, the coefficients have to be prefixed with (phi)_ as shown in coef(mymodel):

    linearHypothesis(mymodel, "(phi)_sitesite2 = (phi)_sitesite3")
    ## Linear hypothesis test
    ## 
    ## Hypothesis:
    ## phi)_sitesite2 - phi)_sitesite3 = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: prot ~ fert + site + cut | site
    ## 
    ##   Res.Df Df  Chisq Pr(>Chisq)  
    ## 1      2                       
    ## 2      1  1 4.7682    0.02899 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
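The Chisq column in these tables is a Wald statistic, which as far as I know linearHypothesis() builds from the model's coefficient vector and covariance matrix. A minimal base-R sketch of that computation for a single linear contrast, using made-up coefficient and covariance values (hypothetical, not taken from mymodel) and a hypothetical helper `wald_contrast`:

```r
# Wald chi-squared test of H0: sum(contrast * beta) = rhs.
# `wald_contrast` is a hypothetical helper, not part of car.
wald_contrast <- function(beta, V, contrast, rhs = 0) {
  est   <- sum(contrast * beta) - rhs
  se2   <- drop(t(contrast) %*% V %*% contrast)   # variance of the contrast
  chisq <- est^2 / se2
  c(Chisq = chisq, Df = 1,
    p.value = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# Made-up estimates and covariance matrix for two site effects
beta <- c(sitesite2 = 0.10, sitesite3 = 0.40)
V <- matrix(c(0.04, 0.01,
              0.01, 0.05), nrow = 2, byrow = TRUE)

res <- wald_contrast(beta, V, contrast = c(1, -1))  # H0: sitesite2 = sitesite3
res
```

In practice `beta` and `V` would be the relevant entries of coef(mymodel) and vcov(mymodel), with `contrast` padded with zeros for all other coefficients.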

For testing multiple hypotheses (e.g., Dunnett or Tukey type contrasts) one can use glht() from multcomp in a similar fashion.

    summary(glht(mymodel, linfct = c("sitesite2 = 0", "sitesite3 = 0")))
    ##          Simultaneous Tests for General Linear Hypotheses
    ## 
    ## Fit: betareg(formula = prot ~ fert + site + cut | site, link = "loglog")
    ## 
    ## Linear Hypotheses:
    ##                Estimate Std. Error z value Pr(>|z|)
    ## sitesite2 == 0  0.01488    0.03448   0.431    0.806
    ## sitesite3 == 0  0.04103    0.03494   1.174    0.320
    ## (Adjusted p values reported -- single-step method)
    summary(glht(mymodel, linfct = c("`(phi)_sitesite2` = 0", "`(phi)_sitesite3` = 0")))
    ##          Simultaneous Tests for General Linear Hypotheses
    ## 
    ## Fit: betareg(formula = prot ~ fert + site + cut | site, link = "loglog")
    ## 
    ## Linear Hypotheses:
    ##                      Estimate Std. Error z value Pr(>|z|)   
    ## (phi)_sitesite2 == 0    1.084      1.080   1.004  0.49442   
    ## (phi)_sitesite3 == 0    3.443      1.155   2.982  0.00552 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## (Adjusted p values reported -- single-step method)

Note that for glht() the phi coefficient names have to be quoted. Also the mcp() interface for constructing sets of hypotheses does not work for betareg because there are two submodels (mean and phi) and not just one.
