Beta Regression – Calculating Different Pseudo-$R^2$ for a Betareg Model

beta-regressionpseudo-r-squared

Sorry if this is a bit long..
I've been trying to fit models predicting the % of area infested in a field (response between 0 and 100%, total of 61 fields), with four explanatory variables, two factorials (planting and monitor) and two covariates (area2peri and Tmin7_bef).

I used the betareg() with link = "cauchit" (after comparing LL and AIC for the same predictors with different link functions), and tried the different combinations of parameters for both mean and dispersion sub-models. Would like some help understanding the possibilities of using different pseudo-$R^2$ for the different models. I am referring to the table on the UCLA website.

There are a few things that are not clear to me, and looking at betareg documentation did not solve them..

  1. The summary(betareg(...)) provides a pseudo-$R^2$ square that is the "squared correlation of linear predictor and link-transformed response". How is that calculated? and to which type of the pseudo-$R^2$ it relates (Efron's,McFadden's, Cox&Snell etc.)?

I tried calculating the different types myself, as shown in the betareg documentation p.20, fitted a null (intercept only) and full models and extracted the LogLik for both. However there were several issues:

  1. The formula suggested for Mcfadden's $R^2$ in betareg is the inverse of the one presented in the UCLA website, the null model's LL is the numerator in betareg and the denominator in UCLA.. what am I missing?

Here are the results for the different $R^2$, as well as the full model summary:

    Call:
    betareg(formula = A2to5 ~ planting + area2peri_m + monitor + Tmin7_bef | planting + monitor, data = na.omit(pre_n0), 
        link = "cauchit")
    
    Standardized weighted residuals 2:
        Min      1Q  Median      3Q     Max 
    -2.1745 -0.6122  0.1443  0.9129  1.6225 
    
    Coefficients (mean model with cauchit link):
                  Estimate Std. Error z value Pr(>|z|)  
    (Intercept)  -0.766418   0.714323  -1.073   0.2833  
    plantinglate -1.013075   0.409167  -2.476   0.0133 *
    area2peri_m  -0.013529   0.006384  -2.119   0.0341 *
    monitor1      0.712917   0.277825   2.566   0.0103 *
    Tmin7_bef     0.111958   0.049065   2.282   0.0225 *
    
    Phi coefficients (precision model with log link):
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)    0.9635     0.2577   3.738 0.000185 ***
    plantinglate   0.4337     0.3331   1.302 0.192929    
    monitor1      -0.2295     0.3145  -0.730 0.465674    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 12.55 on 8 Df
Pseudo R-squared: 0.07703
Number of iterations: 31 (BFGS) + 1 (Fisher scoring)


Mfull <-as.vector(logLik(betareg_cauchit)) # 12.54867
Mintercept <- as.vector(logLik(betareg_cauchit_intonly)) #4.207168
n <- betareg_cauchit_intonly$n #61

## McFadden's pseudo-R-squared (explained portion of variance)- according to betreg documentation
1 -(Mintercept/Mfull) #0.6647

## McFadden's pseudo-R-squared - according to UCLA
1 -(Mfull/Mintercept) #-1.9827

## adjusted McFadden's pseudo-R-squared
1-(Mfull-8)/Mintercept #-0.08117

## Cox&Snell (improvment over null model)
1-((Mintercept/Mfull)^(2/n)) #0.0352

max_cox_senll <- 1-Mintercept^(2/n) #-0.0482

## Cragg & Uhler’s (improvment over null model)
(1-((Mintercept/Mfull)^(2/n)))/max_cox_senll #-0.73

Which brings me to:

  1. In all of my calculation attempts, hopefully not erroneous, I did not achieve the summary's pseudo-$R^2$ (relating to the first question) but what is the meaning of a negative value? Is my fit that bad?

  2. Finally, I get that there is no consensus of which type to use, or should one report the pseudo-$R^2$ at all, but how can I really judge if my models is able to explain something in the world?

Best Answer

  1. The pseudo-R-squared reported by betareg is the squared correlation of the linear predictor and the link-transformed response (default link: logit). For the mj_vd model from example("MockJurors", package = "betareg") that you cite, this can be replicated via:

    summary(mj_vd) ## reports: Pseudo R-squared: 0.03885
    cor(qlogis(MockJurors$confidence), predict(mj_vd, type = "link"))^2 ## [1] 0.03885128
    

    This pseudo-R-squared is not explicitly mentioned in the UCLA website but it is of type 3 (square of correlation).

  2. The McFadden pseudo-R-squared provided on the UCLA website is for discrete data where the likelihood contributions are actually probabilities between 0 and 1 (and thus the corresponding log-likelihood are negative). This is not the case for beta regression. In this case Smithson & Verkuilen (2006) also call this "proportional reduction of error (PRE)". This has the inverse ratio of the log-likelihoods. See also: calculating a pseudo R2 value when deviance is negative

  3. The negative value for the McFadden pseudo-R-squared is due to the inverse ratio of log-likelihoods. This only works if both are negative (e.g., as in logistic regression) but here both log-likelihoods are positive.

  4. I generally find the pseudo-R-squared to be of rather limited use in beta regression. Whether the model fit is useful (for you) depends on many aspects, e.g., whether you mostly want a model for the mean or you are really interested in a probabilistic fit. A comprehensive answer for this is beyond the scope of this question, though. Alternative means of model comparisons could include information criteria (AIC or BIC), scores for the mean predicitons like (root) mean-squared error (MSE, RMSE), or probabilistic scoring rules (CRPS, log-score), or graphical model assessments (quantile-quantile plots, PIT histograms, etc.).