Sorry if this is a bit long..
I've been trying to fit models predicting the % of area infested in a field (response between 0 and 100%, total of 61 fields), with four explanatory variables, two factorials (planting
and monitor
) and two covariates (area2peri
and Tmin7_bef
).
I used the betareg()
with link = "cauchit"
(after comparing LL and AIC for the same predictors with different link functions), and tried the different combinations of parameters for both mean and dispersion sub-models. Would like some help understanding the possibilities of using different pseudo-$R^2$ for the different models. I am referring to the table on the UCLA website.
There are a few things that are not clear to me, and looking at betareg
documentation did not solve them..
- The
summary(betareg(...))
provides a pseudo-$R^2$ square that is the "squared correlation of linear predictor and link-transformed response". How is that calculated? and to which type of the pseudo-$R^2$ it relates (Efron's,McFadden's, Cox&Snell etc.)?
I tried calculating the different types myself, as shown in the betareg
documentation p.20, fitted a null (intercept only) and full models and extracted the LogLik for both. However there were several issues:
- The formula suggested for Mcfadden's $R^2$ in
betareg
is the inverse of the one presented in the UCLA website, the null model's LL is the numerator inbetareg
and the denominator in UCLA.. what am I missing?
Here are the results for the different $R^2$, as well as the full model summary:
Call:
betareg(formula = A2to5 ~ planting + area2peri_m + monitor + Tmin7_bef | planting + monitor, data = na.omit(pre_n0),
link = "cauchit")
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-2.1745 -0.6122 0.1443 0.9129 1.6225
Coefficients (mean model with cauchit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.766418 0.714323 -1.073 0.2833
plantinglate -1.013075 0.409167 -2.476 0.0133 *
area2peri_m -0.013529 0.006384 -2.119 0.0341 *
monitor1 0.712917 0.277825 2.566 0.0103 *
Tmin7_bef 0.111958 0.049065 2.282 0.0225 *
Phi coefficients (precision model with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9635 0.2577 3.738 0.000185 ***
plantinglate 0.4337 0.3331 1.302 0.192929
monitor1 -0.2295 0.3145 -0.730 0.465674
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 12.55 on 8 Df
Pseudo R-squared: 0.07703
Number of iterations: 31 (BFGS) + 1 (Fisher scoring)
Mfull <-as.vector(logLik(betareg_cauchit)) # 12.54867
Mintercept <- as.vector(logLik(betareg_cauchit_intonly)) #4.207168
n <- betareg_cauchit_intonly$n #61
## McFadden's pseudo-R-squared (explained portion of variance)- according to betreg documentation
1 -(Mintercept/Mfull) #0.6647
## McFadden's pseudo-R-squared - according to UCLA
1 -(Mfull/Mintercept) #-1.9827
## adjusted McFadden's pseudo-R-squared
1-(Mfull-8)/Mintercept #-0.08117
## Cox&Snell (improvment over null model)
1-((Mintercept/Mfull)^(2/n)) #0.0352
max_cox_senll <- 1-Mintercept^(2/n) #-0.0482
## Cragg & Uhler’s (improvment over null model)
(1-((Mintercept/Mfull)^(2/n)))/max_cox_senll #-0.73
Which brings me to:
-
In all of my calculation attempts, hopefully not erroneous, I did not achieve the summary's pseudo-$R^2$ (relating to the first question) but what is the meaning of a negative value? Is my fit that bad?
-
Finally, I get that there is no consensus of which type to use, or should one report the pseudo-$R^2$ at all, but how can I really judge if my models is able to explain something in the world?
Best Answer
The pseudo-R-squared reported by
betareg
is the squared correlation of the linear predictor and the link-transformed response (default link: logit). For themj_vd
model fromexample("MockJurors", package = "betareg")
that you cite, this can be replicated via:This pseudo-R-squared is not explicitly mentioned in the UCLA website but it is of type 3 (square of correlation).
The McFadden pseudo-R-squared provided on the UCLA website is for discrete data where the likelihood contributions are actually probabilities between 0 and 1 (and thus the corresponding log-likelihood are negative). This is not the case for beta regression. In this case Smithson & Verkuilen (2006) also call this "proportional reduction of error (PRE)". This has the inverse ratio of the log-likelihoods. See also: calculating a pseudo R2 value when deviance is negative
The negative value for the McFadden pseudo-R-squared is due to the inverse ratio of log-likelihoods. This only works if both are negative (e.g., as in logistic regression) but here both log-likelihoods are positive.
I generally find the pseudo-R-squared to be of rather limited use in beta regression. Whether the model fit is useful (for you) depends on many aspects, e.g., whether you mostly want a model for the mean or you are really interested in a probabilistic fit. A comprehensive answer for this is beyond the scope of this question, though. Alternative means of model comparisons could include information criteria (AIC or BIC), scores for the mean predicitons like (root) mean-squared error (MSE, RMSE), or probabilistic scoring rules (CRPS, log-score), or graphical model assessments (quantile-quantile plots, PIT histograms, etc.).