Solved – How to calculate pseudo-$R^2$ from R’s logistic regression

Tags: likelihood, logistic, pseudo-r-squared, r

Christopher Manning's writeup on logistic regression in R fits a model as follows:

ced.logr <- glm(ced.del ~ cat + follows + factor(class), 
  family=binomial)

Some output:

> summary(ced.logr)
Call:
glm(formula = ced.del ~ cat + follows + factor(class),
    family = binomial("logit"))
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.24384  -1.34325   0.04954   1.01488   6.40094

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.31827    0.12221 -10.787 < 2e-16
catd          -0.16931    0.10032  -1.688 0.091459
catm           0.17858    0.08952   1.995 0.046053
catn           0.66672    0.09651   6.908 4.91e-12
catv          -0.76754    0.21844  -3.514 0.000442
followsP       0.95255    0.07400  12.872 < 2e-16
followsV       0.53408    0.05660   9.436 < 2e-16
factor(class)2 1.27045    0.10320  12.310 < 2e-16
factor(class)3 1.04805    0.10355  10.122 < 2e-16
factor(class)4 1.37425    0.10155  13.532 < 2e-16
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 198.63  on 42  degrees of freedom
AIC: 446.10

Number of Fisher Scoring iterations: 4

He then goes into some detail about how to interpret coefficients, compare different models, and so on. Quite useful.

However, how much variance does the model account for? A Stata page on logistic regression says:

Technically, $R^2$ cannot be computed the same way in logistic regression as it is in OLS regression. The pseudo-$R^2$, in logistic regression, is defined as $1 - \frac{L_1}{L_0}$, where $L_0$ represents the log likelihood for the "constant-only" model and $L_1$ is the log likelihood for the full model with constant and predictors.

I understand this at a high level. The constant-only model would include none of the predictors (only the intercept term), and log likelihood measures how closely the fitted parameters match the data. In fact, Manning hints that the deviance might be $-2 \log L$. Perhaps the null deviance is $-2 \log L$ of the constant-only model and the residual deviance is $-2 \log L$ of the full model? However, I'm not crystal clear on it.
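If that reading is right, the Stata-style pseudo-$R^2$ should follow directly from the two deviances printed above. Here is the arithmetic I have in mind (assuming deviance really is $-2 \log L$, which I gather is exact for ungrouped binary responses, where the saturated model's log likelihood is zero):

L0 <- -958.66 / 2  # null deviance     -> log likelihood of the constant-only model
L1 <- -198.63 / 2  # residual deviance -> log likelihood of the full model
1 - L1/L0          # Stata-style pseudo-R^2, about 0.793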

Can someone verify how one actually computes the pseudo-$R^2$ in R using this example?

Best Answer

Don't forget the rms package, by Frank Harrell. You'll find everything you need for fitting and validating GLMs.

Here is a toy example (with only one predictor):

set.seed(101)
n <- 200
x <- rnorm(n)
a <- 1
b <- -2
# true success probabilities on the inverse-logit scale
p <- exp(a + b*x) / (1 + exp(a + b*x))
# draw Bernoulli outcomes with probability p
y <- factor(ifelse(runif(n) < p, 1, 0), levels=0:1)
mod1 <- glm(y ~ x, family=binomial)
summary(mod1)

This yields:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.8959     0.1969    4.55 5.36e-06 ***
x            -1.8720     0.2807   -6.67 2.56e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 258.98  on 199  degrees of freedom
Residual deviance: 181.02  on 198  degrees of freedom
AIC: 185.02

Now, using the lrm function,

require(rms)
mod1b <- lrm(y ~ x)

Printing the fitted object with print(mod1b) gives a lot of model fit indices, including Nagelkerke's $R^2$:

Logistic Regression Model

lrm(formula = y ~ x)

                      Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test            Indexes          Indexes       

Obs           200    LR chi2      77.96    R2       0.445    C       0.852    
 0             70    d.f.             1    g        2.054    Dxy     0.705    
 1            130    Pr(> chi2) <0.0001    gr       7.801    gamma   0.705    
max |deriv| 2e-08                          gp       0.319    tau-a   0.322    
                                           Brier    0.150                     


          Coef    S.E.   Wald Z Pr(>|Z|)
Intercept  0.8959 0.1969  4.55  <0.0001 
x         -1.8720 0.2807 -6.67  <0.0001 

Here, $R^2=0.445$; it is Nagelkerke's $R^2$, computed as $\left(1-\exp(-\text{LR}/n)\right)/\left(1-\exp(2L_0/n)\right)$, where LR is the $\chi^2$ statistic (comparing the two nested models you described), whereas the denominator is just the maximum attainable value of $R^2$. For a perfect model, we would expect $\text{LR}=-2L_0$ (that is, $L_1=0$), giving $R^2=1$.

By hand,

> mod0 <- update(mod1, .~.-x)
> lr.stat <- lrtest(mod0, mod1)
> (1-exp(-as.numeric(lr.stat$stats[1])/n))/(1-exp(2*as.numeric(logLik(mod0)/n)))
[1] 0.4445742
> mod1b$stats["R2"]
       R2 
0.4445742 
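Note that the $R^2$ reported by lrm is Nagelkerke's, not the $1 - \frac{L_1}{L_0}$ (McFadden) index quoted from the Stata page in the question. If you want that one, it drops out of the same two nested fits:

# McFadden's pseudo-R^2, i.e. 1 - L1/L0, from the same two fits
1 - as.numeric(logLik(mod1)) / as.numeric(logLik(mod0))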

Ewout W. Steyerberg discusses the use of $R^2$ with GLMs in his book Clinical Prediction Models (Springer, 2009, § 4.2.2, pp. 58-60). Basically, the relationship between the LR statistic and Nagelkerke's $R^2$ is approximately linear (and more nearly linear at low incidence). As discussed in the earlier thread I linked to in my comment, you can also use other measures, like the $c$ statistic, which is equivalent to the AUC (there is also a nice illustration in the above reference, see Figure 4.6).
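As a side note, the $c$ statistic shown in the lrm printout (C = 0.852 above) can be recovered directly from the fitted probabilities with Hmisc's somers2 function (Hmisc is attached along with rms); a minimal sketch:

# c statistic (equivalent to the AUC) from the fitted probabilities;
# this should reproduce the C = 0.852 in the lrm output above
library(Hmisc)
somers2(fitted(mod1), as.numeric(y) - 1)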