Solved – Logistic Regression – Getting Pearson Standardized Residuals in R vs Stata

Tags: logistic, r, residuals, stata

I am working on an assignment involving a logistic regression model, where I need to plot the Pearson standardized residuals against one of the predictors. Here's the basic setup:

model <- glm(outcome ~ predictor1 + predictor2, family = binomial(link = "logit"))
res <- residuals(model, type = "pearson")

When looking at the residuals' distribution, I see something totally different from what my colleagues get in Stata (using predict with the rstandard option). Their residuals are more or less normal, whereas mine show a gap in the values (not a single residual lies between -0.05 and 1.15). That does make sense in the context of logistic regression, especially since the maximum predicted probability is not very high (38%).
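For reference, a gap like this falls straight out of the Pearson residual formula for ungrouped 0/1 data: each residual is $(y - \hat p)/\sqrt{\hat p(1-\hat p)}$, which takes only one of two values per fitted probability. A minimal sketch with made-up fitted probabilities capped near the 38% maximum mentioned above:

```r
# Pearson residual for a Bernoulli observation: (y - p) / sqrt(p * (1 - p)).
# For each fitted p it takes only two values, so a gap opens around zero.
p <- seq(0.01, 0.38, by = 0.01)          # made-up fitted probabilities, max 38%
res_fail <- (0 - p) / sqrt(p * (1 - p))  # residuals when y = 0
res_succ <- (1 - p) / sqrt(p * (1 - p))  # residuals when y = 1
range(res_fail)  # all negative, the largest near -0.10
range(res_succ)  # none below sqrt((1 - 0.38) / 0.38), about 1.28
```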

I'd like to understand what's happening here. What is Stata doing with those residuals that R isn't?

Best Answer

For logistic regression, Stata defines residuals and related quantities to be those you'd get if you grouped all the observations sharing the same values of all the predictor variables, counted up the successes and failures for those observations, and fitted a logistic regression model to the resulting binomial data instead of the original Bernoulli data. This is a useful thing to do because, when there are multiple observations per covariate pattern, the resulting residuals behave more like those you're used to from least squares.

To get the same residuals from R, I suspect you will need to group the data and fit the model to the grouped data. But I'm not clear whether R uses the same definition of 'standardized residuals' as Stata, as I don't presently have access to the textbooks that the R documentation references.
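A base-R sketch of that grouping, assuming a data frame with the variable names from the question (the data here are simulated purely for illustration):

```r
# Simulated stand-in for the asker's data (variable names assumed from the question)
set.seed(1)
df <- data.frame(predictor1 = rep(0:1, each = 50),
                 predictor2 = rep(0:1, times = 50))
df$outcome <- rbinom(100, 1, plogis(-2 + df$predictor1 + df$predictor2))

# Collapse to one row per covariate pattern: success and trial counts
key <- interaction(df$predictor1, df$predictor2, drop = TRUE)
grouped <- data.frame(predictor1 = tapply(df$predictor1, key, `[`, 1),
                      predictor2 = tapply(df$predictor2, key, `[`, 1),
                      successes  = tapply(df$outcome, key, sum),
                      trials     = tapply(df$outcome, key, length))

# Refit on the grouped (binomial) data, as Stata effectively does
gmodel <- glm(cbind(successes, trials - successes) ~ predictor1 + predictor2,
              family = binomial, data = grouped)
res_grouped <- residuals(gmodel, type = "pearson")  # one residual per pattern
```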

Here's an excerpt from the 'Methods and formulas' section of the Stata manual entry for 'logistic postestimation' (one thing I like about Stata is that the manuals give the full formulas for everything):

Define $M_j$ for each observation as the total number of observations sharing $j$’s covariate pattern. Define $Y_j$ as the total number of positive responses among observations sharing $j$’s covariate pattern.

The Pearson residual for the $j$th observation is defined as $$r_j = \frac{Y_j - M_j p_j}{\sqrt{M_j p_j(1 - p_j)}}$$ ...
The unadjusted diagonal elements of the hat matrix $h_{Uj}$ are given by $h_{Uj} = (\mathbf{XVX}')_{jj}$, where $\mathbf{V}$ is the estimated covariance matrix of the parameters. The adjusted diagonal elements $h_j$ created by the hat option are then $h_j = M_j p_j(1 - p_j)h_{Uj}$.
The standardized Pearson residual $r_{Sj}$ is $r_j / \sqrt{1 - h_j}.$
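In R, `rstandard(..., type = "pearson")` applies exactly this $r_j / \sqrt{1 - h_j}$ division (the dispersion is 1 for the binomial family), with `hatvalues()` supplying the adjusted $h_j$, so on grouped data it should reproduce Stata's standardized Pearson residuals. A sketch on a toy grouped dataset (the numbers are made up for illustration):

```r
# Toy grouped binomial data: one row per covariate pattern
grouped <- data.frame(x = 0:3,
                      successes = c(2, 5, 9, 14),
                      trials = rep(20, 4))
gmodel <- glm(cbind(successes, trials - successes) ~ x,
              family = binomial, data = grouped)

r  <- residuals(gmodel, type = "pearson")  # r_j from the formula above
h  <- hatvalues(gmodel)                    # adjusted h_j = M_j p_j (1 - p_j) h_Uj
rs <- r / sqrt(1 - h)                      # standardized Pearson residual

# Matches R's built-in standardized Pearson residuals
all.equal(unname(rs), unname(rstandard(gmodel, type = "pearson")))
```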