Solved – Residuals for logistic regression and Cook’s distance

cooks-distance diagnostic logistic regression residuals

  1. Are there any particular assumptions regarding the errors for logistic regression such as the constant variance of the error terms and the normality of the residuals?

  2. Also typically when you have points that have a Cook's distance larger than 4/n, do you remove them? If you do remove them, how can you tell if the model with the removed points is better?

Best Answer

I don't know if I can give you a complete answer, but I can give you some thoughts that may be helpful. First, all statistical models / tests have assumptions. However, logistic regression very much does not assume that the residuals are normally distributed or that the variance is constant. Rather, it is assumed that the data are distributed as a binomial, $\mathcal{B}(n_{x_i},p_{x_i})$, that is, with the number of Bernoulli trials equal to the number of observations at that exact set of covariate values and with the probability associated with that set of covariate values. Remember that the variance of a binomial is $np(1-p)$. Thus, if the $n$'s vary at different levels of the covariate, the variances will as well. Further, if any of the covariates are at all related to the response variable, then the probabilities will vary, and thus, so will the variances. These are important facts about logistic regression.
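To make the variance point concrete, here is a minimal sketch that evaluates $n_{x_i} p_{x_i}(1-p_{x_i})$ at a few covariate patterns. The coefficients `beta0` and `beta1` and the `(x, n)` patterns are made-up values chosen only for illustration; they do not come from any fitted model.

```python
import math

def inv_logit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_variance(n, p):
    """Variance of a Binomial(n, p) count: n * p * (1 - p)."""
    return n * p * (1.0 - p)

# Hypothetical coefficients and covariate patterns (assumptions for illustration).
beta0, beta1 = -1.0, 0.8
patterns = [(-2.0, 5), (0.0, 5), (1.25, 5), (1.25, 20), (4.0, 5)]  # (x_i, n_{x_i})

variances = []
for x, n in patterns:
    p = inv_logit(beta0 + beta1 * x)   # probability at this covariate value
    v = binomial_variance(n, p)        # variance implied by the binomial model
    variances.append(v)
    print(f"x={x:5.2f}  n={n:2d}  p={p:.3f}  var={v:.3f}")
```

The two rows with the same `x` but different `n` show the variance changing with the number of trials, and the rows with the same `n` but different `x` show it changing with the probability, so constant variance is not something the model could satisfy in the first place.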

Second, model comparisons are usually performed between models with different specifications (for example, with different sets of covariates included), not between fits to different subsets of the data. To be honest, I am not sure how that would properly be done. With a linear model, you could compare the two $R^2$s to see how much better the fit is with the aberrant data excluded, but this would only be descriptive, and you should know that $R^2$ would have to go up. With logistic regression, however, the standard $R^2$ cannot be used. There are various 'pseudo-$R^2$s' that have been developed to provide similar information, but they are often considered to be flawed and are not often used. For an overview of the different pseudo-$R^2$s that exist, see here. For some discussion, and criticism, of them, see here. Another possibility might be to jackknife the betas with and without the outliers included to see how much excluding them stabilizes their sampling distributions. Once again, this would only be descriptive (i.e., it wouldn't constitute a test to tell you which model--er, subset of your data--to prefer), and the variance would have to go down. These things are true, for both the pseudo-$R^2$s and the jackknifed distributions, because you selected the data to exclude based on the fact that they appear extreme.
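The jackknife idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the two aberrant points are planted by hand, the `flagged` indices stand in for whatever rule you actually used (such as Cook's distance above 4/n), and `fit_logistic`, `jackknife_slopes`, and `jackknife_se` are hypothetical helper names written for this example, not functions from any library.

```python
import math
import random

def inv_logit(eta):
    eta = max(-30.0, min(30.0, eta))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Fit P(y=1) = inv_logit(b0 + b1*x) by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = inv_logit(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:           # information matrix is (near) singular
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def jackknife_slopes(xs, ys):
    """Leave-one-out slope estimates: refit with each observation removed in turn."""
    return [fit_logistic(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])[1]
            for i in range(len(xs))]

def jackknife_se(estimates):
    """Jackknife standard error: sqrt((n-1)/n * sum of squared deviations)."""
    n = len(estimates)
    mean = sum(estimates) / n
    return math.sqrt((n - 1) / n * sum((e - mean) ** 2 for e in estimates))

# Simulated data (an assumption): true slope 1, plus two planted aberrant points.
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(60)]
ys = [1 if rng.random() < inv_logit(x) else 0 for x in xs]
xs += [5.0, -5.0]
ys += [0, 1]
flagged = {60, 61}  # hypothetical: indices your influence diagnostic picked out

keep = [i for i in range(len(xs)) if i not in flagged]
slopes_all = jackknife_slopes(xs, ys)
slopes_trimmed = jackknife_slopes([xs[i] for i in keep], [ys[i] for i in keep])

print("jackknife SE of slope, all data:     ", jackknife_se(slopes_all))
print("jackknife SE of slope, flagged removed:", jackknife_se(slopes_trimmed))
```

Comparing the two printed standard errors shows how much the flagged points contribute to the instability of the slope. As noted above, this is descriptive only: the points were dropped because they looked extreme, so a tighter distribution afterwards is what you should expect, not evidence that the trimmed fit is the better model.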
