Solved – Cook’s distance cut-off value

cooks-distance, outliers

I have been reading about Cook's distance as a way to identify outliers that have high influence on my regression. In Cook's original study he suggests that a cut-off value of 1 can be used to identify influential points. However, various other studies use $\frac{4}{n}$ or $\frac{4}{n-k-1}$ as a cut-off.

In my study, none of my observations has a D higher than 1. However, if I use $\frac{4}{n}$ as a cutoff $(\frac{4}{149} = .026)$, then several data points are flagged as influential. I decided to test whether removing these data points would make a difference to my general linear regression. All my IVs retained their significance, and no obvious change was apparent.
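To make the comparison concrete, here is a rough sketch of how the cut-offs can be checked in Python with statsmodels. The data below are simulated stand-ins (the predictor count k = 2 is an assumption), so only the mechanics carry over to my actual dataset:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: n = 149 observations, k = 2 predictors (assumed).
rng = np.random.default_rng(0)
n, k = 149, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
D = fit.get_influence().cooks_distance[0]  # Cook's D for each observation

# Count how many points each rule flags as influential.
print("flagged by D > 1:        ", int(np.sum(D > 1)))
print("flagged by D > 4/n:      ", int(np.sum(D > 4 / n)))
print("flagged by D > 4/(n-k-1):", int(np.sum(D > 4 / (n - k - 1))))
```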

Should I retain all my data points and use the cut-off value of 1? Or remove them?

Best Answer

I would probably go with your original model with your full dataset. I generally think of these things as facilitating sensitivity analyses. That is, they point you towards what to check to ensure that you don't have a given result only because of something stupid. In your case, you have some potentially influential points, but if you rerun the model without them, you get substantively the same answer (at least with respect to the aspects that you presumably care about). In other words, use whichever threshold you like; you are only refitting the model as a check, not treating the refit as the 'true' version. If you think that other people will be sufficiently concerned about the potential outliers, you could report both model fits. What you would say is something along the lines of:

Here are my results. One might be concerned that this picture only emerges due to a couple of unusual, but highly influential, observations. These are the results of the same model without those observations. There are no substantive differences.
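Producing that side-by-side comparison is straightforward. As a minimal sketch, assuming a statsmodels workflow and simulated stand-in data (n = 149 and the $\frac{4}{n}$ rule, as in the question):

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data for the 149-observation dataset in the question.
rng = np.random.default_rng(0)
n, k = 149, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

full = sm.OLS(y, X).fit()
D = full.get_influence().cooks_distance[0]

# Sensitivity check: refit without the points flagged by the 4/n rule.
keep = D <= 4 / n
subset = sm.OLS(y[keep], X[keep]).fit()

print("coefficients (full):  ", np.round(full.params, 3))
print("coefficients (subset):", np.round(subset.params, 3))
print("p-values (full):      ", np.round(full.pvalues, 4))
print("p-values (subset):    ", np.round(subset.pvalues, 4))
```

If the two sets of coefficients and p-values tell the same story, that is precisely the "no substantive differences" claim from the quoted write-up.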

It is also possible to remove them and use the second model as your primary result. After all, staying with the original dataset amounts to an assumption about which data belong in the model just as much as going with the subset does. But people are likely to be skeptical of results reported that way: psychologically, it is too easy for someone to convince themselves, without any actual corrupt intent, to go with the set of post-hoc tweaks (such as dropping some observations) that gives them the result they most expected to see. By always going with the full dataset, you preempt that possibility and assure people (say, reviewers) that that isn't what's going on in your project.

Another issue here is that people end up 'chasing the bubble': when you drop some potential outliers and rerun your model, the new results often flag different observations as potential outliers. How many iterations are you supposed to go through? The standard response is to stay with your original, full dataset and run a robust regression instead. This, again, can be understood as a sensitivity analysis.
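As a minimal sketch of that alternative, again assuming statsmodels and the same kind of simulated stand-in data (Huber's M-estimator is one common choice, an assumption here rather than anything from the question):

```python
import numpy as np
import statsmodels.api as sm

# Same simulated stand-in setup as above.
rng = np.random.default_rng(0)
n, k = 149, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

ols = sm.OLS(y, X).fit()

# Robust regression (Huber's M-estimator) keeps the full dataset but
# down-weights influential observations instead of deleting them.
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS coefficients:   ", np.round(ols.params, 3))
print("Robust coefficients:", np.round(rlm.params, 3))
```

The robust fit keeps every observation but down-weights the unusual ones, so there is no bubble to chase.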
