Solved – Cook’s distance in detecting outliers

Tags: cooks-distance, group-differences, mixed model, outliers

According to my understanding, Cook's distance measures the influence of each observation by refitting the model with that observation excluded. So I assume it could be a reasonable approach for outlier detection?

My question: assuming the data are categorized into groups, is it possible to use Cook's distance to detect an "outlier" group instead of an outlier point? Is Cook's distance a good choice for measuring group influence?

Best Answer

As you said, Cook’s Distance measures the change in the regression when each individual point is removed. If things change quite a bit by the omission of a single point, then that point was having a lot of influence on your model. Define $\hat{Y}_{j(i)}$ to be the fitted value for the jth observation when the ith observation is deleted from the data set. Cook’s Distance measures how much deleting observation $i$ changes all the predictions.

$$D_i = \frac{\sum_{j=1}^{n}(\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, MSE} = \frac{e_i^2}{p \, MSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]$$

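Here $e_i$ is the residual of the ith observation, $h_{ii}$ is its leverage (the ith diagonal element of the hat matrix), $p$ is the number of regression parameters, and $MSE$ is the mean squared error of the full fit. As a rule of thumb, $D_i \geq 1$ is extreme (for small to medium datasets).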
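To check that the two forms of $D_i$ agree, here is a minimal sketch in plain Python for a simple straight-line regression. Everything in it is an assumption made for illustration: the data are made up, the helper names (fit_line, cooks_by_deletion, cooks_closed_form) are hypothetical, $p = 2$ because a line has an intercept and a slope, and MSE is taken as the residual sum of squares divided by $n - p$.

```python
# Illustrative sketch only: made-up data, hypothetical helper names.
P = 2  # parameters of a straight line: intercept and slope


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def cooks_by_deletion(x, y):
    """D_i from the definition: refit without case i, compare all n fitted values."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - P)
    distances = []
    for i in range(n):
        # Refit with the ith observation left out.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fitted[j] - (a0 + a1 * x[j])) ** 2 for j in range(n))
        distances.append(shift / (P * mse))
    return distances


def cooks_closed_form(x, y):
    """D_i from residuals e_i and leverages h_ii, with no refitting."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - P)
    # Leverage of case i in simple linear regression.
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e ** 2 / (P * mse) * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]


# Made-up data: the last point sits far out in x and off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]

d_def = cooks_by_deletion(x, y)
d_formula = cooks_closed_form(x, y)
flagged = [i for i, d in enumerate(d_formula) if d >= 1]  # rule of thumb D_i >= 1

print(d_def)
print(d_formula)
print(flagged)
```

The two lists should match up to floating-point rounding, which is why the closed form is the one used in practice: it needs only the single full fit. With these made-up numbers only the last point (x = 10) crosses the $D_i \geq 1$ threshold.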

Cook’s Distance shows the effect of the ith case on all the fitted values. Note that the ith case can be influential because of

big $e_i$ and moderate $h_{ii}$
moderate $e_i$ and big $h_{ii}$
big $e_i$ and big $h_{ii}$
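The three combinations above can be made concrete by plugging numbers into the closed form $D_i = \frac{e_i^2}{p \, MSE} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$. The sketch below is illustrative only: the residuals, leverages, $p = 2$ and $MSE = 1$ are made-up values, and cooks_distance is a hypothetical helper name.

```python
# Illustrative sketch only: all numbers below are made up.
def cooks_distance(e, h, p, mse):
    """Closed-form Cook's distance from residual e and leverage h (0 <= h < 1)."""
    return e ** 2 / (p * mse) * h / (1 - h) ** 2


P, MSE = 2, 1.0  # assumed number of parameters and mean squared error

# (residual e_i, leverage h_ii) for a baseline and the three combinations
cases = {
    "small e, moderate h (baseline)": (1.0, 0.2),
    "big e, moderate h": (3.0, 0.2),
    "moderate e, big h": (1.0, 0.8),
    "big e, big h": (3.0, 0.8),
}

results = {name: cooks_distance(e, h, P, MSE) for name, (e, h) in cases.items()}

for name, d in results.items():
    print(name, round(d, 3), "extreme" if d >= 1 else "ok")
```

The residual enters squared, while the leverage enters through $h_{ii}/(1-h_{ii})^2$, which grows quickly as $h_{ii}$ approaches 1. That is why a point with only a moderate residual can still be highly influential when its leverage is large.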

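In R, use cooks.distance(model) from the built-in stats package. The function influence.measures(model), also in stats, reports Cook’s Distance together with other influence diagnostics.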
