Solved – Understanding Cook’s Distance

cooks-distance, high-dimensional, outliers, r

I'm trying to use Cook's distance to detect outliers in high-dimensional datasets.

However, I've run into trouble doing so. Usually, once I've built the linear model and computed Cook's distance, all I get is a vector full of NaN values.

I've created a fictional example to show my problem.

set.seed(100)

data <- as.data.frame(cbind(Class = sample(c(1, 2), 100, replace = TRUE),
                            matrix(runif(10000, min = -10, max = 100), nrow = 100)))

With this dataset, I've computed Cook's distance using two different subsets of features: first with all of them, then with only the first 50:

# Full of NaN values
mod <- lm(formula=Class ~ ., data=data)
cooksd <- cooks.distance(mod)
print(cooksd)

and

# Values different from NaN
mod <- lm(formula=Class ~ ., data=data[,c(1:51)])
cooksd <- cooks.distance(mod)
print(cooksd)

So, I don't know whether:

  1. Cook's distance is not suitable for this kind of scenario
  2. A vector full of NaN values means that no point is influential
  3. Cook's distance is sensitive to a high number of features

Best Answer

The problem here is not just that Cook's distance is unsuitable for this kind of scenario: even classical linear regression is unsuitable. You have 100 predictors plus an intercept (101 parameters) but only 100 observations. Therefore, ordinary least squares does not yield a unique solution, and the variances of the parameters and residuals cannot be estimated. You can see that for your example data one parameter cannot even be estimated:

> summary(mod)

Call:
lm(formula = Class ~ ., data = data)

Residuals:
ALL 100 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.951e+00         NA      NA       NA
V2          -8.333e-03         NA      NA       NA
V3          -2.042e-02         NA      NA       NA
V4          -8.088e-03         NA      NA       NA
V5           2.487e-03         NA      NA       NA
V6          -2.148e-04         NA      NA       NA

(90 lines omitted to keep the post readable)

V97         -1.468e-02         NA      NA       NA
V98          1.155e-02         NA      NA       NA
V99          8.612e-03         NA      NA       NA
V100        -1.547e-04         NA      NA       NA
V101                NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 99 and 0 DF,  p-value: NA
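The degrees-of-freedom bookkeeping makes this concrete; rebuilding the question's data and model in base R:

```r
# Rebuild the question's data and inspect the rank and residual degrees of freedom.
set.seed(100)
data <- as.data.frame(cbind(Class = sample(c(1, 2), 100, replace = TRUE),
                            matrix(runif(10000, min = -10, max = 100), nrow = 100)))
mod <- lm(Class ~ ., data = data)

nrow(data)         # 100 observations
mod$rank           # 100: the fit already uses every degree of freedom
df.residual(mod)   # 0: nothing is left to estimate the error variance from
```

With zero residual degrees of freedom there is no information left over to estimate the error variance, which is exactly what the NaN values reflect.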

Due to the same reason, Cook's distance can't be computed. A definition of Cook's distance is:

$$D_i = \frac{e_{i}^{2}}{s^{2} p}\left[\frac{h_{i}}{(1-h_{i})^2}\right]$$

where $$s^{2} \equiv \left( n - p \right)^{-1} \mathbf{e}^{\top} \mathbf{e}$$ is the mean squared error of the regression model, $n$ is the number of observations and $p$ the number of estimated parameters. In your example $n = p = 100$, so $s^2$ is the indeterminate form $0/0$: the mean squared error cannot be estimated, Cook's distance cannot be computed, and R just gives NaN.
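To see exactly where the NaN arises, the formula can be evaluated step by step on a smaller saturated fit with more predictors than observations (the data below is made up purely for illustration):

```r
# Reproduce the NaN by hand: 10 rows, 12 predictors -> a saturated fit.
set.seed(1)
n <- 10
d <- as.data.frame(cbind(y = rnorm(n), matrix(rnorm(n * 12), nrow = n)))
mod <- lm(y ~ ., data = d)

e  <- residuals(mod)                   # all zero: the fit reproduces y exactly
h  <- hatvalues(mod)                   # leverages, all (numerically) 1
p  <- mod$rank                         # 10 parameters estimated from 10 points
s2 <- sum(e^2) / (n - p)               # 0 / 0 -> NaN
D  <- e^2 / (s2 * p) * h / (1 - h)^2   # the NaN propagates to every D_i
```

This mirrors what `cooks.distance()` does internally: as soon as $s^2$ is NaN, every $D_i$ is NaN as well.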

I suggest fitting your model with a different method, such as the lasso or partial least squares, unless you can obtain more observations than parameters.