Solved – Understanding Cook’s Distance

cooks-distance, high-dimensional, outliers, r

I'm trying to use Cook's distance to detect outliers in high-dimensional datasets.

However, I've run into trouble doing so. Usually, once I've built the linear model and computed Cook's distance, all I get is a vector full of NaN values.

I've created a fictional example to show my problem.

set.seed(100)

data <- as.data.frame(cbind(Class = sample(c(1, 2), 100, replace = TRUE),
                            matrix(runif(10000, min = -10, max = 100), nrow = 100)))

With this dataset, I've computed Cook's distance using two different subsets of features: first with all of them, then with only the first 50:

# Full of NaN values
mod <- lm(formula=Class ~ ., data=data)
cooksd <- cooks.distance(mod)
print(cooksd)

and

# Values different from NaN
mod <- lm(formula=Class ~ ., data=data[,c(1:51)])
cooksd <- cooks.distance(mod)
print(cooksd)

So, I don't know whether:

  1. Cook's distance is not suitable for this kind of scenario
  2. A vector full of NaN values means that no point is influential
  3. Cook's distance is sensitive to a high number of features

Best Answer

The problem here is not just that Cook's distance is unsuitable for this kind of scenario: even classical linear regression is unsuitable. You have 100 predictors plus an intercept (101 parameters) but only 100 observations. Therefore, ordinary least squares does not yield a unique solution, and the variances of the parameters and residuals cannot be estimated. You can see that for your example data one parameter cannot even be estimated:

> summary(mod)

Call:
lm(formula = Class ~ ., data = data)

Residuals:
ALL 100 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.951e+00         NA      NA       NA
V2          -8.333e-03         NA      NA       NA
V3          -2.042e-02         NA      NA       NA
V4          -8.088e-03         NA      NA       NA
V5           2.487e-03         NA      NA       NA
V6          -2.148e-04         NA      NA       NA

(90 lines omitted to keep the post readable)

V97         -1.468e-02         NA      NA       NA
V98          1.155e-02         NA      NA       NA
V99          8.612e-03         NA      NA       NA
V100        -1.547e-04         NA      NA       NA
V101                NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 99 and 0 DF,  p-value: NA
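The degrees-of-freedom bookkeeping makes this concrete; rebuilding the question's data and model in base R:

```r
# Rebuild the question's data and inspect the rank and residual degrees of freedom.
set.seed(100)
data <- as.data.frame(cbind(Class = sample(c(1, 2), 100, replace = TRUE),
                            matrix(runif(10000, min = -10, max = 100), nrow = 100)))
mod <- lm(Class ~ ., data = data)

nrow(data)         # 100 observations
mod$rank           # 100: the fit already uses every degree of freedom
df.residual(mod)   # 0: nothing is left to estimate the error variance from
```

With zero residual degrees of freedom there is no information left over to estimate the error variance, which is exactly what the NaN values reflect.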

Due to the same reason, Cook's distance can't be computed. A definition of Cook's distance is:

$$D_i = \frac{e_{i}^{2}}{s^{2} p}\left[\frac{h_{i}}{(1-h_{i})^2}\right]$$

where $$s^{2} \equiv \left( n - p \right)^{-1} \mathbf{e}^{\top} \mathbf{e}$$ is the mean squared error of the regression model, $n$ is the number of observations and $p$ the number of estimated parameters. In your example $n = p = 100$, so $s^2$ is the indeterminate form $0/0$: the mean squared error cannot be estimated, Cook's distance cannot be computed, and R just gives NaN.
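To see exactly where the NaN arises, the formula can be evaluated step by step on a smaller saturated fit with more predictors than observations (the data below is made up purely for illustration):

```r
# Reproduce the NaN by hand: 10 rows, 12 predictors -> a saturated fit.
set.seed(1)
n <- 10
d <- as.data.frame(cbind(y = rnorm(n), matrix(rnorm(n * 12), nrow = n)))
mod <- lm(y ~ ., data = d)

e  <- residuals(mod)                   # all zero: the fit reproduces y exactly
h  <- hatvalues(mod)                   # leverages, all (numerically) 1
p  <- mod$rank                         # 10 parameters estimated from 10 points
s2 <- sum(e^2) / (n - p)               # 0 / 0 -> NaN
D  <- e^2 / (s2 * p) * h / (1 - h)^2   # the NaN propagates to every D_i
```

This mirrors what `cooks.distance()` does internally: as soon as $s^2$ is NaN, every $D_i$ is NaN as well.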

I suggest fitting your model with a different method, such as the lasso or partial least squares, unless you can obtain more observations than parameters.