What you are trying to do stems from the following basic idea:
- Compute robust distances (using Minimum Covariance Determinant, MCD estimate) of each observation $$rd(x_{i}),\ i = 1 \ldots n,\ x_i \in \mathbb{R}^p, $$ where $n$ and $p$ are the number of observations (rows) and variables (columns) respectively.
- Compare each $rd(x_{i})$ to $\sqrt{\chi^2_{p,.975}}$, and declare $x_i$ an outlier if $$rd(x_{i}) > \sqrt{\chi^2_{p,.975}}$$
The robust distances are given by:
$$rd(x_i) = \sqrt{(x_i - \mu_{mcd} )'S_{mcd}^{-1} (x_i - \mu_{mcd} )}$$ where $ \mu_{mcd}$ and $S_{mcd}$ are the robust MCD estimates of location (mean vector) and scatter (covariance matrix) respectively.
Therefore, what you need to do is:
A. Get the robust MCD estimates of location ($ \mu_{mcd}$) and scatter ($S_{mcd}$) for your data. You're on the way with the `CovMcd(x)`
function in R. Note that this function implements the FastMCD algorithm by default, which repeatedly draws subsamples of your data (each of size $h$) to build the estimates. By default the function draws 500 such subsamples, which is why you're seeing 500 there: it is not removing 500 observations from your data. Rather, it repeatedly draws subsets of size $n_{subsample} = h$ and keeps the candidate whose covariance matrix has the smallest determinant; 500 of these subsamples are used to form the estimates. See Peter J. Rousseeuw & Katrien Van Driessen (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, 41:2, 212-223, for more details.
B. Get the robust distance for each row (observation) using the formula above.
C. Compare each distance $rd(x_i)$ to $\sqrt{\chi^2_{p,.975}}$ and declare $x_i$ an outlier if $rd(x_{i}) > \sqrt{\chi^2_{p,.975}}$.
Example R code is provided below:
```r
require(rrcov)
data(hbk)

# fit the MCD estimator; the default nsamp = 500 controls the number of
# subsamples drawn by the FastMCD algorithm
mcd <- rrcov::CovMcd(hbk[, 1:3]) # use only the first three columns

# raw MCD estimate of location
mean_mcd <- mcd@raw.center
# raw MCD estimate of scatter
cov_mcd <- mcd@raw.cov
# inverse of the scatter matrix
cov_mcd_inv <- solve(cov_mcd)

# compute the robust distance for each observation
robust_dist <- apply(hbk[, 1:3], 1, function(x) {
  x <- x - mean_mcd
  sqrt(t(x) %*% cov_mcd_inv %*% x)
})

# set the cutoff using the chi-square distribution
threshold <- sqrt(qchisq(p = 0.975, df = ncol(hbk[, 1:3]))) # df = number of columns
# find outliers
outliers <- which(robust_dist > threshold) # gives the row numbers of the outliers
```
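As a quick sanity check, you can compare the robust distances with classical (non-robust) Mahalanobis distances on the same data. This is only an illustrative sketch using the same `hbk` data; the point is that the plain mean and covariance are themselves distorted by the outliers, so the classical distances typically flag fewer points (the so-called masking effect):

```r
require(rrcov)
data(hbk)
x <- hbk[, 1:3]

# classical Mahalanobis distances, using the ordinary mean and covariance
classical_dist <- sqrt(mahalanobis(x, center = colMeans(x), cov = cov(x)))

# same chi-square cutoff as before
threshold <- sqrt(qchisq(0.975, df = ncol(x)))

# rows flagged by the non-robust distances; compare with the robust result
which(classical_dist > threshold)
```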
If the data were Gaussian distributed with mean $0$ and unknown covariance $\Sigma$ and we put an inverse-Wishart prior on $\Sigma$,
\begin{align}
\Sigma &\sim \mathcal{W^{-1}}(\Psi, \nu), \\
x &\sim \mathcal{N}(0, \Sigma),
\end{align}
the posterior expectation of $\Sigma$ would be
$$\frac{XX^\top + \Psi}{n + \nu - p - 1},$$
where $n$ is the number of data points and $p$ is the dimensionality of the data. Choosing $\Psi = I$ and $\nu = p + 1$, for example, we would get
$$\frac{XX^\top + I}{n} = C + \frac{1}{n}I = L\left(D + \frac{1}{n}I\right)L^\top,$$
where $C = XX^\top/n$ and $C = LDL^\top$ is its eigendecomposition. A sensible choice for $\epsilon$ therefore might be $1/n$.
You could go one step further and properly estimate the covariance using a normal-inverse-Wishart prior, i.e., taking the uncertainty of the mean into account as well. Derivations for the posterior can be found in (Murphy, 2007).
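The $C + \frac{1}{n}I$ regularization above can be sketched in a few lines of R. This is a toy example, not part of the original answer: the data matrix `X` here holds observations in rows (so $C = X^\top X/n$ in R's convention), and the data are assumed to be zero-mean:

```r
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), nrow = n)   # toy zero-mean data

C <- crossprod(X) / n                 # sample covariance C = X'X / n
eps <- 1 / n                          # the suggested regularizer
C_reg <- C + diag(eps, p)             # C + (1/n) I

# equivalently, shift the eigenvalues in the decomposition C = L D L'
ed <- eigen(C, symmetric = TRUE)
C_reg2 <- ed$vectors %*% diag(ed$values + eps) %*% t(ed$vectors)

max(abs(C_reg - C_reg2))              # the two forms agree up to rounding
```

Either form yields a strictly positive-definite estimate, since every eigenvalue of $C$ is shifted up by $\epsilon > 0$.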
This paper compares some methods in this area. They refer to the Robust PCA approach you linked to as "PCP" (principal components pursuit) and the family of methods you linked to for robust covariance estimation as M-estimators.
They argue, and show, that PCP (aka robust PCA) can fail for outlier detection in some cases.
They also discuss three kinds of "enemies of subspace recovery," i.e., different kinds of outliers, and which kinds of methods deal well with each one. Comparing your own outliers with the three kinds of "enemies" discussed there might help you pick an approach.