What you are trying to do stems from the following basic idea:
1. Compute robust distances (using the Minimum Covariance Determinant (MCD) estimate) of each observation $$rd(x_{i}),\ i = 1 \ldots n,\ x_i \in \mathbb{R}^p,$$ where $n$ and $p$ are the number of observations (rows) and variables (columns) respectively.
2. Compare each $rd(x_{i})$ to $\sqrt{\chi^2_{p,.975}}$ and declare $x_i$ an outlier if $$rd(x_{i}) > \sqrt{\chi^2_{p,.975}}.$$
3. The robust distances are given by $$rd(x_i) = \sqrt{(x_i - \mu_{mcd})'S_{mcd}^{-1}(x_i - \mu_{mcd})},$$ where $\mu_{mcd}$ and $S_{mcd}$ are the robust MCD estimates of location (the mean vector) and scatter (the covariance matrix) respectively.
Therefore, what you need to do is:
A. Get the robust MCD estimates of location ($\mu_{mcd}$) and scatter ($S_{mcd}$) for your data. You're on the way with the CovMcd(x)
function in R. Note that this function implements the FastMCD algorithm by default, which repeatedly takes subsamples of your data (each of size $h$) to make its estimates. The default number of subsamples in this function is 500, which is why you're seeing 500 there: the function is not taking 500 observations out of your data, it is drawing 500 subsamples of size $h$ (a snippet illustrating this setting follows the code example below). See Peter J. Rousseeuw & Katrien Van Driessen (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, 41(2), 212-223, for more details.
B. Get the robust distance for each row (observation) using the formula in item 3 above.
C. Compare each distance $rd(x_i)$ to $\sqrt{\chi^2_{p,.975}}$ and declare $x_i$ an outlier if $rd(x_{i}) > \sqrt{\chi^2_{p,.975}}$.
Example R code is provided below:
```r
require(rrcov)
data(hbk)
mcd <- rrcov::CovMcd(hbk[, 1:3]) # use only the first three columns
# get the MCD estimate of location
mean_mcd <- mcd@raw.center
# get the MCD estimate of scatter
cov_mcd <- mcd@raw.cov
# get the inverse of the scatter matrix
cov_mcd_inv <- solve(cov_mcd)
# compute the robust distance of each observation
robust_dist <- apply(hbk[, 1:3], 1, function(x) {
  x <- x - mean_mcd
  sqrt(t(x) %*% cov_mcd_inv %*% x)
})
# set the cutoff using the chi-square distribution
threshold <- sqrt(qchisq(p = 0.975, df = ncol(hbk[, 1:3]))) # df = number of columns
# find outliers
outliers <- which(robust_dist > threshold) # gives the row numbers of the outliers
```
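Regarding the subsampling described in item A: the number of subsamples is controlled by the nsamp argument of CovMcd() (500 by default). A minimal sketch, assuming you simply want to change that setting and make the random search reproducible:

```r
set.seed(123)                                           # fix the RNG so the subsampling is reproducible
mcd_default <- rrcov::CovMcd(hbk[, 1:3])                # nsamp = 500 subsamples (the default)
mcd_more    <- rrcov::CovMcd(hbk[, 1:3], nsamp = 5000)  # draw more subsamples
# both fits still use all n rows of the data; only the search is subsample-based
mcd_default@raw.center
mcd_more@raw.center
```

Note also that the code above uses the raw estimates (@raw.center, @raw.cov); the slots @center and @cov hold the reweighted MCD estimates, which are usually preferred in practice.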
This paper compares some methods in this area. They refer to the Robust PCA approach you linked to as "PCP" (principal components pursuit) and the family of methods you linked to for robust covariance estimation as M-estimators.
They argue that

> PCP is designed for uniformly corrupted coordinates of data, instead of corrupted data points (i.e., outliers), therefore, the comparison with PCP is somewhat unfair for this kind of data

and show that PCP (aka robust PCA) can fail for outlier detection in some cases.
They also discuss three kinds of "enemies of subspace recovery," i.e., different kinds of outliers, and which kinds of methods might do well against each one. Comparing your own outliers with the three kinds of "enemies" discussed there might help you pick an approach.
Best Answer
To justify the OGK, start from the following observation: univariate robust estimators are much simpler than multivariate ones (including computationally).
The idea of the OGK is to construct a multivariate estimator of scatter (like the classical covariance matrix but robust to outliers) by combining $O(p^2)$ univariate outlier-robust estimates of scale (where $p$ is the number of variables in your data).
Intermezzo:
For the classical covariance matrix $\pmb V(\pmb X)$ of a data matrix $\pmb X$ (with columns $\{X_i\}_{i=1}^p$), it holds that:
$$\pmb V_{ii}(\pmb X) = s^2(X_i)$$
--where $s^2(X_i)$ is the classical univariate estimator of variance applied to the vector $X_i$-- and
$$\pmb V_{ij}(\pmb X) = \frac{1}{4}\left(s^2\left(X_i+X_j\right)-s^2\left(X_i-X_j\right)\right)$$
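This identity follows from $s^2(X_i+X_j)=s^2(X_i)+s^2(X_j)+2\pmb V_{ij}$ and $s^2(X_i-X_j)=s^2(X_i)+s^2(X_j)-2\pmb V_{ij}$. A quick numerical check in base R on simulated data:

```r
# check the identity V_ij = (s^2(X_i + X_j) - s^2(X_i - X_j)) / 4 for the sample covariance
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
cov(X)[1, 2]                                       # classical covariance of the two columns
(var(X[, 1] + X[, 2]) - var(X[, 1] - X[, 2])) / 4  # the same value via the identity
```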
The OGK is based on two ideas.
Use the equations above to construct $\pmb C$ (a robust scatter matrix) by replacing $s^2()$ with $m^2()$, a robust estimator of scale (for example the square of the MAD). This is the first idea.
The resulting matrix $\pmb C$ is only guaranteed to be positive semi-definite when $m()=s()$, so in general it is not positive semi-definite. This is an annoying flaw for an estimator of scatter. Therefore, an additional step, in the case of the OGK called the orthogonalization step, is applied to $\pmb C$ to find a positive semi-definite matrix that is, in some sense, 'close to $\pmb C$'. This is the second idea. In outline (following Maronna & Zamar, 2002):

1. Compute $\pmb D=\mathrm{diag}(m(X_1),\dots,m(X_p))$ and standardize the data: $\pmb Y=\pmb X\pmb D^{-1}$.
2. Form $\pmb C(\pmb Y)$ using the substitution rule above and take its spectral decomposition $\pmb C(\pmb Y)=\pmb E\pmb\Lambda\pmb E'$.
3. Project the standardized data onto the eigenvectors, $\pmb Z=\pmb Y\pmb E$, and compute univariate robust estimates column by column: $\pmb\nu=(\mu(Z_1),\dots,\mu(Z_p))'$ and $\pmb\Gamma=\mathrm{diag}(m^2(Z_1),\dots,m^2(Z_p))$, where $\mu()$ is a univariate robust estimator of location (for example the median).
4. Set $\pmb\mu(\pmb X)=\pmb D\pmb E\pmb\nu$ and $\pmb\varSigma(\pmb X)=\pmb D\pmb E\pmb\Gamma\pmb E'\pmb D$.
Then, $\pmb\mu(\pmb X)$ and $\pmb\varSigma(\pmb X)$ are called respectively the raw OGK estimates of location and scatter. $\pmb\varSigma(\pmb X)$ is now guaranteed to be positive semi-definite. Moreover, it is in some sense close to $\pmb C(\pmb X)=\pmb D\pmb C(\pmb Y)\pmb D$.
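To make the construction concrete, here is a minimal sketch of the raw OGK estimates in R, assuming $m()$ is the MAD and $\mu()$ is the median. The helper name raw_ogk is mine, for illustration only; rrcov::CovOgk() provides the full implementation (including a reweighting step).

```r
# minimal raw OGK sketch: m() = mad, mu() = median (illustrative only; see rrcov::CovOgk)
raw_ogk <- function(X) {
  X <- as.matrix(X)
  p <- ncol(X)
  D <- diag(apply(X, 2, mad))              # robust scale of each column
  Y <- X %*% solve(D)                      # standardize the columns
  U <- diag(p)                             # pairwise robust 'correlations'
  for (i in 1:(p - 1)) {
    for (j in (i + 1):p) {
      U[i, j] <- U[j, i] <- (mad(Y[, i] + Y[, j])^2 - mad(Y[, i] - Y[, j])^2) / 4
    }
  }
  E <- eigen(U, symmetric = TRUE)$vectors  # orthogonalization step
  Z <- Y %*% E                             # project onto the eigenvectors
  Gamma <- diag(apply(Z, 2, mad)^2)        # robust variances in the eigenbasis
  nu <- apply(Z, 2, median)                # robust locations in the eigenbasis
  A <- D %*% E
  list(center = as.vector(A %*% nu),       # raw OGK location
       cov    = A %*% Gamma %*% t(A))      # raw OGK scatter: PSD by construction
}
raw_ogk(hbk[, 1:3])                        # hbk was loaded in the earlier example
```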
The raw OGK estimates of location and scatter trivially inherit the breakdown point (a measure of an estimator's robustness to the presence of outliers in the data) of the measure of scale used in the computation of $\pmb C$: for example, the breakdown point is about 50%, the best possible, when $m()$ is the MAD.
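To see what a roughly 50% breakdown point buys you, a small sketch comparing the MAD with the classical standard deviation under heavy contamination (simulated data):

```r
# replace 40% of a standard normal sample with gross outliers
set.seed(42)
x <- c(rnorm(60), rep(1000, 40))
sd(x)   # destroyed by the contamination
mad(x)  # still of the order of the clean part's scale
```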
You asked for a comparison with another 50%-breakdown-point algorithm for robust estimation of the scatter matrix, FastMCD. Contrary to FastMCD, OGK is a deterministic algorithm. Also contrary to FastMCD, OGK is not affine equivariant.
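The determinism difference is easy to see in R (a sketch; CovOgk() and CovMcd() are both in rrcov, and getCov() is the accessor for the estimated scatter):

```r
library(rrcov)
X <- hbk[, 1:3]
# OGK has no random component: repeated calls agree exactly
identical(getCov(CovOgk(X)), getCov(CovOgk(X)))  # TRUE
# FastMCD uses random subsampling: repeated calls may differ unless you fix the seed
c1 <- getCov(CovMcd(X))
c2 <- getCov(CovMcd(X))
all.equal(c1, c2)  # often TRUE on easy data, but not guaranteed in general
```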