Covariance – Robust Covariance and OGK Outlier Detection in Multivariate Analysis

covariance-matrix, machine-learning, multivariate-analysis, outliers, robust

I'm calculating the robust covariance of a data set in order to use the Mahalanobis distance for outlier detection. There are several methods to estimate the covariance matrix that goes into that equation. Using the Fast-MCD method I got non-deterministic results, which led me to use the OGK method (Orthogonalized Gnanadesikan-Kettenring), which produced deterministic results.
I'm trying to understand the difference between Fast-MCD and OGK, and specifically how the OGK method works.
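(For context, not part of the answer below: a minimal sketch of the Mahalanobis-distance workflow in the question, using scikit-learn's MinCovDet, which implements Fast-MCD. The toy data, the 97.5% chi-square cutoff, and the fixed random_state are illustrative assumptions; OGK itself is not shipped in scikit-learn, R's robustbase::covOGK being a standard implementation.)

```python
# Sketch: robust-Mahalanobis outlier detection with Fast-MCD (MinCovDet).
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:10] += 6.0                            # plant a few obvious outliers

mcd = MinCovDet(random_state=0).fit(X)   # fixing random_state makes Fast-MCD reproducible
d2 = mcd.mahalanobis(X)                  # squared robust Mahalanobis distances

# Flag points beyond the 97.5% chi-square quantile (p degrees of freedom)
cutoff = chi2.ppf(0.975, df=X.shape[1])
print(np.flatnonzero(d2 > cutoff))
```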

Best Answer

To justify the OGK, you need to see the following chain:

  • Outlier detection and robust estimation are essentially equivalent problems,
  • At its core, outlier detection is related to the problem of (center-outward) ranking of the data,
  • Ranking is much more complicated when you have several variables than if you only have one.

For these reasons, univariate robust estimators are much simpler (including computationally) than multivariate ones.

The idea of the OGK is to construct a multivariate estimator of scatter (like the classical covariance matrix but robust to outliers) by combining $O(p^2)$ univariate outlier-robust estimates of scale (where $p$ is the number of variables in your data).


Intermezzo:

For the classical covariance matrix $\pmb V(\pmb X)$ of a data matrix $\pmb X$ (with columns $\{X_i\}_{i=1}^p$), it holds that:

$$\pmb V_{ii}(\pmb X) = s^2(X_i)$$

--where $s^2(X_i)$ is the classical univariate estimator of variance applied to the vector $X_i$-- and

$$\pmb V_{ij}(\pmb X) = \frac{1}{4}\left(s^2\left(X_i+X_j\right)-s^2\left(X_i-X_j\right)\right)$$
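This identity can be checked numerically (a quick sanity check of my own, not part of the original answer), with numpy's sample variance playing the role of $s^2(\cdot)$:

```python
# Check the Gnanadesikan-Kettenring identity:
# Cov(Xi, Xj) = ( Var(Xi + Xj) - Var(Xi - Xj) ) / 4
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)
xi, xj = X[:, 0], X[:, 1]

s2 = lambda v: np.var(v, ddof=1)              # classical sample variance
gk_cov = 0.25 * (s2(xi + xj) - s2(xi - xj))   # covariance via the identity
direct = np.cov(xi, xj, ddof=1)[0, 1]         # direct sample covariance

print(np.isclose(gk_cov, direct))             # True
```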


The OGK is based on two ideas.

The first idea is to use the identities above to construct a robust scatter matrix $\pmb C$ by substituting $s^2(\cdot)$ with $m^2(\cdot)$, where $m(\cdot)$ is a robust estimator of scale (for example the MAD).
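A sketch of this first idea (my own illustration, not code from the original answer), with scipy's normalized MAD standing in for $m(\cdot)$:

```python
# First idea: plug a robust scale m() (here the normalized MAD) into the
# pairwise identities above to get a raw scatter matrix C.
import numpy as np
from scipy.stats import median_abs_deviation

def gk_scatter(X, m=lambda v: median_abs_deviation(v, scale="normal")):
    """Pairwise Gnanadesikan-Kettenring scatter; not guaranteed to be PSD."""
    n, p = X.shape
    C = np.empty((p, p))
    for i in range(p):
        C[i, i] = m(X[:, i]) ** 2
        for j in range(i + 1, p):
            cij = 0.25 * (m(X[:, i] + X[:, j]) ** 2 - m(X[:, i] - X[:, j]) ** 2)
            C[i, j] = C[j, i] = cij
    return C

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
X[:15] += 8.0                      # contamination barely moves the MAD-based entries
print(gk_scatter(X).round(2))
```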

The resulting matrix is only guaranteed to be positive semi-definite when $m()=s()$; in general, $\pmb C$ is not positive semi-definite, which is an annoying flaw for an estimator of scatter. The second idea is therefore to apply an additional step to $\pmb C$, called the orthogonalization step in the OGK, to find a positive semi-definite matrix that is, in some sense, 'close to $\pmb C$':

  • Rescale your data column-wise: $\pmb Y = \pmb D^{-1}\pmb X$, where $\pmb D = \text{diag}(m(X_1),\ldots,m(X_p))$;
  • project the rescaled data on $\pmb E$, the matrix of eigenvectors of $\pmb C(\pmb Y)$ (the robust scatter matrix obtained by applying the formulas above to the rescaled data $\pmb Y$), i.e. $\pmb T=\pmb E^{\top}\pmb Y$;
  • compute a diagonal matrix of 'robust variances' $\pmb\Lambda=\text{diag}(m^2(T_1),\ldots,m^2(T_p))$;
  • set $\pmb\mu(\pmb X)=\pmb A\pmb\nu$, where $\pmb\nu=(\nu(T_1),\ldots,\nu(T_p))$ and $\nu(\cdot)$ is a robust univariate estimator of location (for example the median), and compute the positive semi-definite matrix $\pmb\varSigma(\pmb X)=\pmb A \pmb\Lambda \pmb A^{\top}$, where $\pmb A= \pmb D\pmb E$.

Then, $\pmb\mu(\pmb X)$ and $\pmb\varSigma(\pmb X)$ are called respectively the raw OGK estimates of location and scatter. $\pmb\varSigma(\pmb X)$ is now guaranteed to be positive semi-definite. Moreover, it is in some sense close to $\pmb C(\pmb X)=\pmb D\pmb C(\pmb Y)\pmb D$.
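Putting the two ideas together, here is a minimal numpy sketch of the raw OGK computation described above (my own illustration; the MAD and the coordinate-wise median are illustrative choices for $m(\cdot)$ and $\nu(\cdot)$, the code uses an $n\times p$ data matrix so the projection is written $\pmb Y\pmb E$, and the reweighting refinement used by the full estimator is omitted):

```python
# Raw OGK estimates of location and scatter (sketch of the steps above).
import numpy as np
from scipy.stats import median_abs_deviation

def _mad(v):
    return median_abs_deviation(v, scale="normal")       # robust scale m()

def _gk(u, v):
    # robust 'covariance' of two columns via the GK identity
    return 0.25 * (_mad(u + v) ** 2 - _mad(u - v) ** 2)

def raw_ogk(X):
    n, p = X.shape
    d = np.array([_mad(X[:, j]) for j in range(p)])       # robust scales, diag of D
    Y = X / d                                             # column-wise rescaling, Y = X D^{-1}
    C = np.empty((p, p))                                  # robust scatter of Y (maybe not PSD)
    for i in range(p):
        C[i, i] = _mad(Y[:, i]) ** 2
        for j in range(i + 1, p):
            C[i, j] = C[j, i] = _gk(Y[:, i], Y[:, j])
    _, E = np.linalg.eigh(C)                              # eigenvectors of C(Y)
    T = Y @ E                                             # projected (orthogonalized) data
    lam = np.array([_mad(T[:, j]) ** 2 for j in range(p)])  # robust variances of T
    A = np.diag(d) @ E
    mu = A @ np.median(T, axis=0)                         # robust location, mapped back
    sigma = A @ np.diag(lam) @ A.T                        # positive semi-definite scatter
    return mu, sigma

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.4, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X[:25] += 10.0
mu, sigma = raw_ogk(X)
print(mu.round(2)); print(sigma.round(2))
```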

The raw OGK estimates of location and scatter trivially inherit the breakdown point (a measure of robustness of an estimator to the presence of outliers in the data) of the measure of scale used in the computation of $\pmb C$ (for example, when $m(\cdot)$ is the MAD, the breakdown point is about 50%, the highest attainable).

You asked for a comparison with another 50%-breakdown-point algorithm for robustly estimating the scatter matrix, Fast-MCD. Contrary to Fast-MCD, the OGK is a deterministic algorithm. Also contrary to Fast-MCD, the OGK is not affine equivariant.
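To see the determinism point concretely (a small illustration under assumed scikit-learn usage, not part of the original answer): Fast-MCD starts from randomly chosen subsets, so two fits with different seeds may return slightly different scatter matrices, whereas the OGK computation sketched above has no random ingredient.

```python
# Fast-MCD depends on random starting subsets; different seeds may give
# (slightly) different estimates, while OGK has no random component.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
X[:20] += 5.0

cov_a = MinCovDet(random_state=1).fit(X).covariance_
cov_b = MinCovDet(random_state=2).fit(X).covariance_
print(np.max(np.abs(cov_a - cov_b)))   # may be non-zero: the two fits can differ
```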
