I am looking for documents and examples on multivariate outlier detection with the robust (minimum covariance determinant) Mahalanobis distance. I have 6 variables and also want to plot them to show the outliers. Do you have any sources?
Here is my code with its output, but I think something is going wrong: I have over 2 million cases, yet it seems to have used only n = 500.
> CovMcd(new)
Call:
CovMcd(x = new)
-> Method: Fast MCD(alpha=0.5 ==> h=1342195); nsamp = 500; (n,k)mini = (300,5)
Robust Estimate of Location:
logdgr logtr lph lpm lpr
4.391 2.956 -2.722 -4.802 -4.802
Robust Estimate of Covariance:
logdgr logtr lph lpm lpr
logdeger 1.0183 0.8981 0.6427 0.7112 0.7113
loghacim 0.8981 1.0173 0.9613 0.9539 0.9541
lph 0.6427 0.9613 1.6921 1.3770 1.3772
lpd 0.7112 0.9539 1.3770 1.3085 1.3087
lpr 0.7113 0.9541 1.3772 1.3087 1.3089
> summary(mcd)
Call:
CovMcd(x = data)
Robust Estimate of Location:
[1] 3.5 2.0
Robust Estimate of Covariance:
[,1] [,2]
[1,] 3.5 0.8
[2,] 0.8 0.8
Eigenvalues of covariance matrix:
[1] 3.7192 0.5808
Robust Distances:
[1] 2.0833 0.8333 2.0833 2.0833 0.8333 2.0833
> mest <- CovMest(new)
> show(mcd)
Call:
CovMcd(x = data)
-> Method: Fast MCD(alpha=0.5 ==> h=4); nsamp = 500; (n,k)mini = (300,5)
Robust Estimate of Location:
[1] 3.5 2.0
Robust Estimate of Covariance:
[,1] [,2]
[1,] 3.5 0.8
[2,] 0.8 0.8
Best Answer
What you are trying to do stems from the following basic idea:
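Compare each robust distance $rd(x_{i})$ to $\sqrt{\chi^2_{p,.975}}$, where $p$ is the number of variables and $\chi^2_{p,.975}$ is the 0.975 quantile of the chi-square distribution with $p$ degrees of freedom. Declare $x_i$ an outlier if $$rd(x_{i}) > \sqrt{\chi^2_{p,.975}}$$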
The robust distances are given by: $$rd(x_i) = \sqrt{(x_i - \mu_{mcd})' S_{mcd}^{-1} (x_i - \mu_{mcd})}$$ where $\mu_{mcd}$ and $S_{mcd}$ are the robust MCD estimates of location (mean vector) and scatter (covariance matrix) respectively.
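As a minimal sketch of the rule and the distance formula above, the following R code computes the robust distances by hand and applies the cutoff. It assumes the rrcov package is installed; the simulated matrix x, the ten planted outliers, and all variable names are illustrative and not taken from the question.

```r
library(rrcov)  # provides CovMcd(), getCenter(), getCov()

set.seed(1)
# Simulated stand-in for the real data: 200 clean rows plus 10 planted outliers
p <- 3
clean   <- matrix(rnorm(200 * p), ncol = p)
planted <- matrix(rnorm(10 * p, mean = 10), ncol = p)
x <- rbind(clean, planted)

mcd    <- CovMcd(x)        # robust MCD fit (default alpha = 0.5)
mu_mcd <- getCenter(mcd)   # robust location estimate
S_mcd  <- getCov(mcd)      # robust scatter estimate

# mahalanobis() returns SQUARED distances, so take the square root
rd <- sqrt(mahalanobis(x, center = mu_mcd, cov = S_mcd))

# Square root of the 0.975 chi-square quantile with p degrees of freedom
cutoff <- sqrt(qchisq(0.975, df = p))
is_outlier <- rd > cutoff

which(is_outlier)          # row indices flagged as outliers
```

For the five variables shown in the question's output, sqrt(qchisq(0.975, df = 5)) is about 3.58. A quick visual check is plot(rd) followed by abline(h = cutoff), which draws the distances by row index with the cutoff as a horizontal line.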
So what you need to do is:
A. Get the robust MCD estimates of location ($\mu_{mcd}$) and scatter ($S_{mcd}$) for your data. You're on the right track with the CovMcd(x) function in R (from the rrcov package). Note that this function uses the FastMCD algorithm by default, which starts from many random subsets of your data and refines each one into a subset of size $h$ to make its estimates. By default it uses 500 such random starts, which is why you see nsamp = 500 in the output. The function is not taking 500 observations out of your data: the h=1342195 in your output shows that, with alpha=0.5, the estimate is based on roughly half of your rows. See Peter J. Rousseeuw & Katrien Van Driessen (1999), A Fast Algorithm for the Minimum Covariance Determinant Estimator, Technometrics, 41:2, 212-223 for more details.
B. Get the robust distance for each row or observation using the formula for $rd(x_i)$ given above.
C. Compare each distance $rd(x_i)$ to $\sqrt{\chi^2_{p,.975}}$ and declare $x_i$ an outlier if $rd(x_{i}) > \sqrt{\chi^2_{p,.975}}$.
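To check the point in step A that nsamp counts random starts and not observations, one can fit the same data with two different nsamp values and compare the subset size h. This is only a sketch on simulated data, assuming the rrcov package is installed; the slot name quan (the size of the subset the MCD is based on) is used as I recall it from the rrcov documentation, so confirm it with slotNames(fit_500) on your installation.

```r
library(rrcov)

set.seed(2)
x <- matrix(rnorm(5000 * 4), ncol = 4)  # simulated data: 5000 rows, 4 columns

fit_500 <- CovMcd(x)               # default: nsamp = 500 random starts
fit_50  <- CovMcd(x, nsamp = 50)   # fewer random starts, same data

# h (slot 'quan', assumed name) depends on n, p and alpha, not on nsamp
h_values <- c(n        = nrow(x),
              h_nsamp_500 = fit_500@quan,
              h_nsamp_50  = fit_50@quan)
h_values
```

If the two h values agree and both are far above 500, that matches the reading that nsamp controls only how many starting subsets the algorithm tries.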
Example R code is provided below: