Multivariate Analysis – Understanding Mahalanobis Distance Distribution of Multivariate Normal Points

distance-functions, multivariate-analysis, outliers, python

I'm trying to understand the properties of the Mahalanobis distance of multivariate random points (my final goal is to use the Mahalanobis distance for outlier detection). My calculations are in Python.
I'm missing some basics here and would be glad if someone could explain my mistake.

Here is my code:

First, I generate 1000 multivariate (2D) random points around the origin:

#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
import matplotlib.pyplot as plt
chi2 = stats.chi2
np.random.seed(111)


#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = np.array([[1., 0.], [0., 1.]])
circle = np.random.multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
#scipy's mahalanobis expects the *inverse* covariance matrix as its third argument
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], np.linalg.inv(covCircle))
d = np.array([mahalanobis(p) for p in circle]) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared

At this point I expect d2 (the squared Mahalanobis distances) to follow a chi-squared distribution with 2 degrees of freedom.
I would like to calculate the corresponding probability density function (PDF) values and use them for outlier detection.
To illustrate my problem, I plot the PDF of a chi-squared distribution with 2 degrees of freedom for values between 0 and 10, with location and scale set to the mean and standard deviation of d2, respectively:

theMean = np.mean(d2)
theStd = np.std(d2)
x = np.linspace(0, 10)
pdf = chi2.pdf(x, 2, theMean, theStd) #loc=theMean, scale=theStd
plt.plot(x, pdf, '-b')
plt.show()

Here is the plot:

[Plot of the computed PDF: it is essentially zero for x values below ~2]
As we can see, the PDF values corresponding to x values below ~2.04 are negligibly low, which would suggest that no points have a squared Mahalanobis distance below ~2.04, i.e. that no points lie that close to the origin. That is obviously not true. What am I missing?

EDIT
Why am I using theMean and theStd as arguments to the pdf function? As far as I understand, the squared distances are expected to follow a chi-squared distribution with the given mean and standard deviation, which is why I pass them as arguments to the PDF. From a comment to this question I understand that this is not true. So what are the correct parameters of the theoretical chi-squared distribution?

Best Answer

Your call to stats.chi2 is indeed incorrect. When you map your data through the squared Mahalanobis distance, it is theoretically $\chi^2_2$ distributed, so you do not need to play with the loc and scale parameters in the stats.chi2 call (but do keep df=2, as you did). Here's my modified code, plus a pretty visualization of outlier detection.
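To see why the default loc=0, scale=1 is correct: with $z = \Sigma^{-1/2}(x - \mu)$ being standard bivariate normal, the squared distance $d^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu) = z^\top z$ is a sum of two squared independent standard normals, which is exactly $\chi^2_2$; no shifting or rescaling by the sample mean and standard deviation is needed. A minimal sketch of the corrected density comparison (reusing the d2 array from the question's code) could look like this:

#overlay the plain chi2(df=2) density on a histogram of the squared distances
x = np.linspace(0, 10, 200)
plt.hist(d2, bins=50, density=True, alpha=0.5, label="observed $d^2$")
plt.plot(x, chi2.pdf(x, 2), '-b', label=r"$\chi^2_2$ pdf")  #default loc=0, scale=1
plt.legend()
plt.show()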

I should mention this is somewhat cheating. We used the true mean and true covariance matrix to find outliers. In practice, we only have estimates (which depend on the outliers themselves!), hence we require more robust mean and covariance estimates; see this great paper.
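One readily available option is the Minimum Covariance Determinant estimator in scikit-learn; the sketch below (assuming scikit-learn is installed and reusing the circle array from above) shows how it could be plugged in, and is not the exact method from the paper:

#robust alternative: estimate location/covariance with MCD instead of using the true values
from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=111).fit(circle)
robust_d2 = mcd.mahalanobis(circle)        #squared Mahalanobis distances under the MCD fit
outliers = robust_d2 > chi2.ppf(0.995, 2)  #flag points beyond the 99.5% quantile
print("flagged", outliers.sum(), "of", len(circle), "points")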

Data above the 99.5% quantile line is shown in red below.

[Figures: the $\chi^2$ test for outliers; observed data, with red points considered outliers]

#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
import matplotlib.pyplot as plt
chi2 = stats.chi2
np.random.seed(111)


#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = np.array([[1, 0.], [0., 1.]])
circle = np.random.multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
#scipy's mahalanobis expects the *inverse* covariance matrix as its third argument
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], np.linalg.inv(covCircle))
d = np.array([mahalanobis(p) for p in circle]) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared

degrees_of_freedom = 2

threshold = chi2.ppf(0.995, degrees_of_freedom)  #99.5% quantile used to flag outliers

x = np.arange(len(d2))
colors = np.where(d2 > threshold, "red", "blue")  #points above the 99.5% quantile are drawn in red

plt.scatter(x, d2, c=colors)

plt.hlines(chi2.ppf(0.95, degrees_of_freedom), 0, len(d2), label=r"95% $\chi^2$ quantile", linestyles="solid")
plt.hlines(chi2.ppf(0.975, degrees_of_freedom), 0, len(d2), label=r"97.5% $\chi^2$ quantile", linestyles="dashed")
plt.hlines(threshold, 0, len(d2), label=r"99.5% $\chi^2$ quantile", linestyles="dotted")

plt.legend()
plt.ylabel("squared Mahalanobis distance")
plt.xlabel("observation")
plt.title(r'Detection of outliers at different $\chi^2$ quantiles')

plt.show()
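As a quick sanity check on the quantile lines: under the theoretical $\chi^2_2$ model, roughly 5%, 2.5%, and 0.5% of points should land beyond the 95%, 97.5%, and 99.5% quantiles, respectively. The short snippet below (reusing d2 and degrees_of_freedom from above) prints the observed fractions:

#fraction of points beyond each chi2 quantile; should be close to 0.05, 0.025, 0.005
for q in (0.95, 0.975, 0.995):
    frac = np.mean(d2 > chi2.ppf(q, degrees_of_freedom))
    print(f"beyond the {q:.1%} quantile: {frac:.3f} of points")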