Multivariate Analysis – Understanding Mahalanobis Distance Distribution of Multivariate Normal Points

distance-functions, multivariate-analysis, outliers, python

I'm trying to understand the properties of the Mahalanobis distance of multivariate random points (my final goal is to use the Mahalanobis distance for outlier detection). My calculations are in Python.
I'm missing some basics here and would be glad if someone could explain my mistake.

Here is my code:

First, I generate 1000 multivariate (2D) random points around the origin:

#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
import matplotlib.pyplot as plt
chi2 = stats.chi2
np.random.seed(111)


#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = np.array([[1., 0.], [0., 1.]])
circle = np.random.multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
#scipy's mahalanobis expects the *inverse* covariance matrix as its third argument
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], np.linalg.inv(covCircle))
d = np.array([mahalanobis(p) for p in circle]) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared

At this point I expect d2 (the squared Mahalanobis distances) to follow a chi-squared distribution with 2 degrees of freedom.
I would like to calculate the corresponding probability density function (PDF) values and use them for outlier detection.
To illustrate my problem, I plot the PDF of a chi-squared distribution with 2 degrees of freedom for values between 0 and 10, with location and scale set to the mean and standard deviation of d2, respectively:

theMean = np.mean(d2)
theStd = np.std(d2)
x = np.linspace(0, 10)
pdf = chi2.pdf(x, 2, theMean, theStd) #loc=theMean, scale=theStd
plt.plot(x, pdf, '-b')
plt.show()

Here is the plot:

[Plot of the computed PDF: it is essentially zero for x values below ~2]
As we can see, the PDF values corresponding to x values below ~2.04 are negligibly low, which would suggest that no points have a squared Mahalanobis distance below ~2.04, i.e. that no points lie that close to the origin. That is obviously not true. What am I missing?

EDIT
Why am I using theMean and theStd as arguments to the pdf function? As far as I understand, the squared distances are expected to follow a chi-squared distribution with the given mean and standard deviation, which is why I pass them as arguments to the PDF. From a comment to this question I understand that this is not true. So what are the correct parameters of the theoretical chi-squared distribution?

Best Answer

Your call to stats.chi2 is indeed incorrect. When you map your data through the squared Mahalanobis distance, it is theoretically $\chi^2_2$ distributed, so you do not need to play with the loc and scale parameters in the stats.chi2 call (but do keep df=2, as you did). Here's my modified code, plus a pretty visualization of outlier detection.
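To see why the default loc=0, scale=1 is correct: with $z = \Sigma^{-1/2}(x - \mu)$ being standard bivariate normal, the squared distance $d^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu) = z^\top z$ is a sum of two squared independent standard normals, which is exactly $\chi^2_2$; no shifting or rescaling by the sample mean and standard deviation is needed. A minimal sketch of the corrected density comparison (reusing the d2 array from the question's code) could look like this:

#overlay the plain chi2(df=2) density on a histogram of the squared distances
x = np.linspace(0, 10, 200)
plt.hist(d2, bins=50, density=True, alpha=0.5, label="observed $d^2$")
plt.plot(x, chi2.pdf(x, 2), '-b', label=r"$\chi^2_2$ pdf")  #default loc=0, scale=1
plt.legend()
plt.show()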

I should mention this is somewhat cheating. We used the true mean and true covariance matrix to find outliers. In practice, we only have estimates (which depend on the outliers themselves!), hence we require more robust mean and covariance estimates; see this great paper.
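One readily available option is the Minimum Covariance Determinant estimator in scikit-learn; the sketch below (assuming scikit-learn is installed and reusing the circle array from above) shows how it could be plugged in, and is not the exact method from the paper:

#robust alternative: estimate location/covariance with MCD instead of using the true values
from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=111).fit(circle)
robust_d2 = mcd.mahalanobis(circle)        #squared Mahalanobis distances under the MCD fit
outliers = robust_d2 > chi2.ppf(0.995, 2)  #flag points beyond the 99.5% quantile
print("flagged", outliers.sum(), "of", len(circle), "points")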

Data above the 99.5% quantile line is shown in red below.

[Figures: the $\chi^2$ test for outliers; observed data, with red points considered outliers]

#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
import matplotlib.pyplot as plt
chi2 = stats.chi2
np.random.seed(111)


#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = np.array([[1, 0.], [0., 1.]])
circle = np.random.multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
#scipy's mahalanobis expects the *inverse* covariance matrix as its third argument
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], np.linalg.inv(covCircle))
d = np.array([mahalanobis(p) for p in circle]) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared

degrees_of_freedom = 2

threshold = chi2.ppf(0.995, degrees_of_freedom)  #99.5% quantile used to flag outliers

x = np.arange(len(d2))
colors = np.where(d2 > threshold, "red", "blue")  #points above the 99.5% quantile are drawn in red

plt.scatter(x, d2, c=colors)

plt.hlines(chi2.ppf(0.95, degrees_of_freedom), 0, len(d2), label=r"95% $\chi^2$ quantile", linestyles="solid")
plt.hlines(chi2.ppf(0.975, degrees_of_freedom), 0, len(d2), label=r"97.5% $\chi^2$ quantile", linestyles="dashed")
plt.hlines(threshold, 0, len(d2), label=r"99.5% $\chi^2$ quantile", linestyles="dotted")

plt.legend()
plt.ylabel("squared Mahalanobis distance")
plt.xlabel("observation")
plt.title(r'Detection of outliers at different $\chi^2$ quantiles')

plt.show()
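As a quick sanity check on the quantile lines: under the theoretical $\chi^2_2$ model, roughly 5%, 2.5%, and 0.5% of points should land beyond the 95%, 97.5%, and 99.5% quantiles, respectively. The short snippet below (reusing d2 and degrees_of_freedom from above) prints the observed fractions:

#fraction of points beyond each chi2 quantile; should be close to 0.05, 0.025, 0.005
for q in (0.95, 0.975, 0.995):
    frac = np.mean(d2 > chi2.ppf(q, degrees_of_freedom))
    print(f"beyond the {q:.1%} quantile: {frac:.3f} of points")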