Solved – Mahalanobis distance as measure of dissimilarity between strings (sequences)

distance-functions

I'm doing some research on methods for distance-based comparison of the composition of biological sequences (genes, proteins).

Suppose I have two strings (named X and Y) of different lengths, but from a finite alphabet (A, C, T, G):

X = 'ACGT'
Y = 'ACGTA'

The difference between two strings can be quantified by calculating the distance between their composition vectors. To do so, we count how many times each letter of the alphabet occurs in each string, obtaining two vectors of letter counts for the sequences:

x = [1,1,1,1] 
y = [2,1,1,1]

Then I can calculate the Euclidean distance:

d(x,y) = [(1-2)^2 + (1-1)^2 + (1-1)^2 + (1-1)^2]^0.5 = 1^0.5 = 1
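In code, building the count vectors and computing the Euclidean distance between them might look like the sketch below (the helper name `letter_counts` is mine, not part of any library):

```python
from collections import Counter

import numpy as np

ALPHABET = "ACGT"  # the finite alphabet from the question

def letter_counts(seq):
    """Count vector of seq, ordered as in ALPHABET."""
    counts = Counter(seq)
    return np.array([counts[ch] for ch in ALPHABET], dtype=float)

x = letter_counts("ACGT")    # [1., 1., 1., 1.]
y = letter_counts("ACGTA")   # [2., 1., 1., 1.]
d = np.linalg.norm(x - y)    # Euclidean distance: 1.0
```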

I can't figure out how to calculate the Mahalanobis distance. I would be grateful if someone could use my example to show me how to calculate it.

Best Answer

Manual calculation of Mahalanobis Distance is simple but unfortunately a bit lengthy:

>>> # imports used throughout this answer:
>>> import numpy as NP
>>> from numpy import linalg as LA

>>> # here's the formula i'll use (note: this quantity is the *squared*
>>> # Mahalanobis distance; take its square root for the distance itself):
>>> md_sq = (x - y) * LA.inv(R) * (x - y).T

In other words, the squared Mahalanobis distance is the difference of the 2 data vectors multiplied by the inverse of the covariance matrix, multiplied by the transpose of that same difference (of the same 2 vectors, x & y); taking the square root gives the distance itself.

>>> # your 2 data points whose Mahalanobis distance you wish to calculate
>>> x = NP.mat("1 1 1 1")
>>> y = NP.mat("2 1 1 1")

>>> # not enough data supplied in the OP to properly calculate a covariance matrix,
>>> # so we'll make some up--10 rows of data points of the same dimension as x & y
>>> # partition your data into classes (e.g., if you have two classes,
>>> # put all class I data points in one array & all class II points in another)

>>> # for instance, pretend 'a' below is the matrix of your data points
>>> # (like x & y) all assigned to the same class
>>> a = NP.random.randint(0, 5, 40).reshape(10, 4)
>>> a
  array([[1, 2, 2, 1],
         [3, 0, 4, 4],
         [2, 3, 1, 1],
         [1, 0, 3, 0],
         [4, 4, 3, 2],
         [4, 0, 0, 4],
         [4, 4, 0, 1],
         [4, 1, 2, 1],
         [4, 0, 3, 4],
         [2, 2, 4, 1]])

>>> # "mean center" this data prior to calculating covariance matrix
>>> mx = NP.mean(a, axis=0)
>>> a1 = a - mx

>>> # sanity check:
>>> NP.mean(a1, axis=0)
  array([ 0., -0., -0.,  0.])

>>> # calculate the covariance matrix of the mean-centered data matrix, a1
>>> # (NP.cov, not NP.corrcoef--the latter returns the *correlation* matrix,
>>> # which is not what the Mahalanobis formula calls for)
>>> R = NP.cov(a1, rowvar=0)
>>> R
  array([[ 1.656,  0.178, -0.533,  1.1  ],
         [ 0.178,  2.711, -0.689, -1.156],
         [-0.533, -0.689,  2.178,  0.133],
         [ 1.1  , -1.156,  0.133,  2.322]])

>>> # quick sanity check(s):
>>> # (i) is the cov matrix n x n (here 4 x 4, one row/column per dimension)? and
>>> # (ii) is it symmetric, with the variance of each column on the main diagonal?

>>> # repeat those 2 steps (mean center + calculate covariance matrix)
>>> # for the other data matrices (comprised of the data points
>>> # in the remaining classes)

>>> # next calculate the 'pooled covariance matrix' by taking the weighted average
>>> # of these covariance matrices (weighted according to the number of rows in
>>> # the original data matrices used to calculate them)

>>> # convert the element-wise NumPy array to a linear algebra matrix
>>> R = NP.matrix(R)

>>> # calculate the inverse of the (pooled) covariance matrix
>>> RI = LA.inv(R)

>>> # now just plug the values into the formula recited near the top;
>>> # we'll do it step-wise so we can see the intermediate results
>>> # another sanity check: we are calculating a distance, so the final
>>> # result should be a 1 x 1 matrix (a scalar)

>>> xy_diff = x - y
>>> t = xy_diff * RI      # ('t', not 'a'--'a' already holds the data matrix
>>> t
  matrix([[-1.229,  0.348, -0.238,  0.769]])

>>> md_sq = t * xy_diff.T
>>> md_sq
  matrix([[1.229]])     # the squared Mahalanobis distance for the 2 vectors, x & y

>>> NP.sqrt(md_sq)
  matrix([[1.109]])     # the Mahalanobis distance itself
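The step-by-step session above can be collapsed into a couple of small helpers, including the pooled-covariance step for the multi-class case. This is a sketch under my own naming, written in modern NumPy style (plain arrays rather than NP.matrix), with the square root applied at the end:

```python
import numpy as np

def pooled_cov(*groups):
    """Weighted average of the per-group covariance matrices,
    weighted by the number of rows in each group."""
    n_total = sum(len(g) for g in groups)
    return sum(len(g) * np.cov(g, rowvar=False) for g in groups) / n_total

def mahalanobis(x, y, cov):
    """Mahalanobis distance between 1-D vectors x and y,
    given a covariance matrix (square root already applied)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# usage: with the identity as covariance matrix, the Mahalanobis
# distance reduces to the ordinary Euclidean distance
x = [1, 1, 1, 1]
y = [2, 1, 1, 1]
print(mahalanobis(x, y, np.eye(4)))   # 1.0
```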

Other (faster) ways to calculate Mahalanobis distance:

The excellent matrix computation mega-library for Python, SciPy, has a module "spatial" which includes a good Mahalanobis function (scipy.spatial.distance.mahalanobis). I can recommend this highly (both the library and the function); I have used this function many times and on several occasions I cross-verified the results against those from other libraries.
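For example, scipy.spatial.distance.mahalanobis takes the two vectors plus the *inverse* of the covariance matrix (the VI argument), and returns the distance with the square root already applied. A minimal usage sketch, passing the identity matrix as the inverse covariance so the result reduces to the Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

x = np.array([1.0, 1.0, 1.0, 1.0])
y = np.array([2.0, 1.0, 1.0, 1.0])

# VI is the inverse of the covariance matrix; with the identity
# matrix here, the Mahalanobis distance equals the Euclidean one
VI = np.eye(4)
print(mahalanobis(x, y, VI))   # 1.0
```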

Or you can use R, which has a built-in function of the same name, mahalanobis, to calculate M/D. A concise and useful help page for this function can be accessed by typing in the R interpreter:

?mahalanobis

Finally, I am quite sure that other formulations of Mahalanobis distance can be found in various R libraries, particularly in some of the libraries of the Bioconductor Project (a huge set of R libraries, or "packages", for the quantitative study of the life sciences). The reason I mention this is that these domain-specific formulations are likely to include helper functions that save time on the tedious predicate steps, e.g., mean-centering and calculating the weighted-average covariance matrix.