I'm researching distance-based methods for comparing the composition of biological sequences (genes, proteins).
Suppose I have two strings (named X and Y) of different lengths, but from a finite alphabet (A, C, T, G):
X = 'ACGT'
Y = 'ACGTA'
The difference between two strings can be quantified by calculating the distance between their letter-composition vectors. To do so, we can count how many times each letter from the alphabet is present in each string. We obtain two vectors representing letter counts for the sequences:
x = [1,1,1,1]
y = [2,1,1,1]
Then I can calculate Euclidean distance:
d(x,y) = [(1-2)^2 + (1-1)^2 + (1-1)^2 + (1-1)^2]^0.5 = 1^0.5 = 1
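The counting step and the Euclidean distance above can be sketched in Python as follows. This is only an illustration: the helper names composition and euclidean are hypothetical, and the letter order A, C, G, T for the count vectors is an assumption.

```python
import math

ALPHABET = "ACGT"  # assumed letter order for the count vectors


def composition(seq, alphabet=ALPHABET):
    """Count how often each alphabet letter occurs in seq (hypothetical helper)."""
    return [seq.count(letter) for letter in alphabet]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


x = composition("ACGT")   # letter counts of X
y = composition("ACGTA")  # letter counts of Y
d = euclidean(x, y)
print(x, y, d)
```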
I can't figure out how to calculate the Mahalanobis distance. I would be grateful if someone could use my example to show me how to calculate it.
Best Answer
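Manual calculation of Mahalanobis distance is simple but unfortunately a bit lengthy:
d(x,y) = [(x - y) S^-1 (x - y)^T]^0.5
In other words, Mahalanobis distance is the square root of the difference (of the 2 data vectors) multiplied by the inverse of the covariance matrix S, multiplied by the transpose of the difference (of the same 2 vectors, x & y). Note that S has to be estimated from a larger collection of count vectors; two vectors alone do not give an invertible covariance matrix.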
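A minimal NumPy sketch of the manual calculation might look like this. The reference matrix below is made-up illustrative data (one row of letter counts per sequence) used only to estimate a covariance matrix; it does not come from the question, and the function name mahalanobis_manual is hypothetical.

```python
import numpy as np


def mahalanobis_manual(x, y, cov):
    """sqrt((x - y) S^-1 (x - y)^T) for a given covariance matrix S."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    cov_inv = np.linalg.inv(cov)  # raises LinAlgError if S is singular
    return float(np.sqrt(diff @ cov_inv @ diff))


# Made-up count vectors for six sequences (columns: A, C, G, T).
reference = np.array([
    [1, 1, 1, 1],
    [2, 1, 1, 1],
    [3, 2, 1, 2],
    [1, 3, 2, 1],
    [2, 2, 4, 1],
    [1, 1, 2, 3],
], dtype=float)

# rowvar=False: each column is a variable, each row an observation.
S = np.cov(reference, rowvar=False)

x = [1, 1, 1, 1]
y = [2, 1, 1, 1]
d_m = mahalanobis_manual(x, y, S)
print(d_m)
```

With the identity matrix in place of S, the same function reduces to the Euclidean distance, which is a quick sanity check.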
Other (faster) ways to calculate Mahalanobis distance:
SciPy, the excellent matrix computation mega-library for Python, has a module "spatial" which includes a good Mahalanobis function (scipy.spatial.distance.mahalanobis). I can recommend this highly (both the library and the function); I have used this function many times, and on several occasions I cross-verified the results with those from other libraries.
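As a rough sketch of the SciPy route: scipy.spatial.distance.mahalanobis(u, v, VI) takes the two vectors and the inverse of the covariance matrix. The reference matrix here is made-up illustrative data, assumed only so that a covariance matrix can be estimated.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Made-up count vectors for six sequences (columns: A, C, G, T).
reference = np.array([
    [1, 1, 1, 1],
    [2, 1, 1, 1],
    [3, 2, 1, 2],
    [1, 3, 2, 1],
    [2, 2, 4, 1],
    [1, 1, 2, 3],
], dtype=float)

# SciPy expects the INVERSE covariance matrix as its third argument.
VI = np.linalg.inv(np.cov(reference, rowvar=False))

x = [1, 1, 1, 1]
y = [2, 1, 1, 1]
d = mahalanobis(x, y, VI)
print(d)
```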
Or you can use R, which has a built-in function of the same name, mahalanobis, to calculate the Mahalanobis distance. A concise and useful help page for this function can be accessed by typing in the R interpreter:
?mahalanobis
Finally, I am quite sure that other formulations of Mahalanobis distance can be found in various R libraries, particularly in the Bioconductor Project, which contains a huge set of R libraries, or "packages", for the quantitative study of life sciences. The reason I mention this is that these domain-specific formulations are likely to have helper functions that save time on the tedious preliminary steps, e.g., mean-centering and calculating the weighted average covariance matrix.
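For a sense of what those preliminary steps involve, here is a small NumPy sketch. It assumes that "weighted average covariance matrix" means the usual pooled covariance, where each group's covariance is weighted by its degrees of freedom (n - 1). The two groups are made-up illustrative data and pooled_covariance is a hypothetical helper name.

```python
import numpy as np


def pooled_covariance(groups):
    """Average the per-group covariances, weighting each by (n_i - 1)."""
    weighted_sum = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    dof = sum(len(g) - 1 for g in groups)
    return weighted_sum / dof


# Made-up count vectors for two groups of sequences (columns: A, C, G, T).
group_a = np.array([[1, 1, 1, 1], [2, 1, 1, 1], [3, 2, 1, 2]], dtype=float)
group_b = np.array([[1, 3, 2, 1], [2, 2, 4, 1], [1, 1, 2, 3]], dtype=float)

# Mean-centering: subtract each group's column means from its rows.
centered_a = group_a - group_a.mean(axis=0)
centered_b = group_b - group_b.mean(axis=0)

pooled = pooled_covariance([group_a, group_b])
print(pooled)
```

With so few sequences per group the pooled matrix may not be invertible, so a real analysis would need more observations than this toy example uses.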