Solved – Similarity scoring to compare multi-dimensional datasets

computational-statistics, distance-functions, euclidean, jaccard-similarity, similarities

I am trying to come up with a mechanism for scoring a set of multidimensional datasets based on their similarity to an ideal dataset. Every dataset has the same dimensions as the ideal.

Thus dataset $I$ is the ideal, with $m$ rows and $n$ columns. I would like to compare datasets $D^1, D^2, D^3, \dots$ based on how $D^k_{i,j}$ compares with $I_{i,j}$, and come up with a similarity measure that I could use for scoring. I am thinking that a Euclidean distance measure of
$$
\sum_{i,j} (D^k_{i,j}-I_{i,j})^2\,,
$$
for matching keys may be the way to go, but I would like some suggestions on how to approach this problem. I have also heard about the Mahalanobis distance approach. Any pointers would be welcome.
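
In code, I picture something like this rough NumPy sketch (the names `D` and `I_ideal` are just placeholders for one of my datasets and the ideal):

```python
import numpy as np

def score(D, I_ideal):
    """Sum of squared entrywise differences between a dataset and the ideal.

    Both arguments are m x n arrays whose rows and columns line up
    on the same keys.
    """
    return float(np.sum((np.asarray(D) - np.asarray(I_ideal)) ** 2))
```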

Thanks in advance.

Best Answer

There are many different ways to calculate the distance between datasets, but at the beginning it can be hard to get an overview, because many different names are used for closely related ideas. It also depends on how rigorous your math needs to be (e.g., look up "metric", "norm", and "distance").

If you just need distances in Euclidean space, have a look at the Wikipedia article:

1-norm distance = $\sum_{i=1}^n \left| x_i - y_i \right|$

2-norm distance = $\left( \sum_{i=1}^n \left| x_i - y_i \right|^2 \right)^{1/2}$

p-norm distance = $\left( \sum_{i=1}^n \left| x_i - y_i \right|^p \right)^{1/p}$

$\infty$-norm distance = $\lim_{p \to \infty} \left( \sum_{i=1}^n \left| x_i - y_i \right|^p \right)^{1/p} = \max \left(|x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \right)$.

Which one you use depends on your needs; these distances all have different meanings: the $L_1$-norm, for example, is the so-called "taxi-cab" distance, the $L_2$-norm is the Euclidean distance, and so on. Maybe have a look at a statistics or machine learning textbook to read up on the differences.
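
As a minimal sketch of these formulas (assuming your datasets are NumPy arrays; the function name `pnorm_distance` is just illustrative):

```python
import numpy as np

def pnorm_distance(x, y, p=2):
    """p-norm distance between two arrays of the same shape.

    p=1 gives the taxi-cab distance, p=2 the Euclidean distance,
    and p=np.inf the maximum absolute entrywise difference.
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)).ravel()
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(pnorm_distance(x, y, p=1))       # 6.0
print(pnorm_distance(x, y, p=2))       # sqrt(14) ~ 3.742
print(pnorm_distance(x, y, p=np.inf))  # 3.0
```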

Note that in general you want to normalize your distance, so that it doesn't depend on the number of data points. Therefore, you should calculate the mean of these distances over the whole dataset. This means your $$ \sum_{i,j} (D^k_{i,j}-I_{i,j})^2 $$ should in fact be $$ \frac{1}{N}\sum_{i,j} (D^k_{i,j}-I_{i,j})^2\,, $$ where $N = mn$ is the total number of entries being compared.
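
In NumPy terms, that normalized score is simply the mean squared difference (again just a sketch with placeholder names):

```python
import numpy as np

def mean_squared_difference(D, I_ideal):
    """Mean of squared entrywise differences; comparable across dataset sizes."""
    D = np.asarray(D, dtype=float)
    I_ideal = np.asarray(I_ideal, dtype=float)
    return float(np.mean((D - I_ideal) ** 2))  # np.mean divides by N = m * n
```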

The Mahalanobis distance applies when your data come with a covariance structure (e.g., a Gaussian distribution fitted to them) rather than being treated as bare points. It is essentially the $L_2$-norm weighted by the precision (inverse covariance) matrix of the distribution -- but this may lead too far, as I guess you don't need it.
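
If you do end up needing it, here is a minimal sketch of the Mahalanobis distance for a single point (it assumes you can supply or estimate a covariance matrix `cov`; all names are illustrative):

```python
import numpy as np

def mahalanobis_distance(x, mu, cov):
    """Mahalanobis distance of point x from mean mu under covariance cov."""
    delta = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    precision = np.linalg.inv(cov)  # precision = inverse covariance matrix
    return float(np.sqrt(delta @ precision @ delta))
```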
