Solved – Similarity scoring to compare multi-dimensional datasets

computational-statistics, distance-functions, euclidean, jaccard-similarity, similarities

I am trying to come up with a mechanism for scoring a set of multidimensional datasets based on their similarity to an ideal dataset. Every dataset has the same dimensions as the ideal.

Thus dataset $I$ is the ideal, with $m$ rows and $n$ columns. I would like to compare datasets $D^1, D^2, D^3, \dots$ based on how $D^k_{i,j}$ compares with $I_{i,j}$, and come up with a similarity measure that I could use for scoring. I am thinking that a Euclidean distance measure of
$$
\sum_{i,j} (D^k_{i,j}-I_{i,j})^2\,,
$$
for matching keys may be the way to go, but I would like some suggestions on how to approach this problem. I have also heard about the Mahalanobis distance approach. Any pointers would be welcome.
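
In code, I picture something like this rough NumPy sketch (the names `D` and `I_ideal` are just placeholders for one of my datasets and the ideal):

```python
import numpy as np

def score(D, I_ideal):
    """Sum of squared entrywise differences between a dataset and the ideal.

    Both arguments are m x n arrays whose rows and columns line up
    on the same keys.
    """
    return float(np.sum((np.asarray(D) - np.asarray(I_ideal)) ** 2))
```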

Thanks in advance.

Best Answer

There are many different ways to calculate the distance between datasets, but at the beginning it can be hard to get an overview, because many different names are used for closely related ideas. It also depends on how rigorous your math needs to be (e.g., look up "metric", "norm", and "distance").

If you just need distances in Euclidean space, have a look at the Wikipedia article:

1-norm distance = $\sum_{i=1}^n \left| x_i - y_i \right|$

2-norm distance = $\left( \sum_{i=1}^n \left| x_i - y_i \right|^2 \right)^{1/2}$

p-norm distance = $\left( \sum_{i=1}^n \left| x_i - y_i \right|^p \right)^{1/p}$

$\infty$-norm distance = $\lim_{p \to \infty} \left( \sum_{i=1}^n \left| x_i - y_i \right|^p \right)^{1/p} = \max \left(|x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \right)$.

Which one you use depends on your needs; these distances all have different meanings: the $L_1$-norm, for example, is the so-called "taxi-cab" distance, the $L_2$-norm is the Euclidean distance, and so on. Maybe have a look at a statistics or machine learning textbook to read up on the differences.
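
As a minimal sketch of these formulas (assuming your datasets are NumPy arrays; the function name `pnorm_distance` is just illustrative):

```python
import numpy as np

def pnorm_distance(x, y, p=2):
    """p-norm distance between two arrays of the same shape.

    p=1 gives the taxi-cab distance, p=2 the Euclidean distance,
    and p=np.inf the maximum absolute entrywise difference.
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)).ravel()
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(pnorm_distance(x, y, p=1))       # 6.0
print(pnorm_distance(x, y, p=2))       # sqrt(14) ~ 3.742
print(pnorm_distance(x, y, p=np.inf))  # 3.0
```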

Note that in general you want to normalize your distance, so that it doesn't depend on the number of data points. Therefore, you should calculate the mean of these distances over the whole dataset. This means your $$ \sum_{i,j} (D^k_{i,j}-I_{i,j})^2 $$ should in fact be $$ \frac{1}{N}\sum_{i,j} (D^k_{i,j}-I_{i,j})^2\,, $$ where $N = mn$ is the total number of entries being compared.
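
In NumPy terms, that normalized score is simply the mean squared difference (again just a sketch with placeholder names):

```python
import numpy as np

def mean_squared_difference(D, I_ideal):
    """Mean of squared entrywise differences; comparable across dataset sizes."""
    D = np.asarray(D, dtype=float)
    I_ideal = np.asarray(I_ideal, dtype=float)
    return float(np.mean((D - I_ideal) ** 2))  # np.mean divides by N = m * n
```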

The Mahalanobis distance applies when your data come with a covariance structure (e.g., a Gaussian distribution fitted to them) rather than being treated as bare points. It is essentially the $L_2$-norm weighted by the precision (inverse covariance) matrix of the distribution -- but this may lead too far, as I guess you don't need it.
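
If you do end up needing it, here is a minimal sketch of the Mahalanobis distance for a single point (it assumes you can supply or estimate a covariance matrix `cov`; all names are illustrative):

```python
import numpy as np

def mahalanobis_distance(x, mu, cov):
    """Mahalanobis distance of point x from mean mu under covariance cov."""
    delta = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    precision = np.linalg.inv(cov)  # precision = inverse covariance matrix
    return float(np.sqrt(delta @ precision @ delta))
```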
