Solved – Similarity measure between two variables

correlation, similarities

I want to find the similarity factor (some numerical value) between two variables.

Example:

row 1: 5.1, 3.5, 1.4, 0.2

row 2: 4.9, 3.1, 1.5, 0.1

How can I find a similarity factor between these two variables?

There is correlation, but other than that, what are the options?

This data is taken from the Iris flower dataset. We are performing some data clustering on this data set. Our task is to find/measure the similarity between rows without using correlation. So, other than correlation, what are the options?

Best Answer

Qualifications

It so happens that in the Iris data set the rows (as this data set is usually presented) are values on four variables, all with the same dimensions and units. However, I will not assume reference to this specific data set.

For more on that data set, one starting point is

What aspects of the Iris data set make it so successful ...

Moreover, your question title asks about similarity between variables (features, attributes, etc.), but the specific details hint at an interest in similarity between observations (items, cases, etc.). I will focus on measuring similarity of variables, particularly given your specific mention of correlation, which reflects a common misunderstanding of correlation.

Note that what appears as rows and what appears as columns in data is a matter of convention or convenience and is otherwise not fundamental. In other words, a data set can always be transposed.

Correlation does not measure similarity

Contrary to your statement, correlation does not measure similarity if similarity means that the highest value of a measure is achieved if and only if all values are identical. (Anyone can reverse the game: define a measure and then give it some name from their language as a label. Examples abound in all sciences.)

The first argument against that is that correlation can be applied to variables which are in quite different units, so that it is then nonsensical to ask whether values are similar. So, if the variables are rainfall and wheat yield, the units of measurement are different; correlation can be calculated so long as there are paired values, but it makes no sense to ask whether 20 mm rainfall is similar to 20 kg/ha wheat yield.

The second argument against that is that you can achieve perfect correlations with value $1$ between $y$ and $x$ so long as $y = a + bx$ for any $a$ and any positive $b$. So $10^\text{anything} y$ and $y$ have correlation 1 but their values are similar only if the exponent is close to 0.
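As a minimal sketch of this second argument, the following Python snippet computes the Pearson correlation between the first row quoted in the question and a linear transformation of it. The helper name `pearson` and the constants $a = 100$, $b = 1000$ are assumptions chosen purely for illustration; any $a$ and any positive $b$ would do.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [5.1, 3.5, 1.4, 0.2]          # first row from the question
y = [100 + 1000 * v for v in x]   # y = a + b x with illustrative a = 100, b = 1000

r = pearson(x, y)                                  # equals 1 up to rounding error
max_gap = max(abs(a - b) for a, b in zip(x, y))    # largest difference between paired values

print(r, max_gap)
```

The correlation is 1 up to floating-point rounding, yet the paired values differ by hundreds to thousands, so a perfect correlation says nothing about whether the values themselves are close.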

Similarity can be defined in many ways: you need to choose

To your question: you need to firm up quite what you mean by similarity, but for variables $x, y$ on the same measurement scale, summary measures of similarity based on the differences $x - y$ could all make sense; measures based on the ratios $y/x$ or $x/y$ could all make sense so long as all values are of the same sign and not zero; measures based on comparing $\log y$ and $\log x$ could make sense so long as all values are positive. Further, you have to decide whether you want your measure to have the same units as the original variable, or to be free of units so that the similarity between different variables can be compared.

For the Iris data all these conditions are satisfied.
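To make the three families above concrete, here is a hedged sketch applied to the two rows quoted in the question, treated simply as two sequences of positive values on a common scale. The particular summaries (mean absolute difference, root mean square difference, elementwise ratios, mean absolute log ratio) and the variable names are illustrative choices, not the only possible ones.

```python
import math

x = [5.1, 3.5, 1.4, 0.2]  # row 1 from the question
y = [4.9, 3.1, 1.5, 0.1]  # row 2 from the question
n = len(x)

# Difference-based summaries: these keep the units of the original values.
mean_abs_diff = sum(abs(a - b) for a, b in zip(x, y)) / n
rms_diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)

# Ratio-based: unit-free; requires all values nonzero and of the same sign.
ratios = [b / a for a, b in zip(x, y)]

# Log-based: unit-free; requires all values positive.
mean_abs_log_ratio = sum(abs(math.log(b) - math.log(a)) for a, b in zip(x, y)) / n

print(mean_abs_diff, rms_diff)
print(ratios)
print(mean_abs_log_ratio)
```

Different choices emphasize different things. The last pair (0.2 and 0.1) has one of the smallest absolute differences, but its ratio is the furthest from 1, so it dominates the log-based summary while contributing little to the difference-based ones.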

Indeed, the point of the exercise may well be to underline that the vague concept of similarity can be made precise in many different ways.