Solved – Similarity measure between two variables

correlation, similarities

I want to find the similarity factor (some numerical value) between two variables.

Example:

row 1: 5.1, 3.5, 1.4, 0.2

row 2: 4.9, 3.1, 1.5, 0.1

How can I find a similarity factor between these two variables?

There is correlation but other than that what are the options?

This data is taken from the Iris flower dataset. We are performing some data clustering on this data set. Our task is to find/measure the similarity between rows without using correlation. So, other than correlation, what are the options?

Best Answer

Qualifications

It so happens that in the Iris data set the rows (as this data set is usually presented) are values on four variables, all with the same dimensions and units. However, I will not assume reference to this specific data set.

For more on that data set, one starting point is

What aspects of the Iris data set make it so successful ...

Moreover, your question title asks about similarity between variables (features, attributes, etc.), but the specific details hint at an interest in similarity between observations (items, cases, etc.). I will focus on measuring similarity of variables, particularly given your specific mention of correlation, which reflects a common misunderstanding of correlation.

Note that what appears as rows and what appears as columns in data is a matter of convention or convenience and is otherwise not fundamental. In other words, a data set can always be transposed.

Correlation does not measure similarity

Contrary to your statement, correlation does not measure similarity if similarity means that the highest value of a measure is achieved if and only if all values are identical. (Anyone can reverse the game: define a measure and then give it some name from their language as a label. Examples abound in all sciences.)

The first argument against that is that correlation can be applied to variables which are in quite different units, so that it is then nonsensical to ask whether values are similar. So, if the variables are rainfall and wheat yield, the units of measurement are different; correlation can be calculated so long as there are paired values, but it makes no sense to ask whether 20 mm rainfall is similar to 20 kg/ha wheat yield.

The second argument against that is that you can achieve perfect correlations with value $1$ between $y$ and $x$ so long as $y = a + bx$ for any $a$ and any positive $b$. So $10^\text{anything} y$ and $y$ have correlation 1 but their values are similar only if the exponent is close to 0.
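To spell out why, here is a short derivation using the usual sample definition of the Pearson correlation $r$. The notation $\bar x$, $\bar y$ for the sample means is introduced here for illustration and is not part of the original answer. If $y_i = a + b x_i$, then $\bar y = a + b \bar x$, so

$$y_i - \bar y = b\,(x_i - \bar x)$$

and therefore

$$r = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2 \, \sum_i (y_i - \bar y)^2}} = \frac{b \sum_i (x_i - \bar x)^2}{\sqrt{b^2}\, \sum_i (x_i - \bar x)^2} = \frac{b}{|b|} = 1 \quad \text{for } b > 0.$$

Neither $a$ nor the size of $b$ survives in $r$, yet those are exactly the quantities that decide whether the values of $y$ and $x$ are close to each other.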

Similarity can be defined in many ways: you need to choose

To your question: you need to firm up quite what you mean by similarity, but for variables $x, y$ on the same measurement scale, summary measures of similarity based on the differences $x - y$ could all make sense; measures based on the ratios $y/x$ or $x/y$ could all make sense so long as all values are of the same sign and not zero; measures based on comparing $\log y$ and $\log x$ could make sense so long as all values are positive. Further, you have to decide whether you want your measure to have the same units as the original variable, or to be free of units so that the similarity between different variables can be compared.

For the Iris data all these conditions are satisfied.
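As a rough illustration of two of the families mentioned above, the sketch below computes one difference-based measure (in the original units) and one log-ratio-based measure (unit-free). The function names `mean_abs_difference` and `mean_abs_log_ratio` are made up for this example, and these two summaries are only one choice among many; both are dissimilarities, so smaller means more similar.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute difference: expressed in the same units as the data,
// and zero if and only if the two series are identical.
double mean_abs_difference(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    assert(x.size() == y.size() && !x.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        total += std::abs(x[i] - y[i]);
    return total / static_cast<double>(x.size());
}

// Mean absolute log ratio: unit-free, symmetric in x and y,
// and only defined when every value is strictly positive.
double mean_abs_log_ratio(const std::vector<double>& x,
                          const std::vector<double>& y)
{
    assert(x.size() == y.size() && !x.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(x[i] > 0.0 && y[i] > 0.0);
        total += std::abs(std::log(y[i] / x[i]));
    }
    return total / static_cast<double>(x.size());
}
```

Applied to the two series in the question, `{5.1, 3.5, 1.4, 0.2}` and `{4.9, 3.1, 1.5, 0.1}`, the first function averages the absolute differences 0.2, 0.4, 0.1 and 0.1. The log-ratio version weights the last pair (0.2 versus 0.1, a factor of 2) most heavily, which shows how the choice of measure changes what counts as "similar".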

Indeed, the point of the exercise may well be to underline that the vague concept of similarity can be made precise in many different ways.

Related Solutions

Solved – A reliable measure of series similarity – correlation just doesn’t cut it for me

The two most common methods (in my experience) for comparing signals are the correlation and the mean squared error. Informally, if you imagine your signal as a point in some N-dimensional space (this tends to be easier if you imagine them as 3D points) then the correlation measures whether the points are in the same direction (from the "origin") and the mean squared error measures whether the points are in the same place (independent of the origin as long as both signals have the same origin). Which works better depends somewhat on the types of signal and noise in your system.

The MSE appears to be roughly equivalent to your example:

// Mean squared error between signals x and y, each of length N
double mse = 0.0;
for (int i = 0; i < N; ++i)
    mse += (x[i] - y[i]) * (x[i] - y[i]);
mse /= N;

Note, however, that this isn't really the Pearson correlation, which would be more like

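// Pearson product-moment correlation coefficient (needs <cmath>)
double xx = 0.0;  // sum of squared deviations of x
double xy = 0.0;  // sum of cross-products of the deviations
double yy = 0.0;  // sum of squared deviations of y

for (int i = 0; i < N; ++i)
{
    xx += (x[i] - x_mean) * (x[i] - x_mean);
    xy += (x[i] - x_mean) * (y[i] - y_mean);
    yy += (y[i] - y_mean) * (y[i] - y_mean);
}

double ppmcc = xy / std::sqrt(xx * yy);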

given the signal means x_mean and y_mean. This is fairly close to the pure correlation:

// Pure correlation: plain sum of products, no centring or normalisation
double corr = 0.0;
for (int i = 0; i < N; ++i)
    corr += x[i] * y[i];

However, I think the Pearson correlation will be more robust when the signals have a strong DC component, because the mean is subtracted. It is also normalised, so a scaling in one of the signals will not cause a proportional increase in the correlation.
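The difference can be checked with a small self-contained sketch. The function names and the helper `scale_and_shift` are hypothetical, chosen for this example only; the logic follows the two fragments above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pure correlation: plain sum of products, no centring or normalisation.
double raw_correlation(const std::vector<double>& x,
                       const std::vector<double>& y)
{
    assert(x.size() == y.size());
    double corr = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        corr += x[i] * y[i];
    return corr;
}

// Pearson correlation: means are subtracted and the result is normalised.
// Assumes neither signal is constant.
double pearson_correlation(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());
    double x_mean = 0.0, y_mean = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        x_mean += x[i];
        y_mean += y[i];
    }
    x_mean /= n;
    y_mean /= n;

    double xx = 0.0, xy = 0.0, yy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        xx += (x[i] - x_mean) * (x[i] - x_mean);
        xy += (x[i] - x_mean) * (y[i] - y_mean);
        yy += (y[i] - y_mean) * (y[i] - y_mean);
    }
    return xy / std::sqrt(xx * yy);
}

// Returns scale * s[i] + offset, i.e. a gain change plus a DC offset.
std::vector<double> scale_and_shift(const std::vector<double>& s,
                                    double scale, double offset)
{
    std::vector<double> out;
    out.reserve(s.size());
    for (double v : s)
        out.push_back(scale * v + offset);
    return out;
}
```

For example, with `x = {1, 2, 3, 4}` and `y = {2, 1, 4, 3}`, passing `scale_and_shift(y, 2.0, 100.0)` instead of `y` changes `raw_correlation` from 28 to 1056, while `pearson_correlation` stays at 0.6 in both cases.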

Finally, if the particular example in your question is a problem then you could also consider the mean absolute error (L1 norm):

mae = 0;
for( int i=0; i<N; ++i )
    mae += std::abs(x[i]-y[i]);
mae /= N;

I'm aware of all three approaches being used in various signal and image processing applications; without knowing more about your particular application, I couldn't say which would be likely to work best. I would note that the MAE and the MSE are less sensitive to exactly how the data are presented to them, but if the mean error is not really the metric you're interested in then they won't give you the results you're looking for. The correlation approaches can be better if you're more interested in the "direction" of your signal than the actual values involved; however, they are more sensitive to how the data are presented and almost certainly require some centring and normalisation to give the results you expect.

You might want to look up Phase Correlation, Cross Correlation, Normalised Correlation and Matched Filters. Most of these are used to match some sub-signal in a larger signal with some unknown time lag, but in your case you could just use the value they give for zero time lag if you know there is no lag between the two signals.
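As a rough sketch of the lag idea (not any specific library's implementation), the code below computes a normalised correlation between `x[i]` and `y[i + lag]` over the overlapping samples and then scans a range of lags. The function names and the convention that a positive lag means `y` is delayed relative to `x` are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson-style correlation between x[i] and y[i + lag], computed over the
// samples where both indices are valid. Returns 0 when the overlap is
// shorter than two samples or when either side is constant.
double normalised_correlation_at_lag(const std::vector<double>& x,
                                     const std::vector<double>& y,
                                     std::ptrdiff_t lag)
{
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    const std::ptrdiff_t begin = std::max<std::ptrdiff_t>(0, -lag);
    const std::ptrdiff_t end = std::min<std::ptrdiff_t>(n, n - lag);
    if (end - begin < 2)
        return 0.0;

    const double count = static_cast<double>(end - begin);
    double x_mean = 0.0, y_mean = 0.0;
    for (std::ptrdiff_t i = begin; i < end; ++i) {
        x_mean += x[i];
        y_mean += y[i + lag];
    }
    x_mean /= count;
    y_mean /= count;

    double xx = 0.0, xy = 0.0, yy = 0.0;
    for (std::ptrdiff_t i = begin; i < end; ++i) {
        const double dx = x[i] - x_mean;
        const double dy = y[i + lag] - y_mean;
        xx += dx * dx;
        xy += dx * dy;
        yy += dy * dy;
    }
    if (xx == 0.0 || yy == 0.0)
        return 0.0;
    return xy / std::sqrt(xx * yy);
}

// Lag in [-max_lag, max_lag] with the highest normalised correlation.
// Ties keep the earlier candidate, starting from lag 0.
std::ptrdiff_t best_lag(const std::vector<double>& x,
                        const std::vector<double>& y,
                        std::ptrdiff_t max_lag)
{
    std::ptrdiff_t best = 0;
    double best_value = normalised_correlation_at_lag(x, y, 0);
    for (std::ptrdiff_t lag = -max_lag; lag <= max_lag; ++lag) {
        const double value = normalised_correlation_at_lag(x, y, lag);
        if (value > best_value) {
            best_value = value;
            best = lag;
        }
    }
    return best;
}
```

If the two signals are known to be aligned, calling `normalised_correlation_at_lag(x, y, 0)` alone gives the zero-lag value mentioned above; `best_lag` is only needed when the offset is unknown.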

Solved – How to find similarities between time series

What you have is K (5) groups, each with a dependent series (water temp) and an independent series (air temp). This problem is called Pooled Cross-Sectional Time Series Analysis. Construct a separate Transfer Function (ARMAX) model for each of the K groups. Identify a common model (outlier resistant) that would be appropriate. Estimate that model globally using all of the data and then perform an F test to test the hypothesis of a common set of parameters. Upon finding a statistically significant F value, examine the coefficients to determine which groups (of the K) are similar. My current research has been to develop an automatic test for this, and we have it operational in a current Beta Version of AUTOBOX (http://www.autobox.com). I would be glad to demonstrate this for you; please post your data. Upon finding out how AUTOBOX conducts this test, you might be able to program it yourself or at least have a "destination". Hope this helps.

ADDITIONAL COMMENTS USING KATE'S DATA:

I took the first 4 groups (depths of 0.5, 4, 6 and 8) and used the first 52 values to construct this example analysis. Four graphs depicted Y (water temp) and X (air temp) over time for the 4 depths [figures for depths 1 to 4 not available]. An analysis of the within relationship of Y versus X yielded a typical model [model equation image not available]. I elected to add another AR term to the noise for purposes of a more general expression. All four examples are significantly influenced by anomalies, so one might argue that one should continue with outlier-adjusted or intervention-scrubbed series, i.e. "cleansed series". For presentation purposes this was not done here. We then estimated the "typical model" for each of the 4 data sets and for the composite [output image not available]. The F test is simply the Chow test for constant parameters (http://en.wikipedia.org/wiki/Chow_test), which yielded a significant F. The Chow test simply sums the error sums of squares from each of the 4 cases (each with 52 values) and compares the total to the error sum of squares for the composite (208 values). Hope this helps.
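The comparison of sums of squares can be sketched as follows. This is the textbook multi-group form of the Chow F statistic, not AUTOBOX's implementation, and it assumes every group model has the same number of estimated parameters; the function name and argument names are hypothetical. For ARMAX models the F reference distribution is at best approximate.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Chow-type F statistic for the hypothesis of one common parameter set.
//   group_sse          error sum of squares of each separately fitted model
//   pooled_sse         error sum of squares of the single composite model
//   params_per_model   number of estimated parameters in each model (assumed equal)
//   total_observations number of observations across all groups
// Numerator df: (groups - 1) * params; denominator df: N - groups * params.
double chow_f_statistic(const std::vector<double>& group_sse,
                        double pooled_sse,
                        std::size_t params_per_model,
                        std::size_t total_observations)
{
    const std::size_t groups = group_sse.size();
    assert(groups >= 2 && params_per_model >= 1);
    assert(total_observations > groups * params_per_model);

    const double separate_sse =
        std::accumulate(group_sse.begin(), group_sse.end(), 0.0);
    const double df_numerator =
        static_cast<double>((groups - 1) * params_per_model);
    const double df_denominator =
        static_cast<double>(total_observations - groups * params_per_model);

    return ((pooled_sse - separate_sse) / df_numerator) /
           (separate_sse / df_denominator);
}
```

With made-up numbers (not taken from the analysis above), four groups with error sums of squares 10, 12, 8 and 10, a pooled value of 52, 3 parameters per model and 208 observations give F = (12 / 9) / (40 / 196), roughly 6.53, to be compared with an F distribution on 9 and 196 degrees of freedom.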
