Solved – A reliable measure of series similarity – correlation just doesn’t cut it for me


I'm trying to programmatically compare one particular time series against 10,000+ reference time series, and shortlist the reference series that could be of interest.

The method I was using was Pearson correlation: for each reference series, I calculate its correlation coefficient with the given series, then sort the whole list of reference series in descending order of that coefficient. I then visually analyse the top N series with the highest coefficients, which should be the best matches to the given series.
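
For concreteness, here's a minimal sketch of that ranking step; the function names (pearson, rank_by_correlation) are mine, not from any library:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Pearson correlation coefficient of two equal-length series.
double pearson(const std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double mx = 0, my = 0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;

    double xx = 0, xy = 0, yy = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        xx += (x[i] - mx) * (x[i] - mx);
        xy += (x[i] - mx) * (y[i] - my);
        yy += (y[i] - my) * (y[i] - my);
    }
    return xy / std::sqrt(xx * yy);
}

// Score every reference series against the target and sort the indices
// so the most strongly correlated references come first.
std::vector<std::size_t> rank_by_correlation(
    const std::vector<double>& target,
    const std::vector<std::vector<double>>& refs)
{
    std::vector<std::pair<double, std::size_t>> scored;
    for (std::size_t i = 0; i < refs.size(); ++i)
        scored.emplace_back(pearson(target, refs[i]), i);

    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });

    std::vector<std::size_t> order;
    for (const auto& s : scored)
        order.push_back(s.second);
    return order;  // visually inspect the top N entries
}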

The trouble is that I wasn't getting reliable results: quite often the series in the top N didn't visually resemble the given time series at all. When I finally read the article linked below, I understood why: one can't use correlation alone to determine whether two time series are similar.

Anscombe's quartet

Now this is a problem with all matching algorithms that compute some sort of distance between two time series. For instance, the two reference series below are at the same distance from series A, yet one is obviously a better match than the other.

A => [1, 2, 3, 4, 5, 6, 7, 8,  9]
B1 => [1, 2, 3, 4, 5, 6, 7, 8, 12]
distance = sqrt(0+0+0+0+0+0+0+0+9) = 3
B2 => [0, 3, 2, 5, 4, 7, 6, 9,  8]
distance = sqrt(1+1+1+1+1+1+1+1+1) = 3

So my question is: is there a mathematical formula (like correlation) better suited to these kinds of situations, one that doesn't suffer from the problems described here?

Please feel free to ask for any further clarification or improve the question text if needed. Thanks! =)

EDIT:

Correlation results

@woodchips, @krystian:

The top row shows the last ten bars of USDCHF-Daily ending at the given date. The second row gives the top 3 results of method A used for correlation (explanation below). The last row shows the top 3 results of method B. I've used High-Low-Close prices for the correlation. The last image in each row is what I'd consider a "good match", the reason being that turning points in the series are more important to me; it's a coincidence that these last ones had the maximum correlation. But you can see that the second image in the last row shows only a very weak similarity, yet it still manages to sneak into the top 3. This is what disturbs me. Because of this behaviour I'm forced to visually assess each correlation and accept or discard it. Anscombe's quartet, too, emphasises that correlations need to be visually inspected. That's why I wanted to move away from correlation and explore other mathematical concepts for evaluating series similarity.

Method A appends the H, L and C data into one long series and correlates it against the given series (prepared the same way).
Method B correlates the H data with the reference H data, L with L, and C with C, then multiplies the three values to get a net correlation. Obviously this reduces the overall correlation value, but I feel it tends to refine the resulting matches.
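
To pin the two methods down, here is a rough sketch of how I compute them; the Bars struct and the method_a/method_b names are illustrative only, and pearson is the helper from the sketch above:

#include <vector>

// Pearson helper from the ranking sketch above.
double pearson(const std::vector<double>& x, const std::vector<double>& y);

// Hypothetical container for one window of High/Low/Close data.
struct Bars { std::vector<double> h, l, c; };

// Method A: append H, L and C into one long series each, correlate once.
double method_a(const Bars& given, const Bars& ref)
{
    std::vector<double> a(given.h), b(ref.h);
    a.insert(a.end(), given.l.begin(), given.l.end());
    a.insert(a.end(), given.c.begin(), given.c.end());
    b.insert(b.end(), ref.l.begin(), ref.l.end());
    b.insert(b.end(), ref.c.begin(), ref.c.end());
    return pearson(a, b);
}

// Method B: correlate H with H, L with L, C with C, then multiply the
// three coefficients into a single "net" correlation.
double method_b(const Bars& given, const Bars& ref)
{
    return pearson(given.h, ref.h)
         * pearson(given.l, ref.l)
         * pearson(given.c, ref.c);
}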

My apologies for responding so late; I was gathering data, coding the correlation, and making graphics for the explanation. The image above shows one of the rare occasions when the correlations are pretty much spot on. I'll make and share graphics for the cases where the resulting matches are highly misleading even though the correlation values are high.

@adambowen: you're spot on. I've actually implemented two different algorithms, correlation and dynamic time warping, to assess series similarity. For DTW I have to use the MSE, like you said. For correlation I can use both the MSE (in which case it equals the cost of DTW's diagonal route, without any warping) and the actual Pearson correlation formula; the images above resulted from using the Pearson formula. I'll look up the terms you mentioned in your post and report back soon. In actuality, I don't have two separate time series, just one series over 10,000 points long. I use a sliding window of width N to autocorrelate the series with its own history and locate the occasions when it behaved similarly to today. If I can find good matches, I may be able to forecast the movement of the present series from how it moved after each of the matches identified. Thanks for your insight.
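
For reference, a sketch of that sliding-window search, under the same assumptions as the earlier sketches (find_similar_windows is an illustrative name, and pearson is the helper defined above):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Pearson helper from the ranking sketch above.
double pearson(const std::vector<double>& x, const std::vector<double>& y);

// Slide a window of width N over the history and score each past window
// against the most recent N points.
std::vector<std::pair<double, std::size_t>> find_similar_windows(
    const std::vector<double>& series, std::size_t N)
{
    std::vector<std::pair<double, std::size_t>> matches;  // (corr, start)
    if (series.size() < 2 * N) return matches;

    const std::vector<double> today(series.end() - N, series.end());

    // Stop N points before "today" so each match has a future of at
    // least N points to base a forecast on.
    for (std::size_t start = 0; start + 2 * N <= series.size(); ++start)
    {
        const std::vector<double> window(series.begin() + start,
                                         series.begin() + start + N);
        matches.emplace_back(pearson(window, today), start);
    }

    // Best matches first.
    std::sort(matches.begin(), matches.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    return matches;
}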

Best Answer

The two most common methods (in my experience) for comparing signals are the correlation and the mean squared error (MSE). Informally, if you imagine each signal as a point in some N-dimensional space (this tends to be easier if you picture them as 3D points), then the correlation measures whether the points lie in the same direction from the origin, while the mean squared error measures whether the points are in the same place (independent of where the origin is, as long as both signals use the same one). Which works better depends somewhat on the types of signal and noise in your system.

The MSE is essentially the measure in your example: your distance is the square root of the summed squared differences, and the MSE just divides that sum by N, so both rank any pair of equal-length series the same way:

double mse = 0.0;
for( int i=0; i<N; ++i )
    mse += (x[i]-y[i])*(x[i]-y[i]);  // accumulate squared differences
mse /= N;                            // average over the N samples

Note, however, that this isn't really the Pearson correlation, which would be more like:

double xx = 0.0;
double xy = 0.0;
double yy = 0.0;

for( int i=0; i<N; ++i )
{
    xx += (x[i]-x_mean)*(x[i]-x_mean);  // spread of x about its mean
    xy += (x[i]-x_mean)*(y[i]-y_mean);  // co-variation of x and y
    yy += (y[i]-y_mean)*(y[i]-y_mean);  // spread of y about its mean
}

double ppmcc = xy/std::sqrt(xx*yy);  // needs <cmath>

given the signal means x_mean and y_mean. This is fairly close to the pure correlation:

double corr = 0.0;  // raw (uncentred, unnormalised) correlation
for( int i=0; i<N; ++i )
    corr += x[i]*y[i];

However, I think the Pearson correlation will be more robust when the signals have a strong DC component, because the mean is subtracted; and since it is normalised, a scaling of one of the signals will not cause a proportional change in the correlation.
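
As a quick sanity check of that claim, correlating a signal against a scaled, DC-shifted copy of itself still gives exactly 1, while the raw correlation above would change. This toy program just wraps the ppmcc computation into a helper (the name pearson is mine):

#include <cmath>
#include <cstdio>
#include <vector>

// The ppmcc computation above, wrapped into a helper for this check.
double pearson(const std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double x_mean = 0, y_mean = 0;
    for (std::size_t i = 0; i < n; ++i) { x_mean += x[i]; y_mean += y[i]; }
    x_mean /= n; y_mean /= n;

    double xx = 0, xy = 0, yy = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        xx += (x[i] - x_mean) * (x[i] - x_mean);
        xy += (x[i] - x_mean) * (y[i] - y_mean);
        yy += (y[i] - y_mean) * (y[i] - y_mean);
    }
    return xy / std::sqrt(xx * yy);
}

int main()
{
    std::vector<double> x = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = 2.0 * x[i] + 5.0;  // scaled and DC-shifted copy of x

    // Prints 1: Pearson is unaffected by the scale and the DC offset.
    std::printf("ppmcc(x, 2x+5) = %f\n", pearson(x, y));
    return 0;
}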

Finally, if the particular example in your question is a problem then you could also consider the mean absolute error (L1 norm):

mae = 0;
for( int i=0; i<N; ++i )
    mae += std::abs(x[i]-y[i]);
mae /= N;

I'm aware of all three approaches being used in various signal and image processing applications; without knowing more about your particular application I couldn't say which is likely to work best. I would note that the MAE and the MSE are less sensitive to exactly how the data are presented to them, but if the mean error is not really the metric you're interested in then they won't give you the results you're looking for. The correlation approaches can be better if you're more interested in the "direction" of your signal than in the actual values involved, but they are more sensitive to how the data are presented and almost certainly require some centring and normalisation to give the results you expect.

You might want to look up Phase Correlation, Cross Correlation, Normalised Correlation and Matched Filters. Most of these are used to match a sub-signal within a larger signal under some unknown time lag; in your case you could simply take the value they give at zero lag, if you know there is no lag between the two signals.
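
As a rough illustration of those lagged variants, a normalised cross correlation evaluated over a range of integer lags could look like the sketch below (the function name is mine, and this is only one plausible formulation); the entry at index max_lag is the zero-lag value mentioned above:

#include <algorithm>
#include <cmath>
#include <vector>

// Normalised cross correlation of x against y at integer lags in
// [-max_lag, +max_lag]; assumes max_lag is smaller than the signal length.
std::vector<double> normalised_xcorr(const std::vector<double>& x,
                                     const std::vector<double>& y,
                                     int max_lag)
{
    const int n = static_cast<int>(std::min(x.size(), y.size()));
    std::vector<double> out;
    out.reserve(2 * max_lag + 1);

    for (int lag = -max_lag; lag <= max_lag; ++lag)
    {
        double xy = 0, xx = 0, yy = 0;
        for (int i = 0; i < n; ++i)
        {
            const int j = i + lag;
            if (j < 0 || j >= n)
                continue;  // skip samples shifted outside the overlap
            xy += x[i] * y[j];
            xx += x[i] * x[i];
            yy += y[j] * y[j];
        }
        // Normalise by the energy of the overlapping parts.
        out.push_back(xy / std::sqrt(xx * yy));
    }
    return out;  // out[max_lag] is the zero-lag correlation
}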
