Correlation Analysis – Differences in Correlation for Individual vs Aggregated Data

aggregation, correlation, feature-selection, pearson-r

I have a sample of 1 million articles from the web with various features. I'm in the process of selecting features to use in a metric/predictor for article quality. To get some insight into the data and which features may work best, I computed correlations between the features.

The following issue occurred: for the features A = "article views" and B = "thumbs up count", the correlation is 0.32 (Pearson) or 0.26 (Spearman). Intuition suggests that there is indeed a correlation. The features themselves look exponentially distributed to me (very many small values, very few large values). I wanted a graphical view of the correlation, but the scatter plot did not reveal anything, let alone a linear association.
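For reference, a minimal R sketch of these checks (the data frame `articles` and its columns `views` and `thumbs_up` are hypothetical names):

    # Pearson and Spearman correlations between the two features
    cor(articles$views, articles$thumbs_up, method = "pearson")   # 0.32 reported above
    cor(articles$views, articles$thumbs_up, method = "spearman")  # 0.26 reported above
    plot(articles$views, articles$thumbs_up)   # scatter plot: no visible association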

So I aggregated the data as follows:

  1. Order all data points by A (article views).
  2. Divide the list into n=100 equally big chunks.
  3. Compute sum(A) and sum(B) for all chunks.
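
A minimal dplyr sketch of this aggregation (assuming a data frame `articles` with the hypothetical columns `views` for A and `thumbs_up` for B):

    library(dplyr)

    agg <- articles %>%
      mutate(chunk = ntile(views, 100)) %>%    # steps 1-2: rank by A, 100 equal-sized chunks
      group_by(chunk) %>%
      summarise(sum_A = sum(views),            # step 3: per-chunk sums
                sum_B = sum(thumbs_up))

    plot(agg$sum_A, agg$sum_B)   # the 100 value pairs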

Now when I plot the 100 value pairs, sum(B) over sum(A), it shows an almost perfectly straight line (except for a minor aberration at the beginning)! The Pearson correlation is almost 1.

What does this show / What should I make of this?

Does this mean that there is a strong dependency between A and B "in general", but that for individual articles there is "noise"? Could it have something to do with the ecological fallacy? Would you suggest a different way of exploring the association between these variables?

Best Answer

The issue is with the binning. When you order the variable $A$ by size, divide it into 100 equal bins, and then sum the data within the bins, you introduce an ordering: the bins at the beginning will have lower sums and the bins at the end higher sums. This is perfectly normal, because that is how the bins were constructed.

Here is a simple simulation for illustration.

Generate 1 million random values from an exponential distribution:

    library(dplyr)
    a <- rexp(1e6)   # 1 million draws from an Exp(1) distribution

[Figure: plot of the raw values of a, which looks completely random]

Divide into 100 equal-sized bins using quantiles:

    q <- quantile(a, seq(0, 1, length.out = 101))   # 101 boundaries for 100 bins
    q[1] <- 0      # widen the outer boundaries so that cut()
    q[101] <- Inf  # does not drop the minimum and maximum values
    bin <- cut(a, q)

Sum the values in the bins and plot them:

    dd <- tibble(a = a, bin = bin)   # data_frame() is deprecated; tibble() replaces it
    ee <- dd %>% group_by(bin) %>% summarise(a = sum(a))
    plot(q[-101], ee$a)              # per-bin sums against the lower bin boundaries

[Figure: per-bin sums of a against the bin boundaries, an almost perfect increasing relationship]

Compare the two graphs. The first is totally random, while in the second we see an almost perfect relationship, purely because of the way the bins were constructed.

Now if we add another variable that is correlated with the original one, this introduced order does not disappear.

    b <- rexp(1e6) + a/3   # new variable correlated with a, plus exponential noise
    plot(a, b)

[Figure: scatter plot of b against a, a noisy linear relationship]

Here we observe a linear relationship with a lot of noise, which is no surprise, because that is how the second variable was constructed.
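We can even quantify this: with $b$ constructed as above, $\text{cov}(a, b) = \text{var}(a)/3 = 1/3$ and $\text{var}(b) = 1 + 1/9$, so the theoretical correlation is $1/\sqrt{10} \approx 0.32$, the same order of magnitude as the Pearson correlation reported in the question:

    cor(a, b)   # close to 1/sqrt(10), i.e. about 0.32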

If we perform the same binning, the relationship appears much stronger:

    dd <- tibble(a = a, b = b, bin = bin)
    ee <- dd %>% group_by(bin) %>%
      summarise(across(a:b, sum))    # modern replacement for summarise_each(funs(sum), a:b)
    plot(ee$a, ee$b)
    cor(ee$a, ee$b)                  # close to 1, as in the question

[Figure: per-bin sums of b against per-bin sums of a, an almost perfectly straight line]

So the binning you performed accentuated the existing relationship, but this does not mean that the relationship is actually that strong.

Given that your data are article views and thumbs-up counts, it is natural to expect that articles with a high number of views tend to have more thumbs up. But this relationship is very noisy, as evidenced by your initial scatter plot of the data.

You should probably fit a regression to figure out the relationship and how strong it is.
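For example, a minimal sketch in R (again with the hypothetical column names from above); since both variables are heavy-tailed counts, fitting on the log scale, or using a count model such as a Poisson GLM, is usually more sensible than a plain linear fit on the raw values:

    # linear fit on log-transformed counts; log1p() handles zero counts
    fit <- lm(log1p(thumbs_up) ~ log1p(views), data = articles)
    summary(fit)   # slope, standard error and R^2 at the individual-article level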
