Correlation Analysis – Differences in Correlation for Individual vs Aggregated Data

aggregation, correlation, feature-selection, pearson-r

I have a sample of 1 million articles from the web with various features. I'm in the process of selecting features to use in a metric/predictor for article quality. To get some insight into the data and which features may work best, I computed correlations between the features.

The following issue occurred: for the features A = "article views" and B = "thumbs up count", the correlation is 0.32 (Pearson) or 0.26 (Spearman). Intuition suggests that there is indeed a correlation. The features themselves look exponentially distributed to me (very many small values, very few large values). I wanted a graphical view of the correlation, but the scatter plot did not reveal anything, let alone a linear association.
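For reference, a minimal R sketch of these checks (the data frame `articles` and its columns `views` and `thumbs_up` are hypothetical names):

    # Pearson and Spearman correlations between the two features
    cor(articles$views, articles$thumbs_up, method = "pearson")   # 0.32 reported above
    cor(articles$views, articles$thumbs_up, method = "spearman")  # 0.26 reported above
    plot(articles$views, articles$thumbs_up)   # scatter plot: no visible association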

So I aggregated the data as follows:

  1. Order all data points by A (article views).
  2. Divide the list into n=100 equally big chunks.
  3. Compute sum(A) and sum(B) for all chunks.
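
A minimal dplyr sketch of this aggregation (assuming a data frame `articles` with the hypothetical columns `views` for A and `thumbs_up` for B):

    library(dplyr)

    agg <- articles %>%
      mutate(chunk = ntile(views, 100)) %>%    # steps 1-2: rank by A, 100 equal-sized chunks
      group_by(chunk) %>%
      summarise(sum_A = sum(views),            # step 3: per-chunk sums
                sum_B = sum(thumbs_up))

    plot(agg$sum_A, agg$sum_B)   # the 100 value pairs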

Now when I plot the 100 value pairs, sum(B) over sum(A), it shows an almost perfectly straight line (except for a minor aberration at the beginning)! The Pearson correlation is almost 1.

What does this show / What should I make of this?

Does this mean that there is a strong dependency between A and B "in general", but that for individual articles there is "noise"? Could it have something to do with the ecological fallacy? Would you suggest a different way of exploring the association between these variables?

Best Answer

The issue is with the binning. When you order the variable $A$ by size, divide it into 100 equal bins, and then sum the data within the bins, you introduce an ordering: the bins at the beginning will have lower sums and the bins at the end higher sums. This is perfectly normal, because that is how the bins were constructed.

Here is a simple simulation for illustration.

Generate 1 million random values from an exponential distribution:

    library(dplyr)
    a <- rexp(1e6)   # 1 million draws from an Exp(1) distribution

[Figure: plot of the raw values of a, which looks completely random]

Divide into 100 equal-sized bins using quantiles:

    q <- quantile(a, seq(0, 1, length.out = 101))   # 101 boundaries for 100 bins
    q[1] <- 0      # widen the outer boundaries so that cut()
    q[101] <- Inf  # does not drop the minimum and maximum values
    bin <- cut(a, q)

Sum the values in the bins and plot them:

    dd <- tibble(a = a, bin = bin)   # data_frame() is deprecated; tibble() replaces it
    ee <- dd %>% group_by(bin) %>% summarise(a = sum(a))
    plot(q[-101], ee$a)              # per-bin sums against the lower bin boundaries

[Figure: per-bin sums of a against the bin boundaries, an almost perfect increasing relationship]

Compare the two graphs. The first is totally random, while in the second we see an almost perfect relationship, purely because of the way the bins were constructed.

Now if we add another variable that is correlated with the original one, this introduced order does not disappear.

    b <- rexp(1e6) + a/3   # new variable correlated with a, plus exponential noise
    plot(a, b)

[Figure: scatter plot of b against a, a noisy linear relationship]

Here we observe a linear relationship with a lot of noise, which is no surprise, because that is how the second variable was constructed.
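We can even quantify this: with $b$ constructed as above, $\text{cov}(a, b) = \text{var}(a)/3 = 1/3$ and $\text{var}(b) = 1 + 1/9$, so the theoretical correlation is $1/\sqrt{10} \approx 0.32$, the same order of magnitude as the Pearson correlation reported in the question:

    cor(a, b)   # close to 1/sqrt(10), i.e. about 0.32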

If we perform the same binning, the relationship appears much stronger:

    dd <- tibble(a = a, b = b, bin = bin)
    ee <- dd %>% group_by(bin) %>%
      summarise(across(a:b, sum))    # modern replacement for summarise_each(funs(sum), a:b)
    plot(ee$a, ee$b)
    cor(ee$a, ee$b)                  # close to 1, as in the question

[Figure: per-bin sums of b against per-bin sums of a, an almost perfectly straight line]

So the binning you performed accentuated the existing relationship, but this does not mean that the relationship is actually that strong.

Given that your data are article views and thumbs-up counts, it is natural to expect that articles with a high number of views tend to have more thumbs up. But this relationship is very noisy, as evidenced by your initial scatter plot of the data.

You should probably fit a regression to figure out the relationship and how strong it is.
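For example, a minimal sketch in R (again with the hypothetical column names from above); since both variables are heavy-tailed counts, fitting on the log scale, or using a count model such as a Poisson GLM, is usually more sensible than a plain linear fit on the raw values:

    # linear fit on log-transformed counts; log1p() handles zero counts
    fit <- lm(log1p(thumbs_up) ~ log1p(views), data = articles)
    summary(fit)   # slope, standard error and R^2 at the individual-article level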
