Solved – Pearson correlation after aggregation

aggregation, correlation, pearson-r

The following table shows values of a variable Y observed in five people; I know the age of each person.

AGE   Y
10    50
10    29
20    30
20    33
30    15

If I compute the Pearson correlation between Y and age, I get -0.7792, so Y seems to be negatively correlated with age. If I first aggregate the data by age, averaging Y within each age:

AGE   Y
10    39.5
20    31.5
30    15

The correlation changes to -0.9805.
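Both figures can be reproduced with a few lines of plain Python. This is only a sketch: the helper name `pearson` and the variable names are illustrative, and the aggregation is an unweighted mean of Y within each age, as in the table above.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Individual-level data from the question.
ages = [10, 10, 20, 20, 30]
y = [50, 29, 30, 33, 15]

# Aggregate: average Y within each distinct age.
groups = defaultdict(list)
for a, v in zip(ages, y):
    groups[a].append(v)
agg_ages = sorted(groups)
agg_y = [sum(groups[a]) / len(groups[a]) for a in agg_ages]

r_raw = pearson(ages, y)          # five individual points
r_agg = pearson(agg_ages, agg_y)  # three averaged points

print(round(r_raw, 4), round(r_agg, 4))  # -0.7792 -0.9805
```

Note that the aggregated correlation treats the three averages as equally important, even though the age-30 group contains a single person.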

In the real example I am working on (5k data points), the change is even bigger, from -0.19 to -0.69, so aggregating the data completely changes the interpretation of the study. My questions are:

1) How do you interpret this huge difference?

2) Does measuring correlation on aggregated data make sense in this case? And if not, since sometimes we don't have access to the individual data points but only to the aggregated (averaged) data, what conclusions could we draw from a correlation analysis?

I am reading the paper "The Effects of Data Aggregation in Statistical Analysis" (http://onlinelibrary.wiley.com/doi/10.1111/j.1538-4632.1976.tb00549.x/pdf), but my questions are still unanswered.

Best Answer

Mostly copied from Nick Cox's comments: You could collapse further by whether people are children or adults (cut at 18). The data then reduce to 2 points, and the correlation is $-$1. In broad terms, averaging reduces scatter and increases the magnitude of the correlation. There can be exceptions: you can always imagine groups such that, after averaging, the correlation is nearer 0. So there is no theorem here, but it is a broad tendency.
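The progression from individuals, to age groups, to child versus adult can be checked directly. The following is a minimal sketch using the question's five data points; the helper names `pearson` and `collapse` are made up for illustration, and each group is represented by its mean age and mean Y.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def collapse(ages, ys, key):
    """Average age and Y within each group labelled by key(age)."""
    groups = {}
    for a, v in zip(ages, ys):
        groups.setdefault(key(a), []).append((a, v))
    mean_age, mean_y = [], []
    for label in sorted(groups):
        rows = groups[label]
        mean_age.append(sum(a for a, _ in rows) / len(rows))
        mean_y.append(sum(v for _, v in rows) / len(rows))
    return mean_age, mean_y

ages = [10, 10, 20, 20, 30]
y = [50, 29, 30, 33, 15]

by_age = collapse(ages, y, key=lambda a: a)          # 3 points
by_adult = collapse(ages, y, key=lambda a: a >= 18)  # 2 points: child, adult

r_individual = pearson(ages, y)
r_by_age = pearson(*by_age)
r_by_adult = pearson(*by_adult)

# The magnitude grows as the grouping gets coarser; with only two
# points the correlation is -1 up to floating-point rounding.
print(r_individual, r_by_age, r_by_adult)
```

With two points, any non-degenerate pair lies exactly on a line, so the value of $-$1 says nothing about how strongly Y depends on age for individuals.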

A moral is to keep thinking of patterns on scatter plots and to remember that correlation is an automaton answering one question only: how close are these points to a summary line? (Two questions, in that sign ($+$ or $-$) is reported too.)
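The "averaging reduces scatter" tendency can be made exact in one special case, offered here as a sketch under stated assumptions rather than a general theorem. Assume every person in group $g$ has the same age $x_g$, and that the aggregated correlation weights each group mean $\bar{y}_g$ by its group size $n_g$. The symbols $r_{\text{agg}}$, $\operatorname{Var}_{\text{between}}$ and $\operatorname{Var}_{\text{within}}$ are notation introduced for this illustration. Because age does not vary within a group, the covariance is unchanged by averaging, while the variance of Y loses its within-group part:

$$
\operatorname{Cov}(x, \bar{y}_g) = \operatorname{Cov}(x, y), \qquad
\operatorname{Var}(y) = \operatorname{Var}_{\text{between}}(y) + \operatorname{Var}_{\text{within}}(y)
$$

$$
r_{\text{agg}} = \frac{\operatorname{Cov}(x, y)}{\sqrt{\operatorname{Var}(x)\,\operatorname{Var}_{\text{between}}(y)}}
= r \sqrt{\frac{\operatorname{Var}(y)}{\operatorname{Var}_{\text{between}}(y)}},
\qquad \text{so } |r_{\text{agg}}| \ge |r|.
$$

The more of the variation in Y that sits within groups, the larger the inflation. An unweighted correlation of group means, as computed in the question, or groups that are not defined by the x variable, fall outside these assumptions, which is where the exceptions can arise.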

You would want to look at the data and think about the relationship. You should not try to interpret correlations in the abstract; they aren't magic numbers with meaning deeper than the relationship they summarize (or misrepresent). Also, and even more important, what level of analysis is important to you? If it is what is going on at the individual level, then averaging is an irrelevant distraction. If you don't have the individual data, you should use the finest subdivision possible and flag the difficulties.
