Regression – How to Average Correlation Values Effectively?


Let's say I test how variable Y depends on variable X under different experimental conditions and obtain the following graph:

The dashed lines in the graph above show the linear regression fit for each data series (experimental setup), and the numbers in the legend are the Pearson correlation of each data series.

I would like to calculate the "average correlation" (or "mean correlation") between X and Y. May I simply average the r values? What about the "average coefficient of determination", $R^2$? Should I calculate the average r and then square that value, or should I compute the average of the individual $R^2$ values?

Best Answer

The simple way is to add a categorical variable $z$ that identifies the experimental condition and include it in your model along with its "interaction" with $x$; that is, $y \sim z + x\#z$. This gives each condition its own intercept and slope, so it conducts all five regressions at once. The $R^2$ of this combined model is what you want.
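As a rough illustration of the combined model, here is a minimal NumPy sketch. The function name `combined_r2` and the toy data are made up for this example, and it assumes ordinary least squares with $R^2$ measured against the grand mean of $y$; a statistics package's formula interface would normally do the same job.

```python
import numpy as np

def combined_r2(x, y, z):
    """R^2 of one least-squares fit with a separate intercept and slope per level of z."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z)
    levels = np.unique(z)
    # Indicator (dummy) columns: one per experimental condition.
    dummies = (z[:, None] == levels[None, :]).astype(float)
    # Design matrix: per-condition intercepts, then per-condition slopes (the x#z terms).
    design = np.hstack([dummies, dummies * x[:, None]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Hypothetical data: two conditions, each perfectly linear, with opposite slopes.
x = [0, 1, 2, 0, 1, 2]
y = [0, 1, 2, 5, 4, 3]
z = ["a", "a", "a", "b", "b", "b"]
r2 = combined_r2(x, y, z)
print(round(r2, 6))
```

Because the total sum of squares is taken about the grand mean, this $R^2$ also credits differences in level between conditions, not only the within-condition slopes.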

To see why averaging the individual $r$ values may be wrong, suppose the direction of the slope is reversed in some of the experimental conditions. You would be averaging a bunch of 1's and -1's to get something near 0, which would not reflect the quality of any of the fits. To see why averaging $R^2$ (or any fixed transformation thereof) is not right, suppose that in most experimental conditions you had only two observations, so that each $R^2$ equals $1$ (a line always fits two points exactly), but in one experiment you had a hundred observations with $R^2=0$. The average $R^2$ of almost 1 would not correctly reflect the situation.
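Both counterexamples can be checked numerically. The sketch below uses invented data (nine two-point conditions and one 100-point condition, chosen only for illustration) and relies on the fact that, for a simple linear regression, $R^2$ is the square of the Pearson correlation.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Case 1: two perfect fits with opposite slopes.
x = np.array([0.0, 1.0, 2.0])
r_up = pearson_r(x, 2 * x)      # increasing line
r_down = pearson_r(x, 5 - x)    # decreasing line
mean_r = (r_up + r_down) / 2    # close to 0 although both fits are perfect

# Case 2: nine conditions with only two observations each (R^2 = 1),
# plus one condition with a hundred observations and R^2 = 0.
two_point_r2 = [pearson_r([0.0, 1.0], [k, k + 1.0]) ** 2 for k in range(9)]
x_big = np.arange(100.0)
y_big = (x_big - x_big.mean()) ** 2  # symmetric about the mean of x, so r is 0
big_r2 = pearson_r(x_big, y_big) ** 2
mean_r2 = float(np.mean(two_point_r2 + [big_r2]))  # close to 0.9

print(mean_r, mean_r2)
```

The unweighted mean of $R^2$ lands near 0.9 here even though 100 of the 118 observations come from the condition with no linear relationship at all.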

