Regression – How to Average Correlation Values Effectively?

Tags: correlation, mean, regression

Let's say I test how variable Y depends on variable X under different experimental conditions and obtain the following graph:

[Figure: Y plotted against X for five experimental conditions, each series with its fitted regression line (dashed) and its Pearson correlation shown in the legend]

The dashed lines in the graph above represent the linear regression fit for each data series (experimental setup), and the numbers in the legend denote the Pearson correlation of each data series.

I would like to calculate the "average correlation" (or "mean correlation") between X and Y. May I simply average the $r$ values? And what about the "average coefficient of determination", $R^2$? Should I compute the average $r$ and then square that value, or should I average the individual $R^2$'s?

Best Answer

The simple way is to add a categorical variable $z$ to identify the different experimental conditions and include it in your model along with an "interaction" with $x$; that is, $y \sim z + x\#z$. This conducts all five regressions at once. Its $R^2$ is what you want.
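For concreteness, here is a minimal sketch of that pooled model in Python using statsmodels' formula interface. The toy DataFrame and its column names (`x`, `y`, `z`) are invented for illustration; substitute your own data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: x, y measurements under two conditions labeled in z
df = pd.DataFrame({
    "x": [0, 1, 2, 3, 0, 1, 2, 3],
    "y": [1.0, 2.1, 2.9, 4.2, 5.0, 4.1, 3.2, 1.9],
    "z": ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# y ~ z + x:z gives each condition its own intercept and slope,
# i.e. it runs the separate regressions jointly in one model.
model = smf.ols("y ~ C(z) + x:C(z)", data=df).fit()

print(model.rsquared)  # the single R^2 the answer recommends
```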

To see why averaging individual $r$ values may be wrong, suppose the direction of the slope is reversed in some of the experimental conditions. You would average a bunch of 1's and -1's out to around 0, which wouldn't reflect the quality of any of the fits. To see why averaging $R^2$ (or any fixed transformation thereof) is not right either, suppose that in most experimental conditions you had only two observations, so that their $R^2$ values all equal $1$, but in one experiment you had a hundred observations with $R^2=0$. The average $R^2$ of almost 1 would not correctly reflect the situation.
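A toy numerical sketch of the first pitfall (the data are invented, not from the question): two conditions with perfectly linear but oppositely sloped relationships average out to a correlation near zero, even though each individual fit is perfect.

```python
import numpy as np

x = np.arange(10.0)
y_up = 2 * x + 1     # condition 1: r = +1
y_down = -2 * x + 1  # condition 2: r = -1

r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

# Averages to ~0, which says nothing about the quality of either fit
print((r_up + r_down) / 2)
```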
