Do correlation or coefficient of determination relate to the percentage of values that fall along a regression line?

correlation, r-squared, regression

Correlation, $r$, is a measure of linear association between two variables. The coefficient of determination, $r^2$, is a measure of how much of the variability in one variable can be "explained by" variation in the other.

For example, if $r = 0.8$ is the correlation between two variables, then $r^2 = 0.64$. Hence, 64% of the variability in one can be explained by differences in the other. Right?

My question is, for the example stated, is either of the following statements correct?

  1. 64% of values fall along the regression line
  2. 80% of values fall along the regression line

Best Answer

The first part of this is basically correct, but more precisely, 64% of the variation is explained by the model. In a simple linear regression, Y ~ X, an $R^2$ of 0.64 means that 64% of the variation in Y is accounted for by the linear relationship between Y and X. It is possible to have a strong relationship with a very low $R^2$ if the relationship is strongly non-linear.
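As a rough illustration of the non-linear case, here is a minimal sketch using NumPy. The data and the helper name `r_squared` are made up for this example: Y is an exact function of X (a parabola), yet a straight-line fit explains almost none of the variation.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line y ~ x, computed as 1 - SS_res / SS_tot."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # variation left unexplained by the line
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation in y
    return 1.0 - ss_res / ss_tot

# Hypothetical data: y is completely determined by x, but not linearly.
x = np.linspace(-3, 3, 61)   # symmetric around zero
y = x ** 2

r = np.corrcoef(x, y)[0, 1]          # Pearson correlation
r2_nonlinear = r_squared(x, y)       # R^2 of the straight-line fit

print(f"r = {r:.6f}, r^2 = {r**2:.6f}, R^2 = {r2_nonlinear:.6f}")
```

Both `r**2` and the regression $R^2$ come out at essentially zero here, and they agree with each other, as they should for a simple linear regression with an intercept.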

Regarding your two numbered statements, neither is correct. Indeed, it is possible that none of the points lies exactly on the regression line. That's not what's being measured. Rather, it is a question of how close the points are, on average, to the line. If all or nearly all points are close (even if none are exactly on the line), then $R^2$ will be high. If most points are far from the line, $R^2$ will be low. If most points are close but a few are far, then the fit may be misleading (the problem of outliers). Other things can go wrong, too.
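To make the "none of the points on the line" case concrete, here is a small sketch with invented data: ten points that alternate 0.5 above and below the line $y = 2x$. The tolerance used to decide whether a point is "on" the line is an arbitrary choice for this example.

```python
import numpy as np

# Hypothetical data: points alternate 0.5 above and 0.5 below y = 2x.
x = np.arange(10, dtype=float)
offsets = np.where(np.arange(10) % 2 == 0, 0.5, -0.5)
y = 2.0 * x + offsets

# Least-squares line and its residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# R^2 = 1 - SS_res / SS_tot
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Count points lying on the fitted line (up to a small numerical tolerance).
points_on_line = int(np.sum(np.isclose(residuals, 0.0, atol=1e-9)))

print(f"R^2 = {r2:.4f}, points on the line: {points_on_line} of {len(x)}")
```

Here $R^2$ is above 0.99 even though 0% of the points fall on the fitted line, so neither $r$ nor $r^2$ can be read as a percentage of points on the line.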

In addition, I've left the notion of "far" rather vague. This will depend on how spread out the X's are. Making these notions precise is part of what you learn in a course on regression; I won't get into it here.
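The dependence on the spread of the X's can be sketched as follows. This is an illustration with invented data, not a formal treatment: the same scatter (alternating errors of +1 and -1) is placed around the same line $y = x$, once over a narrow X range and once over a wide one.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line y ~ x, computed as 1 - SS_res / SS_tot."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Identical scatter in both cases: errors alternate between +1 and -1.
errors = np.tile([1.0, -1.0], 5)

# Same underlying line (y = x), different spread in x.
x_narrow = np.linspace(0.0, 1.0, 10)
x_wide = np.linspace(0.0, 100.0, 10)

r2_narrow = r_squared(x_narrow, x_narrow + errors)
r2_wide = r_squared(x_wide, x_wide + errors)

print(f"narrow x range: R^2 = {r2_narrow:.3f}")
print(f"wide x range:   R^2 = {r2_wide:.3f}")
```

The points are equally "far" from the line in absolute terms in both cases, but $R^2$ is close to 0 for the narrow range and close to 1 for the wide one, because the same errors are small relative to the total variation in Y only when the X's are spread out.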