Solved – Why is the correlation coefficient the slope of the regression line

correlation, regression

I know that if we have a two-dimensional data set, convert the points to standard units (subtract the mean, and divide by the standard deviation), and then do simple linear regression, then the slope of the resulting linear regression line is equal to the correlation coefficient. Or, to put it another way, the slope of the regression line (in original units) is given by $b = r \sigma_y / \sigma_x$, where $\sigma_x$ is the standard deviation of the $x$-values and $\sigma_y$ is the standard deviation of the $y$-values.

Why is that? Is there some way to get some intuition for why this should be? I'm not looking for a mathematical derivation; I'd prefer something that focuses on intuition and that will make sense to someone new to the topic.

Best Answer

Basically, a correlation coefficient measures how tightly two variables cluster around a line of best fit. It is built from the covariance, rescaled by the two standard deviations.

Regression also finds the line of best fit, typically by least squares. It just so happens that, with a single predictor and both variables in standard units, the least squares slope and the correlation coefficient are the same number.
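For anyone who does want to see the one-line identity behind this, here is a sketch using the same symbols as the question ($b$, $r$, $\sigma_x$, $\sigma_y$); $\operatorname{cov}(x, y)$ is my notation for the covariance and does not appear in the question. The least squares slope is the covariance divided by the variance of $x$, and the correlation is the covariance divided by both standard deviations, so:

$$
b = \frac{\operatorname{cov}(x, y)}{\sigma_x^2}
  = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} \cdot \frac{\sigma_y}{\sigma_x}
  = r \, \frac{\sigma_y}{\sigma_x}
$$

In standard units $\sigma_x = \sigma_y = 1$, so the factor $\sigma_y / \sigma_x$ drops out and $b = r$.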

So both are built on the same idea: minimizing the squared (vertical) distance between each point and the line of best fit. And in this particular case the two approaches are mathematically identical. But regression can be extended to many predictors, whereas a correlation coefficient is only defined between two variables.
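If it helps to see this numerically, here is a minimal sketch in plain Python. The data points and the helper names (`pearson_r`, `ols_slope`, `standardize`) are made up for illustration; they are not from the question. It fits the least squares slope on the raw data and on the standardized data, and compares both with the correlation coefficient.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Correlation: covariance divided by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def ols_slope(xs, ys):
    """Least squares slope of y on x: sum of cross-products over sum of squares."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def standardize(values):
    """Convert to standard units: subtract the mean, divide by the SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 5.2, 5.8, 8.1, 8.4]

r = pearson_r(xs, ys)
slope_raw = ols_slope(xs, ys)                            # slope in original units
slope_std = ols_slope(standardize(xs), standardize(ys))  # slope in standard units

print("r                     =", r)
print("slope, standard units =", slope_std)
print("slope, original units =", slope_raw)
print("r * sd_y / sd_x       =", r * pstdev(ys) / pstdev(xs))
```

The first two printed values should agree up to floating-point error, and so should the last two. I used the population standard deviation (`pstdev`) throughout; the sample version works just as well as long as the same one is used for both variables.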

Another way to think about it is that the slope represents how one variable changes as you increase the other. And this is what you're looking for in both correlation and regression. You're looking for how changes in one variable are associated with changes in another.

Edit: think about it this way. If slope = 0, as you increase one variable, the other variable doesn't change at all. This means no relationship.

On the other hand, if slope = 1, as you increase the first variable by one unit you increase the other by one unit as well. This means that the variables are related. If the slope = 10,000, then if you increase one variable by 1, the other one increases by 10,000. This is a very steep relationship! (That can only happen in original units; in standard units the slope is $r$, which never exceeds 1 in magnitude.)

But any non-zero slope can indicate a strong relationship if the line of best fit fits the data well. If the data are scattered and not close to the line of best fit, the slope may be large but fit the data so badly that we don't really trust it. The significance tests for the correlation coefficient and for the regression slope are both testing whether the slope is non-zero and fits the data well enough to "trust".
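To make the difference between a steep slope and a good fit concrete, here is a small sketch in plain Python. Both data sets are invented for illustration, and `slope_and_r` is a hypothetical helper, not something from a library. The first set has a steep slope but points scattered far from the line; the second has a shallow slope but points sitting almost exactly on the line.

```python
from math import sqrt
from statistics import mean

def slope_and_r(xs, ys):
    """Return (least squares slope of y on x, correlation coefficient)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]

# Invented data: large swings in y, only loosely following a line.
ys_steep_noisy = [0, 300, -100, 500, 100, 400]

# Invented data: tiny changes in y, but almost perfectly on a line.
ys_shallow_tight = [0.10, 0.21, 0.29, 0.41, 0.50, 0.59]

slope_a, r_a = slope_and_r(xs, ys_steep_noisy)
slope_b, r_b = slope_and_r(xs, ys_shallow_tight)

print("steep but noisy  : slope =", slope_a, " r =", r_a)
print("shallow but tight: slope =", slope_b, " r =", r_b)
```

The noisy set should come out with the much larger slope (roughly 57) but the much smaller correlation (roughly 0.45), while the tight set has a slope near 0.1 and a correlation very close to 1. The slope in original units tells you how steep the line is; the correlation tells you how well the line fits.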
