Solved – Is it correct to use r squared instead of r for the correlation of two variables

r, r-squared, regression

I have seen many reports and software in which the coefficient of determination $R^2$ is used instead of $r$ when describing the correlation of two variables before doing linear regression.

I am clear about the meaning of both coefficients. In my opinion $R^2$ should only be used to evaluate the goodness of fit after linear regression.

It seems like some people mix up correlation and regression, but I have seen this so many times that I have started to hesitate.

Is there any reason to use $R^2$ to evaluate just correlation?

So far I have only found answers arguing the other way around:

http://www.win-vector.com/blog/2013/02/dont-use-correlation-to-track-prediction-performance/

Best Answer

There are two issues at play here: the mathematics of statistics, and the conventions of communicating statistics. You're right that it's unconventional to report $R^2$ for a correlation, at least in most fields. But there's nothing wrong with it mathematically.
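A quick sketch of the algebra for the simple-regression case below: the fitted slope is $b = r\, s_y / s_x$, so

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = \frac{b^2 \sum_i (x_i - \bar{x})^2}{\sum_i (y_i - \bar{y})^2} = \frac{r^2 \tfrac{s_y^2}{s_x^2}\,(n-1)\,s_x^2}{(n-1)\,s_y^2} = r^2.$$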

You can see this more clearly if you consider the case of simple univariate linear regression (a regression model with one continuous dependent variable and one continuous predictor). To demonstrate, I'll use the iris dataset, which comes built into R. Here are the first six lines:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

I can calculate the correlation between Sepal.Length and Sepal.Width:

> cor(iris$Sepal.Length, iris$Sepal.Width)
[1] -0.1175698

I'll square that correlation and save it as Rsq for comparison with the regression output.

> r <- cor(iris$Sepal.Length, iris$Sepal.Width)
> Rsq <- r^2

A simple linear regression predicting Sepal.Length from Sepal.Width:

> summary(lm(Sepal.Length ~ Sepal.Width, data = iris))

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5561 -0.6333 -0.1120  0.5579  2.2226 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.5262     0.4789   13.63   <2e-16 ***
Sepal.Width  -0.2234     0.1551   -1.44    0.152    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8251 on 148 degrees of freedom
Multiple R-squared:  0.01382,   Adjusted R-squared:  0.007159 
F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

> Rsq
[1] 0.01382265

Note that the Multiple R-squared statistic reported is exactly the same as the squared correlation between the two variables. Of course, this works just as well if you reverse which variable is the predictor and which is the outcome in the regression model:

> summary(lm(Sepal.Width ~ Sepal.Length, data = iris))

Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1095 -0.2454 -0.0167  0.2763  1.3338 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.41895    0.25356   13.48   <2e-16 ***
Sepal.Length -0.06188    0.04297   -1.44    0.152    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared:  0.01382,   Adjusted R-squared:  0.007159 
F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

When you have more than one predictor in a regression model, then $R^2$ is the squared multiple correlation instead of just the squared bivariate correlation. But the idea behind it is very much the same.
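If you want to check this yourself, here is a minimal sketch (using two iris predictors as an arbitrary example; I've omitted the output): compare the Multiple R-squared reported by summary() against the squared correlation between the fitted values and the observed outcome. The two numbers should match.

> fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
> # R-squared as reported by the regression summary
> summary(fit)$r.squared
> # Squared multiple correlation: correlation of fitted values with the outcome, squared
> cor(fitted(fit), iris$Sepal.Length)^2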

The conventions around reporting statistics often obscure how similar many of our tests and measures are; $r$ and $R^2$ are a great example of that.
