Multiple Linear Regression – Can $R^2$ be Calculated from Correlation Coefficients?

Tags: correlation, r-squared, regression

In simple linear regression, $R^2$ is equivalent to the squared correlation of a dependent and an independent variable. Is this also true for multiple linear regression?

For example, I measured trait openness to predict creativity in a simple linear regression. If I square the measured correlation between the two, I get the coefficient of determination.

Now suppose I have measured the traits extraversion, openness and intellect to predict creativity in a multiple linear regression. Can I take those observed correlations, square them and add them up, and get the coefficient of determination for this kind of regression too?

Best Answer

The coefficient-of-determination can be determined from the correlations: Consider a multiple linear regression with $m$ explanatory vectors and an intercept term. First, we define the correlations among all the variables in the problem: $r_i = \operatorname{Corr}(\mathbf{y},\mathbf{x}_i)$ and $r_{i,j} = \operatorname{Corr}(\mathbf{x}_i,\mathbf{x}_j)$. Now define the goodness-of-fit vector and design correlation matrix respectively by:

$$\boldsymbol{r}_{\mathbf{y},\mathbf{x}} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix} \quad \quad \quad \boldsymbol{r}_{\mathbf{x},\mathbf{x}} = \begin{bmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,m} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m,1} & r_{m,2} & \cdots & r_{m,m} \\ \end{bmatrix}.$$

The goodness-of-fit vector contains the correlations between the response vector and each of the explanatory vectors. The design correlation matrix contains the correlations between each pair of explanatory vectors. (Please note that these names are something I have made up, since neither matrix has a standard name that I am aware of. The first vector measures the goodness-of-fit of simple regressions on each of the individual explanatory vectors, which is why I use this name.) Now, with a bit of linear algebra it can be shown that the coefficient-of-determination for the multiple linear regression is given by the following quadratic form:

$$R^2 = \boldsymbol{r}_{\mathbf{y},\mathbf{x}}^\text{T} \boldsymbol{r}_{\mathbf{x},\mathbf{x}}^{-1} \boldsymbol{r}_{\mathbf{y},\mathbf{x}}.$$

This form for the coefficient-of-determination is not all that well-known to statistical practitioners, but it is a very useful result, and assists in framing the goodness-of-fit of the multiple linear regression in its most fundamental terms. The square-root of the coefficient of determination gives us the multiple correlation coefficient, which is a multivariate extension of the absolute correlation. In the special case where $m=1$ you get $R^2 = r_1^2$ so that the coefficient-of-determination is the square of the correlation between the response vector and the (single) explanatory variable.

As you can see, this form for the coefficient-of-determination for the multiple linear regression is framed fully in terms of correlations between the pairs of vectors going into the regression. This means that if you have a matrix of the pairwise correlations between all the vectors in the multiple regression (the response vector and each of the explanatory vectors) then you can directly determine the coefficient-of-determination without fitting the regression model. This result is more commonly presented in multivariate analysis (see e.g., Mardia, Kent and Bibby 1979, p. 168).
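To make this concrete, here is a minimal sketch in Python (using simulated data, so the variable names and numbers are purely illustrative) that computes the quadratic form above from the pairwise correlations alone and checks it against the $R^2$ of the fitted regression:

```python
import numpy as np

# A minimal sketch with simulated data: check that the quadratic form
# r_yx' r_xx^{-1} r_yx, built purely from pairwise correlations,
# equals the R^2 of the fitted multiple regression (with intercept).
rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
X[:, 1] += 0.5 * X[:, 0]            # make the explanatory variables correlated
y = X @ np.array([0.6, 0.3, -0.2]) + rng.normal(size=n)

# Goodness-of-fit vector and design correlation matrix
r_yx = np.array([np.corrcoef(y, X[:, i])[0, 1] for i in range(m)])
r_xx = np.corrcoef(X, rowvar=False)
R2_from_corr = r_yx @ np.linalg.solve(r_xx, r_yx)

# R^2 from actually fitting the regression
Z = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta
R2_from_fit = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(R2_from_corr, R2_from_fit)    # the two values agree up to rounding
```

The two printed values agree up to floating-point error, which is exactly the point: the coefficient-of-determination is a function of the pairwise correlations only, so no model fitting is needed to obtain it.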


The coefficient-of-determination is not generally equal to the sum of individual coefficients: In the case where all the explanatory vectors are uncorrelated with each other you get $\boldsymbol{r}_{\mathbf{x},\mathbf{x}} = \boldsymbol{I}$, which means that the above quadratic form reduces to $R^2 = \sum r_i^2$. However, this special case arises in practice essentially only when the explanatory variables are set by the researcher (e.g., in a designed experiment with orthogonal factors). Explanatory variables are not generally uncorrelated, and so the coefficient-of-determination must be obtained from the full quadratic form above.

It is also useful to note that the coefficient-of-determination in a multiple linear regression can be above or below the sum of the individual coefficients-of-determination for the corresponding simple linear regressions. Usually it is below this sum (since the total explanatory power is usually less than the sum of its parts), but sometimes it is above this sum, for example in the presence of suppressor variables.
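As a concrete numerical illustration of the latter possibility (the numbers here are hypothetical, chosen only to exhibit a suppression pattern), take $m = 2$ with $r_1 = 0.5$, $r_2 = 0$ and $r_{1,2} = -0.5$:

$$R^2 = \begin{bmatrix} 0.5 & 0 \end{bmatrix} \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 0.5 \\ 0 \end{bmatrix} = \frac{0.5^2}{1 - 0.5^2} = \tfrac{1}{3} \approx 0.333 > 0.25 = r_1^2 + r_2^2.$$

Here the second explanatory variable has zero marginal correlation with the response, yet including it increases $R^2$, because it removes variation in the first explanatory variable that is unrelated to the response.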
