Multiple Regression – How $R^2$ Represents Total Explained Variance with Unique Contributions

linear model, multiple regression, sums-of-squares

Background

In regression analysis, $R^2$, the squared multiple correlation, represents the proportion of variance explained by the regression model. Most software uses Type-III sums of squares (SS) by default, as has been explained and illustrated excellently in other CV questions (e.g. here and here). This effectively means that 'overlapping' explained variance is discarded: if two predictors both explain the same variance in the criterion, the software cannot know which predictor that explained variance 'belongs to', so it is discarded from the model.

Because of these "Type-III SS" dynamics, each regression coefficient represents the unique contribution of each predictor to the explanation of the variance in the dependent variable.
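To make this concrete, here is a minimal sketch with simulated data and plain numpy (all names are illustrative, not from any particular package): a predictor's unique share can be read off as the drop in $R^2$ when that predictor is removed from the full model, which is the quantity a Type-III test of that predictor targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)        # x1 and x2 overlap (are correlated)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

def r_squared(y, predictors):
    """R^2 of an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_full = r_squared(y, [x1, x2])
unique_x1 = r2_full - r_squared(y, [x2])        # unique share of x1
unique_x2 = r2_full - r_squared(y, [x1])        # unique share of x2
shared = r2_full - unique_x1 - unique_x2        # the overlapping ('shared') part
print(r2_full, unique_x1, unique_x2, shared)
```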

Using these regression coefficients, it is possible to construct the regression equation:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

In this equation, $\hat{y}$ represents the predicted value of the dependent variable ($y$), $x_1$ and $x_2$ represent two predictors, $\beta_0$ represents the intercept (the predicted value of $y$ if both predictors are $0$), and $\beta_1$ and $\beta_2$ represent the regression coefficients of both predictors.

$\beta_1$ and $\beta_2$ only represent the unique contribution of each predictor to the prediction of $\hat{y}$ – the overlapping variance between these two predictors has been removed.

One can complete this equation for each observation and store the resulting predicted ($\hat{y}$) values. It is then possible to compute the correlation between these $\hat{y}$ values and the observed $y$ values. This correlation is called the multiple correlation, $R$, and its square, $R^2$, is the proportion of explained variance.
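Continuing the simulated sketch above (still plain numpy, illustrative names), the squared correlation between $y$ and $\hat{y}$ reproduces $R^2 = 1 - SSE/SST$:

```python
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                         # predictions from the regression equation

R = np.corrcoef(y, y_hat)[0, 1]          # multiple correlation
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(R**2, 1 - sse / sst)               # the two numbers agree (up to rounding)
```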

This proportion of explained variance, $R^2$, represents the variance explained in $y$ by both $x_1$ and $x_2$, i.e. by the full model, *including the overlap in those predictors*.

If the latter were not the case, situations would occur where $R^2$ is lower than the squared bivariate correlation of one or both predictors with the criterion ($y$). And that never happens (right?).
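In the same simulated example this can be checked directly; $R^2$ cannot be smaller than either squared bivariate correlation, because each single-predictor model is nested in the full model:

```python
r2_x1 = np.corrcoef(y, x1)[0, 1] ** 2    # squared bivariate correlation of x1 with y
r2_x2 = np.corrcoef(y, x2)[0, 1] ** 2    # squared bivariate correlation of x2 with y
print(R**2 >= max(r2_x1, r2_x2))         # True
```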

Now, what puzzles me and a friend is the following.

Question

Since $\hat{y}$ is computed using regression coefficients that only account for unique explained variance, and $r_{y \hat{y}}^2$ (i.e. $R^2$) represents the proportion of variance explained by both predictors, including the shared explained variance, where does this shared explained variance come from?

If $\beta_1$ and $\beta_2$ only represent the unique contribution of each predictor, how does the shared contribution suddenly end up in $R^2$? Wouldn't this require another term in the regression equation? $\beta_0$ is constant, so by definition it cannot explain any variance in $y$. Or did we always misunderstand, and does $R^2$ actually not represent the full yellow and red circles in Gung's example here, but instead only the outer 'half moon' sections?

I realise that this question may be a subtle variation on the questions already answered so excellently and clearly by Gung (see links above), but despite reading those two, Googling this a few times, and discussing it with friends, we haven't managed to figure this out.

Best Answer

I hope I got you right: let $X$ be the covariate matrix and $y$ the response variable. The OLS coefficient estimate is defined as $\hat{\beta}=(X^TX)^{-1}X^Ty$ and the predicted values are defined as $\hat{y}=X\hat{\beta}=X(X^TX)^{-1}X^Ty$, which is the projection of $y$ onto the subspace spanned by the columns of $X$. Under the normal model you also get $\hat{\beta}\sim N(\beta,\sigma^2(X^TX)^{-1})$ and $\hat{y}\sim N(\mu,\sigma^2X(X^TX)^{-1}X^T)$.
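In code (continuing the numpy sketch from the question, illustrative names), the normal-equations formula reproduces the least-squares fit, and the hat matrix $H = X(X^TX)^{-1}X^T$ is the symmetric, idempotent projection onto the column space of $X$:

```python
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                         # (X'X)^{-1} X'y
H = X @ XtX_inv @ X.T                                # hat (projection) matrix
print(np.allclose(beta_hat, beta))                   # matches the lstsq solution
print(np.allclose(H @ y, y_hat))                     # y_hat is the projection of y onto col(X)
print(np.allclose(H, H.T), np.allclose(H @ H, H))    # symmetric and idempotent
```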

When observing the marginal distributions, we get $\hat{\beta}_j\sim N(\beta_j,\sigma^2(X^TX)^{-1}_{jj})$ and $\hat{y}_i\sim N(\mu_i,\sigma^2x_i(X^TX)^{-1}x_i^T)$, but this does not mean that the covariance matrix of $\hat{\beta}$ is diagonal (and the same applies to the predictions). In fact, when discussing GLM submodels (linear regression included), it is highly unlikely to encounter diagonal covariance matrices.
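With correlated predictors the off-diagonal entries of $\sigma^2(X^TX)^{-1}$ are clearly nonzero; a quick look at the estimated covariance matrix of $\hat{\beta}$ in the simulated example (continuing the sketch above):

```python
sigma2_hat = sse / (n - X.shape[1])      # residual variance estimate
cov_beta = sigma2_hat * XtX_inv          # estimated Cov(beta_hat)
print(np.round(cov_beta, 4))             # off-diagonal entries are nonzero: the coefficients covary
```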

Now, let $e$ be the residuals: $e=y-\hat{y}=y-X(X^TX)^{-1}X^Ty=(I-X(X^TX)^{-1}X^T)y$.

$$\begin{aligned}
SSE = e^Te &= y^T(I-X(X^TX)^{-1}X^T)^T(I-X(X^TX)^{-1}X^T)y \\
&= y^T(I-X(X^TX)^{-1}X^T)(I-X(X^TX)^{-1}X^T)y \\
&= y^T(I-X(X^TX)^{-1}X^T)y,
\end{aligned}$$

since $I-X(X^TX)^{-1}X^T$ is symmetric and idempotent.
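A quick numerical check of this identity with the same simulated data and hat matrix $H$ as above:

```python
e = y - H @ y                                   # residuals
I = np.eye(n)
print(np.allclose(e @ e, y @ (I - H) @ y))      # SSE = y'(I - H)y
```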

$R^2$ is defined as $R^2=1-\frac{SSE}{SST}$, where $SST=\sum_i{(y_i-\bar{y})^2}$.

Now to some intuitive handwaving: as you can see, $SSE$ "contains information" from the whole covariance matrix of $\hat{\beta}$ (i.e., both unique and shared contributions) and not just its diagonal (which stands for the unique contributions). This explains how the shared contribution ends up in $R^2$.

Leaving the algebra aside, let me try to simplify the math: $SSE=\sum_i{(y_i-\hat{y}_i)^2}$ is the sum of squared prediction errors, so $R^2=1-\frac{SSE}{SST}$ is in fact computed using the regression equation.

Furthermore, since $X$ is the predictor matrix ($x_1$, $x_2$, etc.) and the regression coefficients are computed jointly from the whole $X$ matrix, each coefficient contains some covariance information. The situation where each coefficient contains only unique information can occur only if there is no covariance between the predictors, or if you compute a separate regression for each coefficient, which would be very wrong.
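To see that last point concretely in the simulated example: the slope from a simple regression of $y$ on $x_1$ alone differs from the multiple-regression coefficient of $x_1$, because the simple slope absorbs the variance $x_1$ shares with $x_2$ (a numpy sketch continuing the example above):

```python
# simple regression of y on x1 alone: slope = cov(x1, y) / var(x1)
slope_simple = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
print(slope_simple, beta_hat[1])     # differs from the coefficient of x1 in the full model
```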
