Solved – Coefficient of determination $R^{2}$ for each variable in multiple regression

Tags: multiple regression, r-squared, regression coefficients

In multiple linear regression, is the coefficient of determination calculated for each independent variable, or is it only for the model obtained, that is, in relation to all the independent variables? I want to make an analysis in which I relate the model coefficients (standardized beta weights) to an $R^{2}$ value for each variable.

Best Answer

The coefficient of determination is defined for the model as a whole, not for individual variables. However, analysis of variance (ANOVA) can roughly be thought of as breaking $R^2$ into contributions from each variable.

Recall that the coefficient of determination is defined in terms of the sums of squares of residuals:

$$ \begin{align} R^2 & = 1 - {SS_{\rm res}\over SS_{\rm tot}} \\ SS_\text{tot} & =\sum_{i=1}^n (\bar{y} - y_i)^2 \\ SS_\text{res} & =\sum_{i=1}^n (\hat{y}_i-y_i)^2 \\ \end{align}$$

where $\hat{y}$ is the prediction vector of the model. We can't make a prediction $\hat{y}$ without considering all of the variables in the model, which is why $R^2$ belongs to the model as a whole.
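As a minimal sketch of the definition above, here is the same calculation in plain Python. The observations and predictions are made up for illustration, and `r_squared` is a hypothetical helper name, not a library function. The second call shows that predicting the mean for every point gives $R^2 = 0$.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((y_bar - yi) ** 2 for yi in y)
    ss_res = sum((fi - yi) ** 2 for fi, yi in zip(y_hat, y))
    return 1.0 - ss_res / ss_tot

# Made-up observations and model predictions, for illustration only.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

r2_model = r_squared(y, y_hat)

# Predicting the mean everywhere makes SS_res equal to SS_tot.
r2_mean = r_squared(y, [sum(y) / len(y)] * len(y))

print(round(r2_model, 3), r2_mean)
```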

But look at the equation for $SS_\text{tot}$ once more. It has exactly the same form as $SS_\text{res}$ would have for a trivial model with only an intercept term; such a model would predict $\hat{y}_i = \bar{y}$ for all $i$. This suggests that we are not comparing one model to some platonic ideal, but actually comparing two different models. This insight can be generalized into a chain of models:

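$$ \frac{SS_1}{SS_{\text{tot}}} + \frac{SS_2}{SS_{\text{tot}}} + \dots + \frac{SS_k}{SS_{\text{tot}}} + \frac{SS_\text{res}}{SS_\text{tot}} = 1 $$

Here we consider a chain of models, starting from the intercept-only model and adding one variable at a time. Write $SS_\text{res}^{(j)}$ for the residual sum of squares once the first $j$ variables are in the model, so that $SS_\text{res}^{(0)} = SS_\text{tot}$ and $SS_\text{res}^{(k)} = SS_\text{res}$. Then $SS_j = SS_\text{res}^{(j-1)} - SS_\text{res}^{(j)}$ is the drop in the residual sum of squares when the $j$-th variable is added, and the quantity $\frac{SS_j}{SS_\text{tot}}$ can be interpreted as the "amount of variance explained" by the $j$-th variable. The first $k$ terms of the sum telescope to $1 - \frac{SS_\text{res}}{SS_\text{tot}}$, which is the $R^2$ of the full model. As a concrete example, here is the output of R's anova() function for a linear model of Ozone fitted to the built-in airquality dataset: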

Analysis of Variance Table

Response: Ozone
           Df Sum Sq Mean Sq F value    Pr(>F)    
Solar.R     1  14780   14780 33.9704 6.216e-08 ***
Wind        1  39969   39969 91.8680 5.243e-16 ***
Temp        1  19050   19050 43.7854 1.584e-09 ***
Month       1   1701    1701  3.9101   0.05062 .  
Day         1    619     619  1.4220   0.23576    
Residuals 105  45683     435      
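To see where the Sum Sq column of such a table comes from, here is a minimal sketch in Python (not R) that adds predictors one at a time and records the drop in the residual sum of squares at each step. The data are made up, `sequential_ss` is a hypothetical helper rather than a library function, and the code assumes the predictors are not collinear.

```python
def sequential_ss(columns, y):
    """Sequential sums of squares for predictors added in the given order.

    columns: list of predictor columns (lists of floats), in entry order.
    Returns (ss_per_variable, ss_res, ss_tot).
    Assumes no predictor is an exact linear combination of earlier ones.
    """
    n = len(y)
    y_bar = sum(y) / n
    # Residuals of the intercept-only model: y minus its mean.
    resid = [yi - y_bar for yi in y]
    ss_tot = sum(r * r for r in resid)

    basis = []  # orthogonalised predictors already in the model
    ss = []
    for col in columns:
        c_bar = sum(col) / n
        v = [ci - c_bar for ci in col]  # remove the intercept
        for b in basis:                 # remove earlier predictors
            coef = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        # Regress the current residuals on the new orthogonal direction.
        coef = sum(ri * vi for ri, vi in zip(resid, v)) / sum(vi * vi for vi in v)
        before = sum(r * r for r in resid)
        resid = [ri - coef * vi for ri, vi in zip(resid, v)]
        after = sum(r * r for r in resid)
        ss.append(before - after)       # drop in residual sum of squares
        basis.append(v)
    return ss, sum(r * r for r in resid), ss_tot

# Made-up data with two correlated predictors, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.2, 1.9, 3.5, 3.9, 5.8, 5.9]

ss_a, res_a, tot_a = sequential_ss([x1, x2], y)  # x1 enters first
ss_b, res_b, tot_b = sequential_ss([x2, x1], y)  # x2 enters first
print(ss_a, res_a, tot_a)
print(ss_b, res_b, tot_b)
```

In both orders the per-variable values plus the residual add up to the total sum of squares, but the share credited to `x1` is much larger when it enters first.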

This kind of table is called a "sequential" analysis of variance. The Sum Sq column sums to the total sum of squares of the entire dataset, so we can see that Wind explains roughly twice as much variance as Temp. This interpretation is subject to many caveats: it is sensitive to the order in which variables are added, the F values and associated p-values on the right are only meaningful for a purely linear model, etc. Nevertheless, if we take that Sum Sq column and divide it by the total sum of squares:

Solar.R    0.12
Wind       0.33
Temp       0.16
Month      0.01
Day        0.01
Residuals  0.38

We get a table in which every line item is roughly analogous to an "$R^2$" for each variable (plus one line item for the unexplained residual), although that terminology is never used, as far as I know. People talk about the proportion of variance explained instead.
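The division can be reproduced directly from the Sum Sq column. The short Python sketch below copies the values from the anova() output shown earlier; because those values are printed as rounded integers, the proportions are approximate. It also recovers the $R^2$ of the full model as one minus the residual share.

```python
# Sum Sq column copied from the anova() output (rounded integers).
sum_sq = {
    "Solar.R": 14780,
    "Wind": 39969,
    "Temp": 19050,
    "Month": 1701,
    "Day": 619,
    "Residuals": 45683,
}

ss_tot = sum(sum_sq.values())
proportions = {name: ss / ss_tot for name, ss in sum_sq.items()}

for name, p in proportions.items():
    print(f"{name:<10} {p:.2f}")

# R^2 of the full model is 1 - SS_res / SS_tot.
r2_full = 1 - sum_sq["Residuals"] / ss_tot
print(f"R^2 of the full model: {r2_full:.3f}")
```

The explained shares add up to the full-model $R^2$ of about 0.625, which is the sense in which the table splits $R^2$ across variables.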

Here are some additional resources if you want to read further:

  1. https://math.stackexchange.com/questions/1792351/sequential-anova-r
  2. https://astrostatistics.psu.edu/su07/R/html/stats/html/anova.lm.html
  3. http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HseqRegnSsq/seqRegnSsq4.html