Solved – Importance of predictors in multiple regression: Partial $R^2$ vs. standardized coefficients

Tags: multiple-regression, r, r-squared, regression, regression-coefficients

I am wondering what the exact relationship between partial $R^2$ and coefficients in a linear model is and whether I should use only one or both to illustrate the importance and influence of factors.

As far as I know, summary gives the estimates of the coefficients, and anova gives the sum of squares for each factor; the sum of squares of one factor divided by the total of all factor sums of squares plus the residual sum of squares is what I call its partial $R^2$ (the following code is in R).

library(car)
mod <- lm(education ~ income + young + urban, data = Anscombe)
summary(mod)

Call:
lm(formula = education ~ income + young + urban, data = Anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.240 -15.738  -1.156  15.883  51.380 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.868e+02  6.492e+01  -4.418 5.82e-05 ***
income       8.065e-02  9.299e-03   8.674 2.56e-11 ***
young        8.173e-01  1.598e-01   5.115 5.69e-06 ***
urban       -1.058e-01  3.428e-02  -3.086  0.00339 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.69 on 47 degrees of freedom
Multiple R-squared:  0.6896,    Adjusted R-squared:  0.6698 
F-statistic: 34.81 on 3 and 47 DF,  p-value: 5.337e-12

anova(mod)
Analysis of Variance Table

Response: education
          Df Sum Sq Mean Sq F value    Pr(>F)    
income     1  48087   48087 67.4869 1.219e-10 ***
young      1  19537   19537 27.4192 3.767e-06 ***
urban      1   6787    6787  9.5255  0.003393 ** 
Residuals 47  33489     713                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The sizes of the coefficients for 'young' (0.8) and 'urban' (-0.1, about 1/8 of the former in absolute value) do not match the explained variances ('young' ~19500 and 'urban' ~6790, a ratio of roughly 1/3).

So I thought I would need to scale my data, because I assumed that if one factor's range is much wider than another's, their coefficients would be hard to compare:

Anscombe.sc <- data.frame(scale(Anscombe))
mod <- lm(education ~ income + young + urban, data = Anscombe.sc)
summary(mod)

Call:
lm(formula = education ~ income + young + urban, data = Anscombe.sc)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29675 -0.33879 -0.02489  0.34191  1.10602 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.084e-16  8.046e-02   0.000  1.00000    
income       9.723e-01  1.121e-01   8.674 2.56e-11 ***
young        4.216e-01  8.242e-02   5.115 5.69e-06 ***
urban       -3.447e-01  1.117e-01  -3.086  0.00339 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5746 on 47 degrees of freedom
Multiple R-squared:  0.6896,    Adjusted R-squared:  0.6698 
F-statistic: 34.81 on 3 and 47 DF,  p-value: 5.337e-12

anova(mod)
Analysis of Variance Table

Response: education
          Df  Sum Sq Mean Sq F value    Pr(>F)    
income     1 22.2830 22.2830 67.4869 1.219e-10 ***
young      1  9.0533  9.0533 27.4192 3.767e-06 ***
urban      1  3.1451  3.1451  9.5255  0.003393 ** 
Residuals 47 15.5186  0.3302                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1    

But that doesn't really make a difference: the partial $R^2$ values and the sizes of the coefficients (which are now standardized coefficients) still do not match:

22.3/(22.3+9.1+3.1+15.5)
# income: partial R2 0.446, Coeff 0.97
9.1/(22.3+9.1+3.1+15.5)
# young:  partial R2 0.182, Coeff 0.42
3.1/(22.3+9.1+3.1+15.5)
# urban:  partial R2 0.062, Coeff -0.34

So is it fair to say that 'young' explains three times as much variance as 'urban' because partial $R^2$ for 'young' is three times that of 'urban'? Why is the coefficient of 'young' then not three times that of 'urban' (ignoring the sign)?

I suppose the answer to this question will also answer my initial query: should I use partial $R^2$ or the coefficients to illustrate the relative importance of factors? (Ignoring the direction of influence – the sign – for the time being.)

Edit:

Partial eta-squared appears to be another name for what I called partial $R^2$. etasq() from the heplots package is a useful function that produces similar results:

etasq(mod)
          Partial eta^2
income        0.6154918
young         0.3576083
urban         0.1685162
Residuals            NA

Best Answer

In short, I wouldn't use both the partial $R^2$ and the standardized coefficients in the same analysis, as they are not independent. I would argue that it is usually more intuitive to compare relationships using the standardized coefficients, because they relate directly to the model definition (i.e. $Y = \beta X$). The partial $R^2$, in turn, is essentially the proportion of variance that the predictor uniquely shares with the dependent variable (so for the first predictor it is the square of the partial correlation $r_{x_1y.x_2...x_n}$). Furthermore, for a fit with very small error all the predictors' partial $R^2$ values tend to 1, so they are not useful for identifying the relative importance of the predictors.
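
As a quick illustration of the partial-correlation interpretation, here is a minimal R sketch using the Anscombe data from the question (the object names are mine): residualize both the response and the predictor on the remaining predictors and correlate the residuals; the square of that correlation should match the etasq() value for income shown in the question's edit.

library(car)
# part of education not explained by young and urban
e.edu <- resid(lm(education ~ young + urban, data = Anscombe))
# part of income not explained by young and urban
e.inc <- resid(lm(income ~ young + urban, data = Anscombe))
cor(e.edu, e.inc)^2  # roughly 0.615, the partial R^2 (partial eta^2) for income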


The effect size definitions

  • standardized coefficient, $\beta_{std}$ - the coefficient $\beta$ obtained from estimating the model on standardized variables (mean = 0, standard deviation = 1). (A short R sketch after this list illustrates all three definitions.)
  • partial $R^2$ - the proportion of residual variation explained by adding the predictor to the constrained model (the full model without that predictor). This is the same as:

    • the square of the partial correlation between the predictor and the dependent variable, controlling for all the other predictors in the model. $R_{partial}^2 = r_{x_iy.X\setminus x_i}^2$.
    • partial $\eta^2$ - the ratio of the predictor's type III sum of squares to the sum of that quantity and the error sum of squares, $\text{SS}_\text{effect}/(\text{SS}_\text{effect}+\text{SS}_\text{error})$.
  • $\Delta R^2$ - The difference in $R^2$ between the constrained and full model. Equal to:

    • the squared semipartial correlation $r_{y(x_i.X\setminus x_i)}^2$
    • $\eta^2$ for type III sum of squares $\text{SS}_\text{effect}/\text{SS}_\text{total}$ - what you were calculating as partial $R^2$ in the question.
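
To make these definitions concrete, here is a minimal R sketch computing each of the three quantities for the 'young' predictor, using the Anscombe data from the question (the object names are arbitrary):

library(car)       # Anscombe data
library(heplots)   # etasq()

full    <- lm(education ~ income + young + urban, data = Anscombe)
reduced <- lm(education ~ income + urban, data = Anscombe)
std     <- lm(education ~ income + young + urban, data = data.frame(scale(Anscombe)))

coef(std)["young"]                                    # standardized coefficient (~0.42)
etasq(full)["young", ]                                # partial R^2 / partial eta^2 (~0.36)
summary(full)$r.squared - summary(reduced)$r.squared  # Delta R^2: drop in R^2 without 'young'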

All of these are closely related, but they differ in how they handle the correlation structure between the variables. To understand this difference a bit better, let us assume we have three standardized (mean = 0, sd = 1) variables $x,y,z$ whose pairwise correlations are $r_{xy}, r_{xz}, r_{yz}$. We will take $x$ as the dependent variable and $y$ and $z$ as the predictors, and express all of the effect size measures in terms of these correlations so we can see explicitly how each handles the correlation structure. First, the coefficients of the regression model $x=\beta_{y}y+\beta_{z}z$ estimated using OLS are: \begin{align}\beta_{y} &= \frac{r_{xy}-r_{yz}r_{zx}}{1-r_{yz}^2}\\ \beta_{z} &= \frac{r_{xz}-r_{yz}r_{yx}}{1-r_{yz}^2}. \end{align} The square root of the $R_\text{partial}^2$ for each predictor is:

$$\sqrt{R^2_{xy.z}} = \frac{r_{xy}-r_{yz}r_{zx}}{\sqrt{(1-r_{xz}^2)(1-r_{yz}^2)}}\\ \sqrt{R^2_{xz.y}} = \frac{r_{xz}-r_{yz}r_{yx}}{\sqrt{(1-r_{xy}^2)(1-r_{yz}^2)}} $$

and the $\sqrt{\Delta R^2}$ is given by:

$$\sqrt{R^2_{x.yz}-R^2_{x.z}}= r_{x(y.z)} = \frac{r_{xy}-r_{yz}r_{zx}}{\sqrt{1-r_{yz}^2}}\\ \sqrt{R^2_{x.yz}-R^2_{x.y}}= r_{x(z.y)}= \frac{r_{xz}-r_{yz}r_{yx}}{\sqrt{1-r_{yz}^2}} $$

The difference between these lies in the denominator, which for the $\beta$ and the $\sqrt{\Delta R^2}$ contains only the correlation between the predictors, whereas for the partial correlation it also involves the other predictor's correlation with the dependent variable. Note that in most contexts (for weakly correlated predictors) the sizes of these two will be very similar, so the choice will not affect your interpretation much. Also, if the predictors have a similar strength of correlation with the dependent variable and are not too strongly correlated with each other, the ratios of the $\sqrt{R_\text{partial}^2}$ will be similar to the ratios of the $\beta_{std}$.
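
Here is a hedged numerical check of these formulas, using education as $x$, income as $y$ and urban as $z$ from the Anscombe data (an arbitrary choice of variables, and my own object names, purely for illustration):

library(car)
r   <- cor(Anscombe[, c("education", "income", "urban")])
rxy <- r["education", "income"]
rxz <- r["education", "urban"]
ryz <- r["income", "urban"]

beta_y <- (rxy - ryz * rxz) / (1 - ryz^2)                      # standardized coefficient for y
part_y <- (rxy - ryz * rxz) / sqrt((1 - rxz^2) * (1 - ryz^2))  # sqrt of partial R^2 for y
semi_y <- (rxy - ryz * rxz) / sqrt(1 - ryz^2)                  # sqrt of Delta R^2 for y

c(beta = beta_y, partial = part_y, semipartial = semi_y)
# beta_y should equal the coefficient from the standardized two-predictor fit:
coef(lm(scale(education) ~ scale(income) + scale(urban), data = Anscombe))["scale(income)"]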

Getting back to your code: the anova function in R uses type I (sequential) sums of squares, whereas the partial $R^2$ as described above should be calculated based on type III sums of squares (which I believe are equivalent to type II sums of squares if no interaction is present in your model). The difference is in how the explained SS is partitioned among the predictors. With type I SS the first predictor is assigned all of the SS it can explain, the second only the SS left over after that, and the third only what is left over again; therefore the order in which you enter the variables in your lm call changes their respective SS. This is most probably not what you want when interpreting model coefficients.
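
A small demonstration of this order dependence, reusing the model from the question (only the order of the predictors differs between the two anova calls):

library(car)
anova(lm(education ~ income + young + urban, data = Anscombe))  # type I: income entered first
anova(lm(education ~ urban + young + income, data = Anscombe))  # type I: urban entered first, SS change
Anova(lm(education ~ income + young + urban, data = Anscombe), type = 2)  # type II: order does not matter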

If you use type II sums of squares via the Anova function from the car package, the $F$ values in your anova table will be equal to the squared $t$ values of your coefficients (since $F(1,n) = t^2(n)$). This indicates that these quantities are closely tied and should not be assessed independently. To invoke type II sums of squares in your example, replace anova(mod) with Anova(mod, type = 2). If you include an interaction term you will need type III sums of squares for the coefficient and partial $R^2$ tests to be the same (just remember to change the contrasts to sum contrasts using options(contrasts = c("contr.sum","contr.poly")) before calling Anova(mod, type = 3)). The partial $R^2$ is then the predictor's SS divided by the predictor's SS plus the residual SS, which yields the same values as the etasq() output you listed. With this setup, the tests and $p$-values for your anova results (partial $R^2$) and for your regression coefficients are the same.
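
A sketch of these checks in R (same model as above; the object name a2 is mine, and the comparisons are the ones described in this paragraph):

library(car)
mod <- lm(education ~ income + young + urban, data = Anscombe)
a2  <- Anova(mod, type = 2)

a2$"F value"                                # F statistics (NA for the residual row)
summary(mod)$coefficients[-1, "t value"]^2  # squared t values; should match the F values above

a2$"Sum Sq"[1:3] / (a2$"Sum Sq"[1:3] + a2$"Sum Sq"[4])  # partial R^2; should match etasq(mod)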

