In short, I wouldn't use both the partial $R^2$ and the standardized coefficients in the same analysis, as they are not independent. I would argue that it is usually more intuitive to compare relationships using the standardized coefficients, because they relate readily to the model definition (i.e. $Y = \beta X$). The partial $R^2$, in turn, is essentially the proportion of the variance left unexplained by the other predictors that the predictor in question uniquely accounts for (so for the first predictor it is the square of the partial correlation $r_{x_1y.x_2...x_n}$). Furthermore, for a fit with very small error all of the coefficients' partial $R^2$ tend to 1, so they are not useful for identifying the relative importance of the predictors.
The effect size definitions:
- standardized coefficient, $\beta_{std}$ - the coefficients $\beta$ obtained from estimating a model on the standardized variables (mean = 0, standard deviation = 1).
- partial $R^2$ - the proportion of residual variation explained by adding the predictor to the constrained model (the full model without the predictor). Same as:
  - the square of the partial correlation between the predictor and the dependent variable, controlling for all the other predictors in the model: $R_{partial}^2 = r_{x_iy.X\setminus x_i}^2$;
  - partial $\eta^2$ - the ratio of the predictor's type III sum of squares to the sum of the predictor's and the error sum of squares: $\text{SS}_\text{effect}/(\text{SS}_\text{effect}+\text{SS}_\text{error})$.
- $\Delta R^2$ - the difference in $R^2$ between the full and the constrained model. Equal to:
  - the squared semipartial correlation $r_{y(x_i.X\setminus x_i)}^2$;
  - $\eta^2$ for type III sums of squares, $\text{SS}_\text{effect}/\text{SS}_\text{total}$ - what you were calculating as partial $R^2$ in the question.
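For concreteness, here is a minimal R sketch of how each of these quantities could be computed for one predictor; the simulated data and variable names are purely illustrative:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.4 * x1 + 0.3 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)   # full model
reduced <- lm(y ~ x2)        # constrained model without x1

# standardized coefficient of x1: refit on standardized variables
beta_std <- coef(lm(scale(y) ~ scale(x1) + scale(x2)))["scale(x1)"]

# Delta R^2 for x1 = squared semipartial correlation
delta_R2 <- summary(full)$r.squared - summary(reduced)$r.squared

# partial R^2 for x1: share of the reduced model's residual SS explained by x1
# (deviance() returns the residual sum of squares for an lm fit)
partial_R2 <- (deviance(reduced) - deviance(full)) / deviance(reduced)

c(beta_std = unname(beta_std), delta_R2 = delta_R2, partial_R2 = partial_R2)
```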
All of these are closely related, but they differ in how they handle the correlation structure between the variables. To understand the difference a bit better, let us assume we have three standardized (mean = 0, sd = 1) variables $x,y,z$ whose pairwise correlations are $r_{xy}, r_{xz}, r_{yz}$. We will take $x$ as the dependent variable and $y$ and $z$ as the predictors, and express all of the effect size coefficients in terms of the correlations, so we can see explicitly how each one handles the correlation structure. First, the coefficients of the regression model $x=\beta_{y}y+\beta_{z}z$ estimated using OLS (which, since the variables are standardized, are also the standardized coefficients) are:
\begin{align}\beta_{y} = \frac{r_{xy}-r_{yz}r_{zx}}{1-r_{yz}^2}\\
\beta_{z}= \frac{r_{xz}-r_{yz}r_{yx}}{1-r_{yz}^2},
\end{align}
The square root of $R_\text{partial}^2$ for each predictor (i.e. the partial correlation) is:
$$\sqrt{R^2_{xy.z}} = \frac{r_{xy}-r_{yz}r_{zx}}{\sqrt{(1-r_{xz}^2)(1-r_{yz}^2)}}\\
\sqrt{R^2_{xz.y}} = \frac{r_{xz}-r_{yz}r_{yx}}{\sqrt{(1-r_{xy}^2)(1-r_{yz}^2)}}
$$
The $\sqrt{\Delta R^2}$ for each predictor (i.e. the semipartial correlation) is given by:
$$\sqrt{R^2_{x.yz}-R^2_{x.z}}= r_{x(y.z)} = \frac{r_{xy}-r_{yz}r_{zx}}{\sqrt{1-r_{yz}^2}}\\
\sqrt{R^2_{x.yz}-R^2_{x.y}}= r_{x(z.y)}= \frac{r_{xz}-r_{yz}r_{yx}}{\sqrt{1-r_{yz}^2}}
$$
The difference between these lies in the denominator, which for the $\beta$ and for $\sqrt{\Delta R^2}$ contains only the correlation between the predictors. Please note that in most contexts (weakly correlated predictors) the two will be very similar in size, so the choice will not affect your interpretation much. Also, if the predictors have a similar strength of correlation with the dependent variable and are not too strongly correlated with each other, the ratios of the $\sqrt{R_\text{partial}^2}$ will be similar to the ratios of the $\beta_{std}$.
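These closed-form expressions can be checked numerically against `lm()`; here is a rough sketch with simulated data, using the same letters as the derivation ($x$ is the dependent variable):

```r
set.seed(2)
n <- 500
z <- rnorm(n)
y <- 0.6 * z + rnorm(n)
x <- 0.5 * y + 0.3 * z + rnorm(n)

# standardize so that the lm() coefficients are the standardized betas
x <- as.vector(scale(x)); y <- as.vector(scale(y)); z <- as.vector(scale(z))
r_xy <- cor(x, y); r_xz <- cor(x, z); r_yz <- cor(y, z)

# standardized coefficient of y: closed form vs. regression output
(r_xy - r_yz * r_xz) / (1 - r_yz^2)
coef(lm(x ~ y + z))["y"]

# sqrt(Delta R^2) for y: closed form vs. difference of nested-model R^2
(r_xy - r_yz * r_xz) / sqrt(1 - r_yz^2)
sqrt(summary(lm(x ~ y + z))$r.squared - summary(lm(x ~ z))$r.squared)

# partial correlation of x and y given z (square it for the partial R^2)
(r_xy - r_yz * r_xz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
```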
Getting back to your code: the `anova` function in R uses type I sums of squares by default, whereas the partial $R^2$ as described above should be calculated based on type III sums of squares (which, I believe, are equivalent to type II sums of squares when no interaction is present in the model). The difference lies in how the explained SS is partitioned among the predictors. With type I SS the first predictor is assigned all of the SS it can explain, the second predictor only the "left over" SS, the third only what is left after that, and so on; therefore the order in which you enter your variables in the `lm` call changes their respective SS. This is most probably not what you want when interpreting model coefficients.
If you use type II sums of squares via the `Anova` function from the `car` package, then the $F$ values in your anova table will be equal to the squared $t$ values of your coefficients (since $F(1,n) = t^2(n)$). This indicates that these quantities are indeed closely tied and should not be assessed independently. To invoke type II sums of squares in your example, replace `anova(mod)` with `Anova(mod, type = 2)`. If you include an interaction term you will need type III sums of squares instead for the coefficient and partial $R^2$ tests to agree (just remember to change the contrasts to sum contrasts with `options(contrasts = c("contr.sum", "contr.poly"))` before calling `Anova(mod, type = 3)`). The partial $R^2$ is then the variable's SS divided by the variable's SS plus the residual SS, which yields the same values as the `etasq()` output you listed. With this setup, the tests and $p$-values for your anova results (partial $R^2$) and your regression coefficients are the same.
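A rough sketch of these checks, using a built-in dataset and a placeholder model in place of your own `lm()` fit (`Anova()` is from the `car` package; the `etasq()` mentioned above comes, I believe, from the `heplots` package):

```r
library(car)  # provides Anova() with type II/III sums of squares

# placeholder model on a built-in dataset; substitute your own lm() fit
mod <- lm(mpg ~ wt + hp, data = mtcars)

a2     <- Anova(mod, type = 2)                       # type II SS table
t_vals <- summary(mod)$coefficients[-1, "t value"]   # coefficient t values

# F values from the type II table equal the squared t values
cbind(F = a2[["F value"]][1:2], t_squared = t_vals^2)

# partial R^2 = SS_effect / (SS_effect + SS_residual)
ss <- a2[["Sum Sq"]]
partial_R2 <- ss[1:2] / (ss[1:2] + ss[length(ss)])
partial_R2
```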
In general, I feel that the answer depends on the situation you are operating in. As you know, standardized coefficients are expressed on a standardized scale: each one is the unstandardized coefficient rescaled by the ratio of the standard deviation of the X of interest to the standard deviation of Y.
On the other hand, the semipartial correlation coefficient is the correlation between the criterion and a predictor that has been residualized with respect to all other predictors in the regression equation. Note that the criterion remains unaltered in the semipartial case; only the predictor is residualized. Thus, after removing the variance that the predictor has in common with the other predictors, the semipartial expresses the correlation between the residualized predictor and the unaltered criterion.
What are the advantages of using the semipartial correlations rather than standardized coefficients? In general, the advantage of the semipartial is that the denominator of the coefficient (the total variance of the criterion, Y) remains the same no matter which predictor is being examined. This makes the semipartial very interpretable. Also, the square of the semipartial can be interpreted as the proportion of the criterion variance associated uniquely with that predictor, which is useful.
Furthermore, you can also use the semipartials to fully decompose the variance components in a regression analysis. How? Each squared semipartial represents the unique variance that predictor shares with the criterion, so the sum of all squared semipartials is the total unique variance.
In terms of broadening the comment and talking about uses, I would note that each partial correlation coefficient is expressed on a different scale, which can make interpretation more difficult. Semipartial correlations, on the other hand, are all on the same scale (the total variance of Y), which facilitates comparison across predictors.
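To make the residualization concrete, here is a minimal R sketch (the simulated data and variable names are purely illustrative):

```r
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.4 * x1 + rnorm(n)
y  <- 0.5 * x1 + 0.3 * x2 + rnorm(n)

# residualize the predictor x1 on the other predictor(s); y stays unaltered
x1_resid <- resid(lm(x1 ~ x2))

# semipartial correlation of the criterion with the residualized predictor
sr1 <- cor(y, x1_resid)

# its square is the unique variance of x1, i.e. the Delta R^2 from dropping x1
sr1^2
summary(lm(y ~ x1 + x2))$r.squared - summary(lm(y ~ x2))$r.squared
```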
gung's answer is, in my view, a critique of the idea of comparing the relative strength of different variables in an empirical analysis without having a model in mind of how those variables interact, or of what the (true) joint distribution of all relevant variables looks like. Think of the example gung mentions about the importance of an athlete's height and weight. Nobody can prove that, for example, an additive linear regression is a good approximation of the conditional expectation function; in other words, height and weight might matter for an athlete's performance in a very complicated way. You can run a linear regression including both variables and compare the standardized coefficients, but you do not know whether the results really make sense.
To give a Mickey Mouse example, looking at sport climbers (my favorite sport), here is a list of top male climbers according to some performance measure taken from the site 8a.nu, with information about their height, weight and year of birth (only those with available information). We standardize all variables beforehand so we can directly compare the association between a one standard deviation change in a predictor and a one standard deviation change in the performance measure. Excluding, for the illustration, the number one, Adam Ondra, who is unusually tall, we get the following result:
Ignoring standard errors and so on, it seems that weight is more important than height, or at least equally important. But one could argue that climbers have become better over time. Perhaps we should control for cohort effects, e.g. training opportunities through better indoor facilities? Let us include year of birth!
Now we find that being young and being small is more important than being slim. But another person could argue that this only holds for top climbers. It could make sense to compare the standardized coefficients across the whole performance distribution (for example via quantile regression). And of course it might differ for female climbers, who are much smaller and slimmer. Nobody knows.
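As a rough sketch of that kind of check, here with purely simulated placeholder data (not the actual 8a.nu list) and `rq()` from the quantreg package for the quantile regression idea:

```r
library(quantreg)  # for quantile regression via rq()

# simulated placeholder data standing in for the scraped climber list
set.seed(4)
n <- 120
climbers <- data.frame(height = rnorm(n, 178, 6),
                       weight = rnorm(n, 66, 5),
                       born   = sample(1985:2003, n, replace = TRUE))
climbers$score <- rnorm(n)  # placeholder performance measure

# standardize everything so coefficients compare one-SD changes
climbers_std <- as.data.frame(scale(climbers))

# standardized OLS coefficients (the "on average" comparison)
coef(lm(score ~ height + weight + born, data = climbers_std))

# the same comparison in the upper tail of the performance distribution
coef(rq(score ~ height + weight + born, tau = 0.9, data = climbers_std))
```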
This is a Mickey Mouse example of what I think gung refers to. I am not so skeptical; I think it can make sense to look at standardized coefficients if you believe you have specified the right model, or that additive separability makes sense. But this depends, as so often, on the question at hand.
Regarding the other questions:
Yes, I think you could put it like that. The "wider range of X2 values" could arise through omitted variable bias, for example by including important variables correlated with X1 while omitting those correlated with X2.
Omitted variable bias is again a good example of why this holds. Omitted variables only cause problems (bias) if they are correlated with the predictors as well as with the outcome; see the formula in the Wikipedia entry. If the true $r$ is exactly 0, then the variable is uncorrelated with the outcome and there is no problem (even if it is correlated with the predictors).
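A small simulation sketch of this point (the setup is arbitrary, just for illustration): the omitted variable biases the coefficient only when it affects the outcome as well.

```r
set.seed(5)
n  <- 1e4
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)  # omitted variable, correlated with x1

# case 1: x2 also affects y -> the coefficient of x1 is biased upwards
y1 <- 1 * x1 + 1 * x2 + rnorm(n)
coef(lm(y1 ~ x1))["x1"]    # noticeably above the true value of 1

# case 2: x2 has no effect on y -> no bias despite cor(x1, x2) != 0
y2 <- 1 * x1 + 0 * x2 + rnorm(n)
coef(lm(y2 ~ x1))["x1"]    # close to 1
```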
Other measures, such as semipartial coefficients, face the same problem. If your dataset is large enough, you can for example run a nonparametric regression and try to estimate the full joint distribution without assumptions about the functional form (e.g. additive separability) to justify what you are doing, but this is never a proof.
Summing up, I think it can make sense to compare standardized or semipartial coefficients, but it depends, and you have to justify to yourself or to others why you think it makes sense in your case.