Solved – Share of variance explained by one variable

linearregressionvariance

I have two independent variables, x1 and x2.

If I fit a linear model using only x1, I get this result

Model 1: R^2 = 0.66, p value = 4.34E-10 (***)

If I fit a linear model using both x1 and x2, I get this result

Model 2: R^2 = 0.66, p value x1 = 9.68E-05 (***), p value x2 = 0.56 ( )

Given that there is no increase in R^2 from Model 1 to Model 2, and considering that x2 is not significant, I would say that x2 explains 0% of the variance of the dependent variable.

However, if I fit a linear model using only x2, I get this result

Model 3: R^2 = 0.48, p value = 1.17E-06 (***)

In Model 3 x2 alone explains 48% of the variance, and it is highly significant.

So, what is the share of variance explained by x2: 0% or 48%?

Note: the two variables are also correlated, with Pearson r = 0.75

Best Answer

You've actually stumbled on a pretty interesting idea that gets explained in linear regression classes. The answer to your question ("what is the share....") is that it depends on which model you're using.

Because your variables are highly correlated, they explain much of the same variance in the data you're trying to model. That means that while both variables are useful in predicting y on their own, when you put them both into to the model at the same time they're "explaining the same variance" and so only one of them will seem to be "significant".

As an extreme example, imagine fitting a model $y = \beta_{0}x_{1}+\beta_{1}x_{1}+\epsilon$ (this is unrealistic, since your variables are totally co-linear you would run into a slew of issues but it illustrates the idea) and looking to see which of the two variables were useful in predicting y. Of course in the separate models $y = \beta_{0}x_{1}+\epsilon$ and $y = \beta_{1}x_{1}+\epsilon$ you would find that both $x_{1}$ and $x_{1}$ were useful in predicting y. But in the general model, since they both explain the exact same variance in y, you could set either $\beta_{0}$ or $\beta_{1}$ to 0 and predict y just as well.

Related Solutions

Solved – NMDS and variance explained by vector fitting

I wouldn't place much stock in "rules of thumb" such as this. It is dependent upon so many things such as the number of variables, the number of sites, what dissimilarity you use etc. Also note that the vector fitting approach is inherently linear and we have no reason to presume that the relationship between the variable and the NMDS configuration is linear.

The key thing is that it is a small/modest correlation but that it is significant. But you probably want to look at the linearity assumption.

In the vegan package for R we have function ordisurf() for this. It fits a 2-d surface to an NMDS solution using a GAM via function gam() in package mgcv. It essentially fits the model

$$g(\mu_i) = \eta_i = \beta_0 + f(nmds_{1i}, nmds_{2i})$$

where $f$ is a 2-d smooth function of the 1st and 2nd axes of the NMDS, $g$ is the link function, $\mu$ the expectation of the response, and $\eta$ is the linear predictor. The error distribution is a member of the exponential family of distributions. The function $f$ can be isotropic in which case we use smooths formed by s() employing by default thin plate splines. Anisotropic surfaces can be fitted too, where we use smooths formed by te(); tensor product smooths.

The complexity of the smooth is chosen during fitting using a, by default in the development versions, REML criterion. GCV smoothness selection is also available.

Here is an R example using one of the in-built data sets provided with vegan

require("vegan")
data(dune)
data(dune.env)

## fit NMDS using Bray-Curtis dissimilarity (default)
set.seed(12)
sol <- metaMDS(dune)

## NMDS plot
plot(sol)
## Fit and add the 2d surface
sol.s <- ordisurf(sol ~ A1, data = dune.env, method = "REML", 
                  select = TRUE)
## look at the fitted model
summary(sol.s)

This produces

> summary(sol.s)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x1, x2, k = knots[1], bs = bs[1])
<environment: 0x2fb78a0>

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8500     0.4105   11.81 9.65e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value  
s(x1,x2) 1.591      9 0.863  0.0203 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.29   Deviance explained =   35%
REML score = 41.587  Scale est. = 3.3706    n = 20

and

enter image description here

In this case a linear vector fit seems reasonable for this variable. Read ?ordisurf for details on the arguments used, especially what select = TRUE does.

Multiple Regression R – Calculate Variance Explained by Each Predictor in R

The percentage explained depends on the order entered.

If you specify a particular order, you can compute this trivially in R (e.g. via the update and anova functions, see below), but a different order of entry would yield potentially very different answers.

[One possibility might be to average across all orders or something, but it would get unwieldy and might not be answering a particularly useful question.]

As Stat points out, with a single model, if you're after one variable at a time, you can just use 'anova' to produce the incremental sums of squares table. This would follow on from your code:

 anova(fit)
Analysis of Variance Table

Response: dv
          Df   Sum Sq  Mean Sq F value Pr(>F)
iv1        1 0.033989 0.033989  0.7762 0.4281
iv2        1 0.022435 0.022435  0.5123 0.5137
iv3        1 0.003048 0.003048  0.0696 0.8050
iv4        1 0.115143 0.115143  2.6294 0.1802
iv5        1 0.000220 0.000220  0.0050 0.9469
Residuals  4 0.175166 0.043791

So there we have the incremental variance explained; how do we get the proportion?

Pretty trivially, scale them by 1 divided by their sum. (Replace the 1 with 100 for percentage variance explained.)

Here I've displayed it as an added column to the anova table:

 af <- anova(fit)
 afss <- af$"Sum Sq"
 print(cbind(af,PctExp=afss/sum(afss)*100))
          Df       Sum Sq      Mean Sq    F value    Pr(>F)      PctExp
iv1        1 0.0339887640 0.0339887640 0.77615140 0.4280748  9.71107544
iv2        1 0.0224346357 0.0224346357 0.51230677 0.5137026  6.40989591
iv3        1 0.0030477233 0.0030477233 0.06959637 0.8049589  0.87077807
iv4        1 0.1151432643 0.1151432643 2.62935731 0.1802223 32.89807550
iv5        1 0.0002199726 0.0002199726 0.00502319 0.9468997  0.06284931
Residuals  4 0.1751656402 0.0437914100         NA        NA 50.04732577

If you decide you want several particular orders of entry, you can do something even more general like this (which also allows you to enter or remove groups of variables at a time if you wish):

 m5 = fit
 m4 = update(m5, ~ . - iv5)
 m3 = update(m4, ~ . - iv4)
 m2 = update(m3, ~ . - iv3)
 m1 = update(m2, ~ . - iv2)
 m0 = update(m1, ~ . - iv1)

 anova(m0,m1,m2,m3,m4,m5)
Analysis of Variance Table

Model 1: dv ~ 1
Model 2: dv ~ iv1
Model 3: dv ~ iv1 + iv2
Model 4: dv ~ iv1 + iv2 + iv3
Model 5: dv ~ iv1 + iv2 + iv3 + iv4
Model 6: dv ~ iv1 + iv2 + iv3 + iv4 + iv5
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      9 0.35000                           
2      8 0.31601  1  0.033989 0.7762 0.4281
3      7 0.29358  1  0.022435 0.5123 0.5137
4      6 0.29053  1  0.003048 0.0696 0.8050
5      5 0.17539  1  0.115143 2.6294 0.1802
6      4 0.17517  1  0.000220 0.0050 0.9469

(Such an approach might also be automated, e.g. via loops and the use of get. You can add and remove variables in multiple orders if needed)

... and then scale to percentages as before.

(NB. The fact that I explain how to do these things should not necessarily be taken as advocacy of everything I explain.)