The percentage of variance explained by each predictor depends on the order in which the predictors are entered.
If you specify a particular order, you can compute it trivially in R (e.g. via the update
and anova
functions, see below), but a different order of entry can yield very different answers.
[One possibility would be to average across all possible orders, but that quickly becomes unwieldy and may not be answering a particularly useful question.]
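To see the order-dependence concretely, here's a small sketch with simulated data (the variable names a, b, y are made up for illustration, not the asker's dv/iv1..iv5): when predictors are correlated, the sequential (Type I) sum of squares attributed to a predictor changes with its position in the formula.

```r
# With correlated predictors, sequential SS depend on entry order
set.seed(1)
a <- rnorm(50)
b <- a + rnorm(50)           # b is correlated with a
y <- a + b + rnorm(50)

ss_ab <- anova(lm(y ~ a + b))$"Sum Sq"   # a entered first
ss_ba <- anova(lm(y ~ b + a))$"Sum Sq"   # b entered first

ss_ab[1]   # SS attributed to a when entered first
ss_ba[2]   # SS attributed to a when entered second -- different
```

The total sum of squares is identical either way; only its decomposition across the predictors shifts.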
--
As Stat points out, with a single model, if you want the variables one at a time in a given order, you can just use anova to produce the incremental (sequential) sums of squares table. This would follow on from your code:
anova(fit)
Analysis of Variance Table
Response: dv
Df Sum Sq Mean Sq F value Pr(>F)
iv1 1 0.033989 0.033989 0.7762 0.4281
iv2 1 0.022435 0.022435 0.5123 0.5137
iv3 1 0.003048 0.003048 0.0696 0.8050
iv4 1 0.115143 0.115143 2.6294 0.1802
iv5 1 0.000220 0.000220 0.0050 0.9469
Residuals 4 0.175166 0.043791
--
So there we have the incremental variance explained; how do we get the proportions?
Quite trivially: divide each sum of squares by their total (and multiply by 100 if you want percentage variance explained).
Here I've displayed it as an added column of the anova table:
af <- anova(fit)
afss <- af$"Sum Sq"
print(cbind(af,PctExp=afss/sum(afss)*100))
Df Sum Sq Mean Sq F value Pr(>F) PctExp
iv1 1 0.0339887640 0.0339887640 0.77615140 0.4280748 9.71107544
iv2 1 0.0224346357 0.0224346357 0.51230677 0.5137026 6.40989591
iv3 1 0.0030477233 0.0030477233 0.06959637 0.8049589 0.87077807
iv4 1 0.1151432643 0.1151432643 2.62935731 0.1802223 32.89807550
iv5 1 0.0002199726 0.0002199726 0.00502319 0.9468997 0.06284931
Residuals 4 0.1751656402 0.0437914100 NA NA 50.04732577
--
If you want several particular orders of entry, you can do something more general like the following (which also lets you enter or remove groups of variables at a time if you wish):
m5 <- fit
m4 <- update(m5, ~ . - iv5)
m3 <- update(m4, ~ . - iv4)
m2 <- update(m3, ~ . - iv3)
m1 <- update(m2, ~ . - iv2)
m0 <- update(m1, ~ . - iv1)
anova(m0,m1,m2,m3,m4,m5)
Analysis of Variance Table
Model 1: dv ~ 1
Model 2: dv ~ iv1
Model 3: dv ~ iv1 + iv2
Model 4: dv ~ iv1 + iv2 + iv3
Model 5: dv ~ iv1 + iv2 + iv3 + iv4
Model 6: dv ~ iv1 + iv2 + iv3 + iv4 + iv5
Res.Df RSS Df Sum of Sq F Pr(>F)
1 9 0.35000
2 8 0.31601 1 0.033989 0.7762 0.4281
3 7 0.29358 1 0.022435 0.5123 0.5137
4 6 0.29053 1 0.003048 0.0696 0.8050
5 5 0.17539 1 0.115143 2.6294 0.1802
6 4 0.17517 1 0.000220 0.0050 0.9469
(Such an approach could also be automated, e.g. via loops and the use of get(); you can then add and remove variables in as many different orders as needed.)
... and then scale to percentages as before.
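As a sketch of that automation (using a list of models rather than get(), and simulated data standing in for the asker's dv/iv1..iv5), each step drops the last remaining predictor via update(), and anova() then compares the whole nested sequence:

```r
# Simulated stand-in for the asker's data (10 obs, 5 predictors)
set.seed(42)
d <- data.frame(dv = rnorm(10), iv1 = rnorm(10), iv2 = rnorm(10),
                iv3 = rnorm(10), iv4 = rnorm(10), iv5 = rnorm(10))
fit <- lm(dv ~ iv1 + iv2 + iv3 + iv4 + iv5, data = d)

# Build the nested sequence in a loop; reorder terms_order to change
# the order of entry
terms_order <- c("iv1", "iv2", "iv3", "iv4", "iv5")
models <- vector("list", length(terms_order) + 1)
models[[length(models)]] <- fit                  # full model goes last
for (i in rev(seq_along(terms_order))) {
  models[[i]] <- update(models[[i + 1]],
                        as.formula(paste("~ . -", terms_order[i])))
}
do.call(anova, models)   # same nested-model table as above
```

The "Sum of Sq" column of this table reproduces the sequential sums of squares from anova(fit), since Type I SS are exactly the RSS differences between adjacent nested models.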
(NB. The fact that I explain how to do these things should not necessarily be taken as advocacy of everything I explain.)
Best Answer
You're unlikely to get legitimate answers to your questions using stepwise algorithms to select predictors. For details on that topic you could search for "variable selection" on this site. If you're willing and able to use a more intentional/focused way of choosing variables, then R's relaimpo (relative importance) package should be very helpful. Its calc.relimp command calculates the change in r-squared for each predictor when the predictor is entered last (its squared semipartial, or "part", correlation) -- and/or when it is entered first (its zero-order r-squared). A basic statement is
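A minimal sketch of such a statement (assuming the relaimpo package is installed, and that fit is the lm model from the example above):

```r
# Assumes install.packages("relaimpo") has been run
library(relaimpo)
calc.relimp(fit,
            type = c("first", "last", "lmg"),  # entered first, entered last,
                                               # and averaged over all orders
            rela = TRUE)                       # rescale so shares sum to 100%
```

The "lmg" metric averages the sequential contributions over all possible orders of entry, which is one principled way around the order-dependence discussed above.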