Solved – Difficulty calculating predictors' relative importance for a GAM

generalized-additive-model, importance, nonlinear regression, regression

Although there is no agreed-upon definition of "relative importance of predictors" even for linear models (one possible definition is the lmg method), I would still like to know whether there are acceptable methods for computing it when I build a generalized additive model (GAM).

It seems natural to ask which predictor is more important or useful, expressed quantitatively (e.g., as a percentage).

I found that the relaimpo package can calculate several relative importance metrics for linear models, but it cannot handle GAMs.
Here is an example:

library(relaimpo)
library(mgcv)
gam1 <- gam(mpg ~ s(drat) + s(wt) + s(qsec), data = mtcars, method = "REML")
summary(gam1)

From the summary() result, we can see which predictors are "significant" by p-value:

Approximate significance of smooth terms:
          edf Ref.df      F  p-value    
s(drat) 1.000  1.000  0.523 0.476069    
s(wt)   2.487  3.028 21.950 1.59e-08 ***
s(qsec) 1.000  1.000 15.241 0.000545 ***

But this does not tell us their "relative importance". For example, can we obtain information like the following?

`wt` has a relative importance of 60%, 
`qsec` has a relative importance of 30%, 
`drat` has a relative importance of 10%.

What's worse, because a GAM doesn't have a true R-squared, I suppose the lmg method cannot be applied.

Best Answer

The caret package provides one answer. With the default tuneGrid and trainControl,

library(caret)
data("mtcars")
gam1 <- train(
  mpg ~ drat + wt + qsec, 
  data = mtcars, 
  method = "gam"
)

and you can then apply varImp.

varImp(gam1)
## gam variable importance
##      Overall
## wt     100.0
## qsec    26.4
## drat     0.0

To get something closer to the percentages you wanted, you can rescale the returned importances so that they sum to 100:

library(dplyr)
x <- varImp(gam1)
x$importance %>%
  mutate(
    Variable = rownames(.), Overall = Overall / sum(Overall) * 100
  ) %>% 
  arrange(desc(Overall)) %>%
  select(Variable, Overall)
##   Variable Overall
## 1       wt   79.09
## 2     qsec   20.91
## 3     drat    0.00

Because the defaults will not tune the splines or degrees of freedom, you should check how to do this in the caret package. method = 'gam' calls the mgcv package, but there are plenty of other options. For instance, if you used method = 'gamSpline', it would tune over the degrees of freedom and give a different varImp result.

Be wary of what caret is doing under the hood, however: if a predictor has few distinct values, caret may fit that term as linear rather than as a smooth.
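As a rough cross-check that stays inside mgcv, one could refit the model without each smooth and look at how much the deviance explained drops. This is only a sketch and not what varImp computes: the drop-one scheme, the names `full`, `terms_used`, and `drop_in_dev`, and the rescaling to percentages are all assumptions made for illustration. The result also depends on how correlated the predictors are.

```r
library(mgcv)

# Full model, same specification as in the question
full <- gam(mpg ~ s(drat) + s(wt) + s(qsec), data = mtcars, method = "REML")
terms_used <- c("drat", "wt", "qsec")
dev_full <- summary(full)$dev.expl

# Refit without each smooth in turn and record the loss in deviance explained
drop_in_dev <- sapply(terms_used, function(v) {
  rhs <- paste0("s(", setdiff(terms_used, v), ")", collapse = " + ")
  reduced <- gam(as.formula(paste("mpg ~", rhs)), data = mtcars, method = "REML")
  dev_full - summary(reduced)$dev.expl
})

# Heuristic rescaling to percentages; a drop can be near zero (or even
# slightly negative) for a weak term, because smoothing is re-estimated
round(100 * drop_in_dev / sum(drop_in_dev), 1)
```

Because smoothing parameters are re-estimated in every refit, the numbers are best read as a ranking aid rather than an exact decomposition.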

Related Solutions

Solved – Which variable relative importance method to use

I prefer to compute the proportion of explainable log-likelihood that is explained by each variable. For OLS models the rms package makes this easy:

library(rms)
# y and x1..x4 are placeholders for your own response and predictors
f <- ols(y ~ x1 + x2 + pol(x3, 2) + rcs(x4, 5) + ...)
plot(anova(f), what='proportion chisq')
# also try what='proportion R2'
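For a version that runs as written, here is a small sketch on mtcars. The choice of dataset, the predictors, and the 3-knot spline for wt are assumptions for illustration only, not part of the original answer.

```r
library(rms)

# mtcars stands in for the placeholder data; rcs(wt, 3) is an arbitrary choice
f <- ols(mpg ~ drat + rcs(wt, 3) + qsec, data = mtcars)

# Partial tests for each predictor (wt pools its linear and nonlinear parts)
a <- anova(f)
print(a)

# Dot chart of each predictor's share of the total chi-square
plot(a, what = "proportion chisq")
```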

The default for plot(anova()) is to display the Wald $\chi^2$ statistic minus its degrees of freedom for assessing the partial effect of each variable. Even though this is not scaled to $[0,1]$, it is probably the best method in general because it penalizes a variable requiring a large number of parameters to achieve the $\chi^2$. For example, a categorical predictor with 5 levels will have 4 d.f. and a continuous predictor modeled as a restricted cubic spline function with 5 knots will have 4 d.f.

If a predictor interacts with any other predictor(s), the $\chi^2$ and partial $R^2$ measures combine the appropriate interaction effects with main effects. For example, if the model were y ~ pol(age,2) * sex, the statistic for sex is the combined effect of sex as a main effect plus the effect modification that sex provides for the age effect. This is an assessment of whether there is a difference between the sexes for any age.

Methods such as random forests, which do not favor additive effects, are not likelihood based, and use multiple trees, require a different notion of variable importance.

Solved – Calculating relative importance of predictors in a poisson glm model

There are many possible ways to estimate relative importance, as Ulrike Grömping, the developer of relaimpo, documents in her papers on approaches to estimating this metric. Her method and accompanying R package are among the more sophisticated. Your first option is to recognize the nonconfirmatory and proxy nature of all of these approaches: they are all approximations. Given that, why not ignore the Gaussian assumptions of relaimpo and run your model through the package?

To @jay's point, you can't assess relative importance from coefficients that are expressed in the scale of their predictors. With that in mind, another approach would be to employ a widely used practice from classic OLS regression: standardize the predictors and compare the absolute values of the resulting coefficients.
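A minimal sketch of the standardized-coefficient idea for a Poisson GLM follows. The model (carb as a count response in mtcars, with wt, hp, and qsec as predictors) is purely hypothetical and chosen only so the code is self-contained.

```r
# Hypothetical example: treat carb in mtcars as a count response
preds <- c("wt", "hp", "qsec")
dat <- mtcars[, c("carb", preds)]

# Standardize each predictor to mean 0 and standard deviation 1
dat[preds] <- lapply(dat[preds], function(x) as.numeric(scale(x)))

fit <- glm(carb ~ wt + hp + qsec, family = poisson, data = dat)

# Coefficients are now per standard deviation of each predictor,
# so their absolute values are on a comparable scale
std_coef <- abs(coef(fit)[preds])
sort(std_coef, decreasing = TRUE)
```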

Yet another approach would be to take the absolute values of the Z-statistics in your model, sum them, and express each absolute Z-statistic as a percentage of that total. Ranking those percentages gives a simple, workable heuristic for relative importance.
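The Z-statistic heuristic might be sketched as follows. Again, the Poisson model on mtcars (carb as the count response) is an assumption for illustration, and `rel_imp` is just a name chosen here.

```r
# Hypothetical Poisson GLM; carb in mtcars serves as the count response
fit <- glm(carb ~ wt + hp + qsec, family = poisson, data = mtcars)

# Z-statistics from the coefficient table, dropping the intercept row
z <- summary(fit)$coefficients[-1, "z value"]

# Each |z| as a percentage of the sum of all |z|
rel_imp <- 100 * abs(z) / sum(abs(z))
sort(round(rel_imp, 1), decreasing = TRUE)
```

With factor predictors this yields one value per dummy coefficient rather than per variable, so the shares would need to be combined or interpreted with care.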
