Solved – Limitations to generalized additive model (GAM)

generalized-additive-model, nonparametric, r

I don't quite understand the generalized additive model behind the GAM package in R. It seems quite powerful: it can easily find complex relationships and give confidence intervals for them, as seen in the R Graphical Manual. Are there any big limitations to these models, and is this why I cannot find an implementation in sklearn for Python?

Best Answer

As mentioned in the comments, a propensity to overfit is a limitation of GAMs. Another limitation is that the model loses predictive accuracy when the smoothed variables take values outside the range of the training dataset. Essentially, you are sacrificing predictive accuracy outside your data range for precision within it.
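The extrapolation issue can be seen in a small simulation. The following is only a sketch, assuming the `mgcv` package and made-up data (a noisy sine curve observed on [0, 10]); neither the data nor the variable names come from the original answer.

```r
library(mgcv)  # assumption: mgcv is used here; the question does not name a specific fitting call

set.seed(1)

# Simulated training data (hypothetical): a sine curve observed only on [0, 10]
n <- 200
train <- data.frame(x = runif(n, 0, 10))
train$y <- sin(train$x) + rnorm(n, sd = 0.3)

# Fit a GAM with a single smooth term
fit <- gam(y ~ s(x), data = train, method = "REML")

# Predict at points inside (2, 5, 8) and outside (12, 15) the training range
new_x <- data.frame(x = c(2, 5, 8, 12, 15))
pred <- predict(fit, newdata = new_x, se.fit = TRUE)

# Compare the fitted values with the true curve and inspect the standard errors
result <- data.frame(
  x      = new_x$x,
  truth  = sin(new_x$x),
  fitted = as.numeric(pred$fit),
  se     = as.numeric(pred$se.fit)
)
print(result)
```

Inside the training range the fitted values should track the true curve closely, while at x = 12 and x = 15 the smooth has no data to constrain it, so the standard errors widen and the fitted values need not resemble the truth.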

Related Solutions

Solved – Calculating a risk ratio for specific x values from a GAM model using the mgcv package

This doesn't exactly answer your question, but it might still solve your problem of needing to calculate risk ratios. The epiR package allows you to calculate risk ratios.

I could not get your example to work (see my comment to your question), so here is an example from the package's documentation:

library(epiR) # Used for Risk ratio
library(MASS) # Used for data

dat1 <- birthwt; head(dat1)

## Generate a table of cell frequencies. First set the levels of the outcome
## and the exposure so the frequencies in the 2 by 2 table come out in the
## conventional format:
dat1$low <- factor(dat1$low, levels = c(1,0))
dat1$smoke <- factor(dat1$smoke, levels = c(1,0))
dat1$race <- factor(dat1$race, levels = c(1,2,3))
## Generate the 2 by 2 table. Exposure (rows) = smoke. Outcome (columns) = low.
tab1 <- table(dat1$smoke, dat1$low, dnn = c("Smoke", "Low BW"))
print(tab1)
## Compute the incidence risk ratio and other measures of association:
epi.2by2(dat = tab1, method = "cohort.count",
         conf.level = 0.95, units = 100, outcome = "as.columns")
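As a cross-check on the point estimate, the risk ratio can also be computed by hand from the same 2 by 2 table. This is a minimal sketch using only base R and the `birthwt` data from `MASS`; the variable names are illustrative, and it gives no confidence interval.

```r
library(MASS)  # provides the birthwt data used above

# Same factor ordering as above: exposed / outcome-positive first
smoke <- factor(birthwt$smoke, levels = c(1, 0))
low   <- factor(birthwt$low,   levels = c(1, 0))
tab   <- table(smoke, low)

# Incidence risk = cases / total, within each exposure group
risk_exposed   <- as.numeric(tab["1", "1"]) / sum(tab["1", ])
risk_unexposed <- as.numeric(tab["0", "1"]) / sum(tab["0", ])

# Risk ratio: risk among smokers relative to risk among non-smokers
rr <- risk_exposed / risk_unexposed
print(rr)
```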

Solved – GAM (mgcv): AIC vs Deviance Explained

The respective formulas for these two quantities are: $$\text{deviance} = 2\log\mathcal{L}(\text{saturated model}\, |\, \text{data}) - 2\log\mathcal{L}(\text{model}\, |\, \text{data})$$ $$\text{AIC} = 2k- 2\log\mathcal{L}(\text{model}\, |\, \text{data})$$ where $\mathcal{L}$ is the likelihood and $k$ is the number of model parameters. For a fixed dataset and model family, the saturated model is fixed, and therefore for our purposes the equation for deviance is: $$\text{deviance} = \text{constant} - 2\log\mathcal{L}(\text{model}\, |\, \text{data})$$

Plotting AIC against deviance the way that you've done, we expect the data to fall along a straight line if there exist constants $c_1$ and $c_2$ such that: $$c_1 \cdot \text{AIC} + c_2 \approx \text{Deviance}$$

This can only be the case if $k$ varies approximately linearly with $\log\mathcal{L}$ across the fitted models. Although this is not a relationship that I have previously come across, it seems plausible.
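To spell out that condition, substitute the AIC formula and the simplified deviance formula (writing $C$ for the constant) into $c_1 \cdot \text{AIC} + c_2 = \text{deviance}$ and solve for $k$. This is only a rearrangement of the equations above, treating the straight-line fit as exact: $$c_1\,(2k - 2\log\mathcal{L}) + c_2 = C - 2\log\mathcal{L} \quad\Longrightarrow\quad k = \frac{c_1 - 1}{c_1}\,\log\mathcal{L} + \frac{C - c_2}{2c_1}$$

So $k$ has to be an affine function of $\log\mathcal{L}$. In the special case $c_1 = 1$ the slope vanishes and $k$ is the same for every model, so that deviance and AIC differ only by the constant $C - 2k$.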

However, it could also be that a different formula for deviance is being used altogether, as intimated here.
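One way to check which quantities are actually being plotted is to extract them directly from a few fits. The sketch below assumes `mgcv` and simulated Gaussian data (not the asker's data); `basis_dim` is a hypothetical name for the basis dimension passed to `s()`, which is a different thing from the parameter count $k$ in the AIC formula. For a Gaussian model, `deviance()` should return the residual sum of squares rather than a scaled log-likelihood difference, which is one way the reported deviance can differ from the formula above.

```r
library(mgcv)

set.seed(42)

# Simulated data (hypothetical): one smooth signal plus Gaussian noise
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.4)

# Fit the same model with several basis dimensions and collect the diagnostics
basis_dims <- c(3, 5, 10, 20)
res <- do.call(rbind, lapply(basis_dims, function(bd) {
  fit <- gam(y ~ s(x, k = bd), data = dat, method = "REML")
  data.frame(
    basis_dim = bd,
    edf       = sum(fit$edf),           # effective degrees of freedom
    AIC       = AIC(fit),
    deviance  = deviance(fit),
    dev_expl  = summary(fit)$dev.expl   # proportion of deviance explained
  )
}))
print(res)

# Plot AIC against deviance, as in the question
plot(res$AIC, res$deviance, xlab = "AIC", ylab = "Deviance")
```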
