Predictive Models – Beyond Interpretability: Why Use Additive Models Over Complex Smoothers?

computational-statistics, generalized-additive-model, interpretation, predictive-models, smoothing

Let's adopt, for concreteness, the notation from the R package mgcv.

library(mgcv)

dat <- gamSim(scale = 3, verbose = FALSE)  # simulate an example dataset
additive_model <- gam(y ~ s(x0) + s(x1) + s(x2), 
                      data = dat)
spline_model <- gam(y ~ te(x0, x1, x2), 
                    data = dat)

One clear advantage of additive models (additive_model above) over unrestricted multivariate smoothers (Gaussian process regression, tensor-product splines, etc., spline_model above) is interpretability: by imposing an additive structure, we get univariate component functions that can each be plotted and read off directly. The model without this structure is much harder to interpret, especially when there are more than three (or even two) inputs.
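A minimal sketch of that interpretability gap, continuing from the code above (the data are simulated, so the specific shapes are just illustrative):

```r
library(mgcv)

dat <- gamSim(scale = 3, verbose = FALSE)
additive_model <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat)
spline_model   <- gam(y ~ te(x0, x1, x2), data = dat)

## Each term of the additive model is a univariate function, so
## plot.gam() can draw one panel per smooth, with confidence bands.
plot(additive_model, pages = 1, shade = TRUE)

## The three-way tensor smooth has no term-by-term decomposition;
## vis.gam() can only show two inputs at a time, holding the rest fixed.
vis.gam(spline_model, view = c("x0", "x1"), plot.type = "contour")
```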

Is that the only advantage of the additive model, though? If all we want are good predictions and good statistical fit (and especially without thinking hard about what the right additive model is), would we ever choose the additive model over the complex multivariate smoother? Or can the additive model beat the complex spline or Gaussian process in predictive ability or computational feasibility?

Best Answer

There's a problem with multivariate smoothing in high dimensions. Conceptually, a multivariate smooth predicts $f(x_0)$ using data $(x,y)$ where $x$ is 'close' to $x_0$ in every dimension. An additive model uses data where $x$ is close to $x_0$ in at least one dimension.

When you have $n$ points in $d$ dimensional space, for something like $d\gg\log n$, there aren't any data points $(x,y)$ with $x$ close to $x_0$ in every dimension. To get your multivariate smoother to work, you need to increase the smoothing window -- potentially by a lot.
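This emptiness of high-dimensional neighbourhoods is easy to see numerically. A base-R illustration (the specific n and dimensions are arbitrary choices): with n fixed, the distance from a central prediction point to its nearest data point grows quickly with d, so a multivariate smoother must widen its window just to capture any data at all.

```r
set.seed(1)
n <- 500
for (d in c(1, 2, 5, 10, 20)) {
  X <- matrix(runif(n * d), n, d)        # n points uniform in [0,1]^d
  x0 <- rep(0.5, d)                      # prediction point at the centre
  dists <- sqrt(colSums((t(X) - x0)^2))  # Euclidean distances to x0
  cat(sprintf("d = %2d: nearest-neighbour distance = %.3f\n",
              d, min(dists)))
}
```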

So, one tradeoff is between an additive model with a relatively small smoothing window and a multivariate smoother with a larger smoothing window. (You can compromise, and take an additive model with low-dimensional rather than one-dimensional smoothing components, but it is a compromise.)
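In mgcv notation, that compromise might look like the following sketch (the choice of pairwise terms is just one possibility, not a recommendation): each component now smooths over two dimensions rather than one or all three.

```r
library(mgcv)

dat <- gamSim(scale = 3, verbose = FALSE)

## Pairwise tensor-product smooths: richer than the purely additive
## model, but each term still needs data close to x0 in only two
## dimensions at once, not all three.
compromise_model <- gam(y ~ te(x0, x1) + te(x0, x2) + te(x1, x2),
                        data = dat)
```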

Another way of looking at the same issue is that an additive model has a much smaller space of possible fits (once you fix the smoothing window size). It's less flexible. This is bad if the true response function isn't close to any of the possibilities in the additive model, because your fit will be bad. It's good if the true response function is close to some of the possibilities in the additive model, because you'll get accurate prediction with less overfitting.
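A quick sketch of that bias-variance point, not a definitive benchmark: fit both models on a training half of the simulated data and compare held-out prediction error. Note that gamSim's default test function is itself additive, so this setup favours the additive model, exactly the "truth close to the additive class" case described above.

```r
library(mgcv)

set.seed(2)
dat   <- gamSim(scale = 3, verbose = FALSE)  # 400 rows by default
train <- dat[1:200, ]
test  <- dat[201:400, ]

additive_model <- gam(y ~ s(x0) + s(x1) + s(x2), data = train)
spline_model   <- gam(y ~ te(x0, x1, x2), data = train)

## Root-mean-squared error on the held-out half
rmse <- function(m) sqrt(mean((test$y - predict(m, newdata = test))^2))
cat("additive RMSE:", rmse(additive_model), "\n")
cat("tensor   RMSE:", rmse(spline_model), "\n")
```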