Solved – GAMs with many slightly correlated predictors

generalized-additive-model, multicollinearity, pearson-r

Say I'm constructing a GAM for a response variable R in terms of predictors A, B, C, and D. Something like this (in quasi-R-code):

R ~ s(A) + s(B) + s(C) + s(D)

Before I construct this model, I check for collinearity by calculating Pearson's correlation coefficients. This shows that A is slightly correlated with all of the other predictors (values of around ±0.3). Since none of the correlation coefficients is very high on its own, I'm okay with proceeding. The best model according to AIC is R ~ s(A).
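For concreteness, the check and the model comparison look roughly like this (a minimal sketch; dat is a placeholder data frame holding R, A, B, C, and D):

    # Pairwise Pearson correlations among the predictors
    cor(dat[, c("A", "B", "C", "D")], method = "pearson")

    # Candidate GAMs compared by AIC, using mgcv
    library(mgcv)
    m_full <- gam(R ~ s(A) + s(B) + s(C) + s(D), data = dat)
    m_A    <- gam(R ~ s(A), data = dat)
    AIC(m_full, m_A)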

Now I'm interpreting the model results, and they don't make much sense physically (i.e., the shape of the relationship between A and R). My concern is that the top model was selected because A is effectively a composite of all the predictors, so it acts as a proxy for them without incurring the penalty of including the extra terms.

My question: is there a test/method that accounts for cumulative collinearity across multiple predictors, like this? I'm sure that my terminology is incorrect, so if someone could tell me what I need to Google, that would be a great help.

Best Answer

When working with GAMs you need to consider more than just linear correlations among covariates: you also need to consider nonlinear correlations or dependencies, which in the smoothing/GAM context are known as concurvity.

I would also argue that your model selection process is flawed. I would fit the full model you are considering and include some form of shrinkage/regularization to perform selection, and I would look at measures of concurvity among the covariates when evaluating the model.

In the penalized GLM approach to GAMs, as implemented for example in Simon Wood's R package mgcv, a smoothness penalty that penalizes the wiggliness of each fitted smooth is included as a component of the penalized likelihood that is maximised during fitting. That smoothness penalty can force a wiggly term back towards a linear function. However, because of how the penalty is constructed, it cannot act on the perfectly smooth parts of the spline basis expansion; in other words, it cannot shrink a term beyond a linear function.
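To make that concrete, the quantity mgcv maximises is (in my notation, not a quote from the package documentation) a penalized log-likelihood of roughly the form

$$\ell_p(\boldsymbol{\beta}) \;=\; \ell(\boldsymbol{\beta}) \;-\; \tfrac{1}{2}\sum_j \lambda_j \,\boldsymbol{\beta}^{\mathsf{T}} \mathbf{S}_j \boldsymbol{\beta},$$

where $\ell$ is the model log-likelihood, $\mathbf{S}_j$ is the penalty matrix for the $j$-th smooth, and $\lambda_j$ is its smoothing parameter. Each $\mathbf{S}_j$ has a null space of functions with zero measured wiggliness (e.g. the linear part of the basis), and no value of $\lambda_j$ can shrink those components, which is why the standard penalty cannot take a term below a straight line.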

One approach that solves this problem is that of Marra and Wood (2011): they place a second penalty on the perfectly smooth parts of the basis (the penalty null space). With both penalties in place, the wiggly parts of the basis (the range space) and the perfectly smooth parts (the null space) can both be subject to shrinkage. The effect is that, if warranted, a spline's contribution to the fitted model can be shrunk to be effectively, though not exactly, zero.
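In mgcv this double-penalty approach is available by setting select = TRUE in gam(); shrinkage bases such as bs = "ts" or bs = "cs" achieve a similar effect on a per-term basis. A minimal sketch, reusing the question's hypothetical dat:

    library(mgcv)

    # select = TRUE adds the extra penalty on each smooth's null space
    # (Marra & Wood 2011), so whole terms can be shrunk towards zero
    m <- gam(R ~ s(A) + s(B) + s(C) + s(D),
             data = dat, method = "REML", select = TRUE)
    summary(m)  # terms shrunk out of the model show effective df near zero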

Your approach, by contrast, makes the explicit statement that the effects of the other covariates are exactly zero.

Using a more principled approach to feature selection in GAMs will also help avoid your estimated smooths changing too much between candidate models.

You can read a little more about concurvity in the help page for the R function mgcv::concurvity().
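A short usage sketch, assuming the double-penalty model m fitted above:

    # Values lie between 0 and 1; values near 1 indicate that a smooth
    # can be closely approximated by (smooth functions of) the other terms
    concurvity(m, full = TRUE)   # each term against all the others combined
    concurvity(m, full = FALSE)  # pairwise, term-by-term breakdown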

Marra, G. & Wood, S. N. Practical variable selection for generalized additive models. Comput. Stat. Data Anal. 55, 2372–2387 (2011).
