There is a collinearity effect in linear regression methods; for example, this question is about collinear predictors in a GLM. But GAMs are nonlinear, so do we need to check the collinearity of the independent variables before fitting a GAM?
Solved – Does GAM (Generalized Additive Model) have collinearity problem
generalized-linear-model, generalized-additive-model
Related Solutions
The reason for the first error is that `bio.percent_b500` doesn't have k - 1 unique values. If you set `k` to something lower for this spline then the model will fit. Why the 2-d thin-plate version works, IIRC, is because of the way it calculates a default for `k` when not supplied, and it must do this from the data. So for `model1`, setting `k = 9` works:

> model1 <- gam(honey.mean ~ s(week) + s(bio.percent_b500, k = 9), data = df)
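To see why the default basis size fails, you can count the unique covariate values yourself before choosing `k`. A quick diagnostic sketch, assuming your data frame is called `df` as above:

```r
library(mgcv)

# s(x) uses k = 10 by default for a 1-d smooth; the basis needs
# roughly k unique covariate values, so count them first
n_unique <- length(unique(df$bio.percent_b500))
n_unique  # if this is less than 10, the default basis is too big

# choose k no larger than the number of unique values
model1 <- gam(honey.mean ~ s(week) + s(bio.percent_b500, k = n_unique),
              data = df)
```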
In mgcv, a model with `s(x1, x2)` includes the main effects and the interaction. Note that `s()` assumes similar scaling in the variables, so I doubt you really want `s()`; try `te()` instead for a tensor-product smooth. If you want a decomposition into main effects and interaction then try:

model2 <- gam(honey.mean ~ ti(week) + ti(bio.percent_b500) +
                  ti(week, bio.percent_b500), data = df)
From the `summary()` output, it appears that the interaction is not required:
> summary(model2)
Family: gaussian
Link function: identity
Formula:
honey.mean ~ ti(week) + ti(bio.percent_b500) + ti(week, bio.percent_b500)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.2910 0.5263 23.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
ti(week) 3.753 3.963 22.572 <2e-16 ***
ti(bio.percent_b500) 3.833 3.974 2.250 0.0461 *
ti(week,bio.percent_b500) 1.246 1.448 1.299 0.2036
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.25 Deviance explained = 27.3%
GCV = 84.697 Scale est. = 81.884 n = 296
And you get similar information from the generalised likelihood ratio test:
> model1 <- gam(honey.mean ~ ti(week) + ti(bio.percent_b500, k = 9), data = df)
> anova(model1, model2, test = "LRT")
Analysis of Deviance Table
Model 1: honey.mean ~ ti(week) + ti(bio.percent_b500, k = 9)
Model 2: honey.mean ~ ti(week) + ti(bio.percent_b500) + ti(week, bio.percent_b500)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 285.65 23363
2 286.17 23433 -0.51811 -69.675 0.1831
Here `ti()` terms are tensor-product interaction terms in which the main/marginal effects for other terms in the model are removed. You don't strictly need `ti()` in `model1`, I think; `te()` or even `s()` should work, but the examples in mgcv all now use this form, so I'm going with that.
`model3` doesn't make much sense to me, especially if you use tensor products; the `s(week)` term is included in `s(week, bio.percent_b500)`, but as I mentioned, bivariate `s()` terms assume isotropy, so this may have overly constrained the `week` component, thus leaving room for the `s(week)` term to come in and explain something. In general you shouldn't need it if you get the bivariate term right. Whether you can get a true bivariate term right with your 500 m variable isn't clear to me.
Questions:
Q1
I doubt you want an interaction without main/marginal effects. Your `model2` includes both.
Q2
The model may be more complex in terms of the sets of functions you envisage, but you need to consider the bases produced for each smooth term. In addition, mgcv uses splines with penalties that perform smoothness selection, so you could end up with a model that is at first glance more complex, but whose smooth terms have been shrunk by the penalties such that they use fewer degrees of freedom once fitted.
Q3
Technically, only `model1` is really correctly specified. I don't think you want to assume isotropy for `model2`. I think you'll run into fitting problems with `model3`, `model6`, and `model7` in general; if you set up the bases for the tensor products correctly, you shouldn't need the separate `s(week)` terms in those models.
Q1 What's the difference between models 3 and 4?
Model 3 is a purely additive model
$$y = \alpha + f_1(x) + f_2(w) + \varepsilon$$
so we have a constant $\alpha$ plus the smooth effect of $x$ plus the smooth effect of $w$.
Model 4 is a smooth interaction of two continuous variables
$$y = \alpha + f_1(x, w) + \varepsilon$$
In a practical sense, Model 3 says that no matter what the effect of $w$ is, the effect of $x$ on the response stays the same; if we fix $x$ at some known value and vary $w$ over some range, the contribution from $f_1(x)$ to the fitted model remains the same. Verify this if you want by `predict()`-ing from model 3 for a fixed value of `x` and a few different values of `w`, using the `type = 'terms'` argument of the `predict()` method. You'll see a constant contribution to the fitted/predicted values for `s(x)`.
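That check can be sketched as follows, assuming the fitted additive model is called `m3` and its covariates are `x` and `w` (these names are illustrative; substitute your own):

```r
# fix x, vary w; in a purely additive model the s(x) column of the
# term-wise predictions should be identical across rows
nd <- data.frame(x = 0.5, w = c(0.1, 0.4, 0.9))
pt <- predict(m3, newdata = nd, type = "terms")
pt[, "s(x)"]
```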
This is not the case for model 4; this model says that the smooth effect of $x$ varies smoothly with the value of $w$, and vice versa.
Note that unless $x$ and $w$ are in the same units, or we expect the same wiggliness in both variables, you should be using `te()` to fit the interaction.
library('mgcv')     # gam(), te()
library('dplyr')    # mutate()
library('ggplot2')  # plotting

m4a <- gam(y ~ te(x, w), data = data.ex, method = 'REML')
pdata <- mutate(data.ex, Fittedm4a = predict(m4a))

ggplot(pdata, aes(x = x, y = y)) +
    geom_point() +
    geom_line(aes(y = Fittedm4a), col = 'red')
In one sense, model 4 is fitting
$$y = \alpha + f_1(x) + f_2(w) + f_3(x, w) + \varepsilon$$
where $f_3$ is the pure smooth interaction of $x$ and $w$, with the "main" smooth effects removed, for the sake of identifiability, from the basis of $f_3$. You can get this model via
m4b <- gam(y ~ ti(x) + ti(w) + ti(x, w), data = data.ex, method = 'REML')
but note this estimates 4 smoothness parameters:
- the one associated with the main smooth effect of $x$
- the one associated with the main smooth effect of $w$
- the one associated with the marginal smooth of $x$ in the interaction tensor product smooth
- the one associated with the marginal smooth of $w$ in the interaction tensor product smooth
The `te()` model contains just two smoothness parameters, one per marginal basis.
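You can confirm these counts directly from the fitted objects; in mgcv, each element of a model's `$sp` component is one estimated smoothing parameter (reusing `m4a` and `m4b` from above):

```r
# te(x, w): one smoothing parameter per marginal basis
length(m4a$sp)  # 2

# ti(x) + ti(w) + ti(x, w): two mains + two marginals of the interaction
length(m4b$sp)  # 4
```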
A fundamental problem with all these models is that the effect of $w$ is not strictly smooth; there's a discontinuity where the effect of $w$ falls to 0 (or 1 in `w2`). This shows up in your plots (and the one I show in detail here).
Q2 What, exactly, is "by" doing here?
`by`-variable smooths can do a number of different things depending on what you pass to the `by` argument. In your examples the `by` variable, $w$, is continuous. In that case what you get is a varying-coefficient model: a model in which the linear effect of $w$ varies smoothly with $x$. In equation terms, this is what your model 5, say, is doing:
$$y = \alpha + f_1(x)w + \varepsilon$$
If this isn't immediately clear (it wasn't to me when I first looked at these models): for a given value of $x$, we evaluate the smooth function at that value, and this then becomes the equivalent of $\beta_1 w$; in other words, it is the linear effect of $w$ at the given value of $x$, and those linear effects vary smoothly with $x$. See section 7.5.3 in the second edition of Simon Wood's book for a concrete example where the linear effect of a covariate varies as a smooth function of space (lat and long).
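A varying-coefficient fit of this form can be sketched as follows, reusing the `data.ex`, `x`, and `w` names from the earlier examples:

```r
library(mgcv)

# by = w multiplies the smooth of x by w, giving the model
#   y = alpha + f1(x) * w + error
m5 <- gam(y ~ s(x, by = w), data = data.ex, method = 'REML')
```

Note that wherever $w = 0$, the `by` term contributes exactly nothing to the linear predictor, which is what drives the model 5 vs model 9 difference discussed next.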
Q3 Why would there be such a notable difference between Models 5 and 9?
The difference between models 5 and 9 I think is simply due to multiplying by 0 or multiplying by 1. With the former, the effect of the only term in the model, $f_1(x)w$, is 0 because $f_1(x) \times 0 = 0$. In model 9 you have $f_1(x) \times 1 = f_1(x)$ in those areas where there is no contribution from the Gaussians of $w$. As $f_1(x)$ is an approximately exponential function, you get this superimposed upon the overall effect of $w$.
In other words, model 5 contains zero trend everywhere $w$ is 0, but model 9 includes an approximately exponential trend everywhere $w$ is 0 (or 1, in the case of `w2`), upon which the varying-coefficient effect of $w$ is superimposed.
Best Answer
GAM models can be afflicted by concurvity (the extension of GLM collinearity to GAM models).
As explained in the mgcv documentation (https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/concurvity.html), you can compute three different measures of concurvity for a GAM fitted with the mgcv package in R, all of which are bounded between 0 and 1 (with 0 indicating no concurvity).
Thus, you do have to check for the potential presence of concurvity in your GAM models by computing appropriate measures of concurvity and making sure they are not too high (i.e., not too close to 1). See also gam smoother vs parametric term (concurvity difference), https://jroy042.github.io/nonlinear/week3.html and https://eric-pedersen.github.io/mgcv-esa-workshop/slides/02-model_checking.html#/.