Generalized Additive Model – Understanding GAM Smoother vs Parametric Term and Concurvity Difference

Tags: generalized-additive-model, mgcv, multicollinearity

I have a GAM fitted with mgcv:

gam <- gam(sv ~ s(day, bs = "tp") + s(range, bs = "tp") + s(time, bs = "cc"), data = train.all, gamma = 1.4, method = "REML")

The s(range) term has an effective degrees of freedom (EDF) of 1, so I refitted the model with range as a parametric term:

gam1 <- gam(sv ~ s(day, bs = "tp") + range + s(time, bs = "cc"), data = train.all, gamma = 1.4, method = "REML")

There is very high concurvity (~0.85) between day and range in the first model (gam), but that goes away in the gam1 model. I am wondering why that is if s(range) is essentially the same as the parametric form of range. Is the concurvity/collinearity (not sure what to call it between a smoother and parametric term) still there, but simply not calculated by mgcv when it is a parametric term? Or are any co-dependence effects truly removed by simply changing "range" to its parametric form?

Best Answer

The concurvity has not gone away. It has moved from the named smooth terms to the parametric terms, which concurvity() reports collectively under the para column of the matrix (or matrices) it returns.

Here's a modified example from ?concurvity:

library("mgcv")
## simulate data with concurvity...
set.seed(8)
n <- 200
f2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 *
            (10 * x)^3 * (1 - x)^10
t <- sort(runif(n)) ## first covariate
## make covariate x a smooth function of t + noise...
x <- f2(t) + rnorm(n)*3
## simulate response dependent on t and x...
y <- sin(4*pi*t) + exp(x/20) + rnorm(n)*.3

## fit model...
b <- gam(y ~ s(t,k=15) + s(x,k=15), method="REML")

Now add a linear term and refit

x2 <- seq_len(n) + rnorm(n)*3
b2 <- update(b, . ~ . + x2)

Now look at the concurvity of the two models

## assess concurvity between each term and `rest of model'...
concurvity(b)
concurvity(b2)

These produce

> concurvity(b)
                para       s(t)      s(x)
worst    1.06587e-24 0.60269087 0.6026909
observed 1.06587e-24 0.09576829 0.5728602
estimate 1.06587e-24 0.24513981 0.4659564
> concurvity(b2)
              para      s(t)      s(x)
worst    0.9990068 0.9970541 0.6042295
observed 0.9990068 0.7866776 0.5733337
estimate 0.9990068 0.9111690 0.4668871

Note that x2 is a noisy version of the observation index and, because t is sorted, it is almost perfectly correlated with t:

> cor(t, x2)
[1] 0.9975977

and hence the concurvity reported in the para column has gone up from essentially 0 in b to almost 1 in b2.

Now if we add x2 as a smooth function instead...

concurvity(update(b, . ~ . + s(x2)))

we see that the para entries return to being very small and we get a measure for the spline term s(x2) directly

> concurvity(update(b, . ~ . + s(x2)))
                 para      s(t)      s(x)     s(x2)
worst    1.506201e-24 0.9977153 0.6264654 0.9976988
observed 1.506201e-24 0.9838018 0.5893737 0.9963857
estimate 1.506201e-24 0.9909506 0.4921592 0.9943990

This is simply how concurvity() handles parametric terms: it groups them all together in the para column, because the focus of the function is on the smooth terms.
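To see which smooth the parametric part is actually tangled up with, the pairwise form of concurvity() can help. The following is a minimal sketch that rebuilds the simulated data from the example so it runs on its own; it relies on the full = FALSE argument documented in ?concurvity, and the object names full_cc and pair_cc are just illustrative.

```r
library("mgcv")

## Rebuild the simulated data used in the example above
set.seed(8)
n <- 200
f2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
t <- sort(runif(n))                  # first covariate
x <- f2(t) + rnorm(n) * 3            # smooth function of t plus noise
y <- sin(4 * pi * t) + exp(x / 20) + rnorm(n) * 0.3
x2 <- seq_len(n) + rnorm(n) * 3      # linear covariate that tracks t

## Model with x2 entered as a parametric (linear) term
b2 <- gam(y ~ s(t, k = 15) + s(x, k = 15) + x2, method = "REML")

## Each term against the rest of the model (parametric terms grouped as "para")
full_cc <- concurvity(b2)

## Pairwise version: a list of term-by-term matrices,
## one each for the worst, observed and estimate measures
pair_cc <- concurvity(b2, full = FALSE)
round(pair_cc$worst, 3)
```

Applied to the question, calling concurvity(gam1) and concurvity(gam1, full = FALSE) and reading the para entries should show whether the dependence between day and range is still present once range is a linear term.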

Note: you are specifying gamma but fitting using REML. In versions of mgcv before 1.8-23, gamma only affected the GCV and UBRE/AIC methods of smoothness selection, so with those versions the argument has no effect on these model fits and can be removed. From version 1.8-23 of mgcv, the gamma argument also affects models fitted using REML/ML, where smoothness parameters are selected by REML/ML as if the sample size were $n/\gamma$ instead of $n$.
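If you are unsure which behaviour applies to your installation, a quick check is to fit the same model with and without gamma and compare the total effective degrees of freedom. This is only a sketch on made-up toy data (x and y here are simulated for illustration and have nothing to do with the question's data set).

```r
library("mgcv")

## Toy data, simulated purely for illustration
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n) * 0.3

## Same model and method, differing only in gamma
m1 <- gam(y ~ s(x), method = "REML")
m2 <- gam(y ~ s(x), method = "REML", gamma = 1.4)

## Installed mgcv version and the total EDF of each fit
packageVersion("mgcv")
c(gamma_1 = sum(m1$edf), gamma_1.4 = sum(m2$edf))
```

If the two EDF totals are identical, gamma is being ignored under REML in your version; if they differ, gamma is influencing the smoothness selection.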
