I'm having trouble understanding how to read/interpret the concurvity function output and exactly what to do about high values. My understanding of concurvity is it's the GAM equivalent to collinearity between 2 variables and "high" concurvity values between variables make it difficult to parse out the effects of one from the other on a response variable. I'm using the concurvity() function from mgcv to test for this and I think the values in the "worst" row in concurvity(fit4, full=T) are supposed to match with the "$worst" output from concurvity(fit4, full=F) (according to this example) – I'm reading down the matching variable name columns and across the rows by name, but no values seem to match (except the 1.00 value between "Site" and "para", aka my factor "Season"). For example, under 'worst' for DO = 0.3802842, that number (and others I checked) doesn't seem to appear anywhere in the output for concurvity(fit4, full=F).
I'm also not sure what to do about 2 factors (Site and Season) sharing perfect concurvity (1.00) or when any other variables are too high (remove one of them from the model?). The data is too large to share and I welcome any general comments as well!
Similar question here.
> str(toad)
'data.frame': 1299 obs. of 41 variables:
$ DateTime : POSIXct, format: "2008-01-10 12:04:00" "2008-01-10 12:13:00" ...
$ CYR : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Season : Factor w/ 2 levels "DRY","WET": 1 1 1 1 1 1 1 1 1 1 ...
$ Site : Factor w/ 47 levels "1","2","3","4",..: 10 11 12 13 14 15 16 17 9 1 ...
$ Latitude : num 25.6 25.6 25.6 25.6 25.6 ...
$ Longitude : num -80.3 -80.3 -80.3 -80.3 -80.3 ...
$ occur : int 0 0 0 0 0 1 0 0 0 0 ...
$ temp : num 23.3 23.7 22.8 22.6 22.7 ...
$ DO : num 6.18 7.56 5.62 5.85 6.49 6.29 5.21 5.36 6.66 5.9 ...
$ sal : num 30.7 30.9 30.2 30 29.8 ...
$ water_depth : int 80 72 98 95 97 85 78 90 58 74 ...
$ sed_depth : num 23 48 12 32 43 45 49 32 23 90 ...
$ total_ave_sg : num 2.52 23.3 1.8 19.6 3.01 ...
# occur = occurrence of a species
fit4 <- gam(occur ~ s(DO) + # <- Dissolved oxygen
s(sed_depth) + # <- depth
s(total_ave_sg) + # <- percent cover (0-100%)
Season +
s(CYR, bs = "cr") + # <- Calendar year
s(Site, bs = "re"), # <- Site/location as a random effect
family = binomial(link = "logit"),
method = "REML",
select = TRUE,
data = toad)
> concurvity(fit4, full=T)
para s(DO) s(sed_depth) s(total_ave_sg) s(CYR) s(Site)
worst 1 0.3802842 0.6217770 0.2986150 0.23531051 1.00000000
observed 1 0.1918930 0.5725485 0.2203862 0.06878478 0.09344120
estimate 1 0.2769450 0.4834160 0.1950572 0.07738156 0.07391145
> concurvity(fit4, full=F)
$worst
para s(DO) s(sed_depth) s(total_ave_sg) s(CYR) s(Site)
para 1.000000e+00 5.036488e-24 2.275604e-24 4.310459e-23 4.962884e-29 1.000000000
s(DO) 5.043914e-24 1.000000e+00 2.530075e-02 2.765640e-02 1.336079e-01 0.064033858
s(sed_depth) 2.275526e-24 2.530075e-02 1.000000e+00 2.315783e-02 7.889635e-02 0.567590052
s(total_ave_sg) 4.310350e-23 2.765640e-02 2.315783e-02 1.000000e+00 5.784493e-02 0.204371114
s(CYR) 5.003954e-29 1.336079e-01 7.889635e-02 5.784493e-02 1.000000e+00 0.008012877
s(Site) 1.000000e+00 6.403386e-02 5.675901e-01 2.043711e-01 8.012877e-03 1.000000000
$observed
para s(DO) s(sed_depth) s(total_ave_sg) s(CYR) s(Site)
para 1.000000e+00 3.254358e-29 1.216070e-30 2.341476e-29 1.862720e-32 0.0002245256
s(DO) 5.043914e-24 1.000000e+00 8.695274e-03 8.780887e-03 9.464506e-03 0.0138762214
s(sed_depth) 2.275526e-24 1.304780e-02 1.000000e+00 1.451928e-02 1.517301e-02 0.0749805944
s(total_ave_sg) 4.310350e-23 8.159973e-03 1.655912e-02 1.000000e+00 2.231508e-02 0.0057132166
s(CYR) 5.003954e-29 6.791088e-02 6.209756e-02 2.708953e-02 1.000000e+00 0.0005340506
s(Site) 1.000000e+00 4.310547e-02 4.945183e-01 1.643716e-01 1.663373e-03 1.0000000000
$estimate
para s(DO) s(sed_depth) s(total_ave_sg) s(CYR) s(Site)
para 1.000000e+00 2.270658e-26 1.012886e-26 2.278115e-25 4.514521e-30 0.0212936937
s(DO) 5.043914e-24 1.000000e+00 8.352295e-03 1.397560e-02 2.926287e-02 0.0069019311
s(sed_depth) 2.275526e-24 8.701278e-03 1.000000e+00 1.515966e-02 1.341093e-02 0.0320347100
s(total_ave_sg) 4.310350e-23 1.286278e-02 1.030622e-02 1.000000e+00 1.374981e-02 0.0122544865
s(CYR) 5.003954e-29 9.722621e-02 5.657602e-02 2.371417e-02 1.000000e+00 0.0003069276
s(Site) 1.000000e+00 3.677024e-02 4.125038e-01 1.235044e-01 1.300316e-03 1.0000000000
Best Answer
This was the answer I found:
One is doing X1 = f(X2) + f(X3) + f(X4) and the other... X1 = f(X2), X1 = f(X3), X1 = f(X4)...etc
The first one, with full = TRUE is similar to the VIF. The second one, with full = FALSE is like the Pearson correlation; it does pair-wise stuff. They are NOT supposed to match. High concurvity (=1.0) of parametric factors with low levels doesn't matter and other variables with high-ish concurvity can be taken out of the model.