Generalized Additive Model – Concurvity and GAMMs Analysis in R Using MGCV

concurvitygeneralized-additive-modelmgcv

I'm having trouble understanding how to read/interpret the concurvity function output and exactly what to do about high values. My understanding of concurvity is it's the GAM equivalent to collinearity between 2 variables and "high" concurvity values between variables make it difficult to parse out the effects of one from the other on a response variable. I'm using the concurvity() function from mgcv to test for this and I think the values in the "worst" row in concurvity(fit4, full=T) are supposed to match with the "$worst" output from concurvity(fit4, full=F) (according to this example) – I'm reading down the matching variable name columns and across the rows by name, but no values seem to match (except the 1.00 value between "Site" and "para", aka my factor "Season"). For example, under 'worst' for DO = 0.3802842, that number (and others I checked) doesn't seem to appear anywhere in the output for concurvity(fit4, full=F).

I'm also not sure what to do about 2 factors (Site and Season) sharing perfect concurvity (1.00) or when any other variables are too high (remove one of them from the model?). The data is too large to share and I welcome any general comments as well!

Similar question here.

> str(toad)
        'data.frame':   1299 obs. of  41 variables:
         $ DateTime             : POSIXct, format: "2008-01-10 12:04:00" "2008-01-10 12:13:00" ...
         $ CYR                  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
         $ Season               : Factor w/ 2 levels "DRY","WET": 1 1 1 1 1 1 1 1 1 1 ...
         $ Site                 : Factor w/ 47 levels "1","2","3","4",..: 10 11 12 13 14 15 16 17 9 1 ...
         $ Latitude             : num  25.6 25.6 25.6 25.6 25.6 ...
         $ Longitude            : num  -80.3 -80.3 -80.3 -80.3 -80.3 ...
         $ occur                : int  0 0 0 0 0 1 0 0 0 0 ...
         $ temp                 : num  23.3 23.7 22.8 22.6 22.7 ...
         $ DO                   : num  6.18 7.56 5.62 5.85 6.49 6.29 5.21 5.36 6.66 5.9 ...
         $ sal                  : num  30.7 30.9 30.2 30 29.8 ...
         $ water_depth          : int  80 72 98 95 97 85 78 90 58 74 ...
         $ sed_depth            : num  23 48 12 32 43 45 49 32 23 90 ...
         $ total_ave_sg         : num  2.52 23.3 1.8 19.6 3.01 ...
        


                  # occur = occurrence of a species      

        fit4 <- gam(occur ~ s(DO) + # <- Dissolved oxygen
                        s(sed_depth) + # <- depth
                        s(total_ave_sg) + # <- percent cover (0-100%)
                        Season +
                        s(CYR, bs = "cr") + # <- Calendar year
                        s(Site, bs = "re"), # <- Site/location as a random effect
                      family = binomial(link = "logit"),
                      method = "REML",
                      select = TRUE,
                      data = toad)
        
        >   concurvity(fit4, full=T)
                 para     s(DO) s(sed_depth) s(total_ave_sg)     s(CYR)    s(Site)
        worst       1 0.3802842    0.6217770       0.2986150 0.23531051 1.00000000
        observed    1 0.1918930    0.5725485       0.2203862 0.06878478 0.09344120
        estimate    1 0.2769450    0.4834160       0.1950572 0.07738156 0.07391145
        
    
        >   concurvity(fit4, full=F)
        $worst
                                para        s(DO) s(sed_depth) s(total_ave_sg)       s(CYR)     s(Site)
        para            1.000000e+00 5.036488e-24 2.275604e-24    4.310459e-23 4.962884e-29 1.000000000
        s(DO)           5.043914e-24 1.000000e+00 2.530075e-02    2.765640e-02 1.336079e-01 0.064033858
        s(sed_depth)    2.275526e-24 2.530075e-02 1.000000e+00    2.315783e-02 7.889635e-02 0.567590052
        s(total_ave_sg) 4.310350e-23 2.765640e-02 2.315783e-02    1.000000e+00 5.784493e-02 0.204371114
        s(CYR)          5.003954e-29 1.336079e-01 7.889635e-02    5.784493e-02 1.000000e+00 0.008012877
        s(Site)         1.000000e+00 6.403386e-02 5.675901e-01    2.043711e-01 8.012877e-03 1.000000000
        
        $observed
                                para        s(DO) s(sed_depth) s(total_ave_sg)       s(CYR)      s(Site)
        para            1.000000e+00 3.254358e-29 1.216070e-30    2.341476e-29 1.862720e-32 0.0002245256
        s(DO)           5.043914e-24 1.000000e+00 8.695274e-03    8.780887e-03 9.464506e-03 0.0138762214
        s(sed_depth)    2.275526e-24 1.304780e-02 1.000000e+00    1.451928e-02 1.517301e-02 0.0749805944
        s(total_ave_sg) 4.310350e-23 8.159973e-03 1.655912e-02    1.000000e+00 2.231508e-02 0.0057132166
        s(CYR)          5.003954e-29 6.791088e-02 6.209756e-02    2.708953e-02 1.000000e+00 0.0005340506
        s(Site)         1.000000e+00 4.310547e-02 4.945183e-01    1.643716e-01 1.663373e-03 1.0000000000
        
        $estimate
                                para        s(DO) s(sed_depth) s(total_ave_sg)       s(CYR)      s(Site)
        para            1.000000e+00 2.270658e-26 1.012886e-26    2.278115e-25 4.514521e-30 0.0212936937
        s(DO)           5.043914e-24 1.000000e+00 8.352295e-03    1.397560e-02 2.926287e-02 0.0069019311
        s(sed_depth)    2.275526e-24 8.701278e-03 1.000000e+00    1.515966e-02 1.341093e-02 0.0320347100
        s(total_ave_sg) 4.310350e-23 1.286278e-02 1.030622e-02    1.000000e+00 1.374981e-02 0.0122544865
        s(CYR)          5.003954e-29 9.722621e-02 5.657602e-02    2.371417e-02 1.000000e+00 0.0003069276
        s(Site)         1.000000e+00 3.677024e-02 4.125038e-01    1.235044e-01 1.300316e-03 1.0000000000

Best Answer

This was the answer I found:

One is doing X1 = f(X2) + f(X3) + f(X4) and the other... X1 = f(X2), X1 = f(X3), X1 = f(X4)...etc

The first one, with full = TRUE is similar to the VIF. The second one, with full = FALSE is like the Pearson correlation; it does pair-wise stuff. They are NOT supposed to match. High concurvity (=1.0) of parametric factors with low levels doesn't matter and other variables with high-ish concurvity can be taken out of the model.

Related Question