From version 1.8.8 of mgcv, predict.gam has gained an exclude argument which allows terms in the model, including random effects, to be zeroed out when predicting, without the dummy trick that was suggested previously:

predict.gam and predict.bam now accept an 'exclude' argument allowing terms (e.g. random effects) to be zeroed for prediction. For efficiency, smooth terms not in terms or in exclude are no longer evaluated, and are instead set to zero or not returned. See ?predict.gam.
library("mgcv")
require("nlme")
dum <- rep(1,18)
b1 <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
b2 <- gam(travel ~ s(Rail, bs="re"), data=Rail, method="REML")
head(predict(b1, newdata = cbind(Rail, dum = dum))) # ranefs on
head(predict(b1, newdata = cbind(Rail, dum = 0))) # ranefs off
head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy
> head(predict(b1, newdata = cbind(Rail, dum = dum))) # ranefs on
1 2 3 4 5 6
54.10852 54.10852 54.10852 31.96909 31.96909 31.96909
> head(predict(b1, newdata = cbind(Rail, dum = 0))) # ranefs off
1 2 3 4 5 6
66.5 66.5 66.5 66.5 66.5 66.5
> head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy
1 2 3 4 5 6
66.5 66.5 66.5 66.5 66.5 66.5
Older approach
Simon Wood has used the following simple example to check this is working:
library("mgcv")
require("nlme")
dum <- rep(1,18)
b <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
predict(b, newdata=data.frame(Rail="1", dum=0)) ## r.e. "turned off"
predict(b, newdata=data.frame(Rail="1", dum=1)) ## prediction with r.e
Which works for me. Likewise:
dum <- rep(1, NROW(na.omit(Orthodont)))
m <- gam(distance ~ s(age, bs = "re", by = dum) + Sex, data = Orthodont)
predict(m, data.frame(age = 8, Sex = "Female", dum = 1))
predict(m, data.frame(age = 8, Sex = "Female", dum = 0))
also works.
So I would check that the data you are supplying in newdata is what you think it is, as the problem may not be with VesselID; the error is coming from the function that would have been called by the predict() calls in the examples above, and Rail is a factor in the first example.
mgcv uses a thin plate spline basis as the default basis for its smooth terms. To be honest, it likely makes little difference in many applications which of these you choose, though in some situations, or with very large data sets, other basis types might be used to good effect. Thin plate splines tend to have better RMSE performance than the other three you mention, but are more computationally expensive to set up. Unless you have a specific reason to use the P- or B-spline bases, use thin plate splines; if you have a lot of data, consider the cubic regression spline basis instead, as it is cheaper to set up.
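As a sketch of how the basis choice is specified (the simulated data and model names here are my own, just for illustration):

```r
library("mgcv")

## simulated example data shipped with mgcv (columns y, x0..x3)
set.seed(1)
df <- gamSim(1, n = 400)

## same model, two different bases for the smooth of x2
m_tp <- gam(y ~ s(x2, bs = "tp"), data = df, method = "REML")  # thin plate (the default)
m_cr <- gam(y ~ s(x2, bs = "cr"), data = df, method = "REML")  # cubic regression spline

AIC(m_tp, m_cr)  # the fits are typically very similar
```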
k doesn't set the number of knots, at least not in the default thin plate spline basis. What k does is set the dimension of the basis expansion; you'll end up with k - 1 basis functions. In mgcv, Simon Wood uses a trick to reduce the rank of the basis. IIRC, in the usual thin plate spline basis there is a knot at each data location, but this is wasteful: once you've set up this large basis, you end up using far fewer degrees of freedom in the fitted function. What Simon does is eigendecompose the matrix of basis functions and retain the eigenvectors corresponding to the k - 1 largest eigenvalues. This has the effect of concentrating the main wiggliness "information" of the full basis in a reduced-rank form.
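A quick way to see that k sets the basis dimension rather than a knot count (a sketch with simulated data; with an intercept plus k - 1 basis functions, the model has k coefficients in total):

```r
library("mgcv")

set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

m <- gam(y ~ s(x, k = 10), data = df, method = "REML")
length(coef(m))  # 10: intercept + 9 (= k - 1) basis functions
summary(m)$edf   # effective degrees of freedom actually used; typically far fewer
```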
The choice of k is important; the default is arbitrary and something you want to check (see gam.check()). The critical observation is that you want to set k large enough to contain the envisioned dimensionality of the underlying function you are trying to recover from the data. In practice, one tends to fit with a modest k given the data set size and then use gam.check() on the resulting model to check whether k was large enough. If it wasn't, increase k and refit. Rinse and repeat...
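That check-and-refit loop might look like the following (a sketch; the simulated data and object names are illustrative):

```r
library("mgcv")

set.seed(42)
df <- data.frame(x = runif(500))
df$y <- sin(6 * pi * df$x) + rnorm(500, sd = 0.2)  # quite a wiggly truth

m <- gam(y ~ s(x, k = 5), data = df, method = "REML")
gam.check(m)   # a low p-value with edf close to k' suggests k is too small

m2 <- gam(y ~ s(x, k = 20), data = df, method = "REML")
gam.check(m2)  # repeat until the basis dimension check looks OK
```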
You are most likely going to want to fit the model using REML (or ML) smoothness selection via method = "REML" or method = "ML": this treats the model as a mixed effects one, with the wiggly parts of the spline bases treated as special random effects terms. Simon Wood has shown that REML (or ML) selection performs better than GCV, which can undersmooth in situations where the objective function is flat around the optimal smoothness parameter value. The ridge penalty mentioned by @generic_user is taken care of for you, so you can ignore this part of setting up the model.
Best Answer
The glht() function for generalized linear hypotheses from the multcomp package can be used to carry out various kinds of contrasts using a range of different p-value adjustments. The contrasts you are looking for are also called "Tukey" contrasts for all pairwise comparisons. The p-value adjustments include single-step, Shaffer, Westfall, and all p.adjust methods; see ?summary.glht.

As @GavinSimpson pointed out in the comments: for gam() objects from mgcv this does not work out of the box but requires some manual intervention. For lmer() from lme4 everything works conveniently. I illustrate below how both packages can be used with multcomp to obtain equivalent results. For illustration I use the sleepstudy data from lme4, but collapse the numeric regressor Days to a three-level factor (merely for illustration purposes):
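The setup code from the original answer isn't reproduced above; a sketch of that step and the lmer() fit might look like this (the cut points used to collapse Days are my own illustrative assumption, not necessarily those of the original answer):

```r
library("lme4")
library("multcomp")

data("sleepstudy", package = "lme4")
## collapse the numeric Days regressor to a three-level factor
## (breakpoints are illustrative assumptions)
sleepstudy$Days <- cut(sleepstudy$Days, breaks = c(-Inf, 2.5, 5.5, Inf),
                       labels = c("low", "med", "high"))

m1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(glht(m1, linfct = mcp(Days = "Tukey")))  # all pairwise comparisons
```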
glht() can be used to set up all pairwise (aka Tukey) contrasts for the Days factor. The summary() method then applies the p-value adjustment (single-step, by default).

The same model can be fitted with gam() as described in the question. However, the mcp(Days = "Tukey") method for describing the Tukey contrasts does not cooperate with gam() output and hence fails. It is not difficult, though (albeit a bit technical and tedious), to set up the contrast matrix by hand.
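The gam() fit and the hand-built contrast matrix aren't shown above; a sketch of both steps follows. The coefficient layout assumed here — intercept, Daysmed, Dayshigh, then the s(Subject) random effect coefficients — is an assumption you should verify with names(coef(m2)), and the cut points for Days are illustrative:

```r
library("mgcv")
library("multcomp")
library("lme4")  # only for the sleepstudy data

data("sleepstudy", package = "lme4")
sleepstudy$Days <- cut(sleepstudy$Days, breaks = c(-Inf, 2.5, 5.5, Inf),
                       labels = c("low", "med", "high"))  # illustrative cut points

m2 <- gam(Reaction ~ Days + s(Subject, bs = "re"),
          data = sleepstudy, method = "REML")

## contrasts on the Days fixed effects; pad the columns for the
## random effect coefficients with zeros
K0 <- rbind("med - low"  = c(0,  1, 0),
            "high - low" = c(0,  0, 1),
            "high - med" = c(0, -1, 1))
K  <- cbind(K0, matrix(0, nrow = 3, ncol = length(coef(m2)) - 3))
summary(glht(m2, linfct = K))
```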
The first columns of the contrast matrix show what is needed here: as the low coefficient is constrained to zero in the model, med - low is simply med, and analogously for high - low. The last row then shows the contrast for high - med. With this contrast matrix we can conduct the pairwise comparisons with glht().
Another quite convenient way to indicate the contrasts to be tested is through character strings. These can set up linear functions based on the effect names from names(coef(m2)). For factors with fewer levels (and hence fewer Tukey contrasts) this works quite nicely, but if the comparisons become more complex it's possibly easier to construct the contrast matrix as above.
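A sketch of that character-string interface (the coefficient names Daysmed and Dayshigh are assumptions — check names(coef(m2)) — and the parsing may need adjustment if some coefficient names are not syntactically valid R names):

```r
library("multcomp")
## assumes m2 from the gam() fit above; each string is a linear
## function of the named coefficients, tested against zero
summary(glht(m2, linfct = c("Daysmed = 0",
                            "Dayshigh = 0",
                            "Dayshigh - Daysmed = 0")))
```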