Find difference in means at particular points in a by-factor GAMM

data visualizationgeneralized-additive-modelmgcvr

This is a bit long but I wanted to give enough photos and code to explain what I'm doing and the issues I'm having. The main question is in the last paragraph.

I'm examining differences in male and female growth in a wildlife species using GAMMs, and while the GAMM output compares male and female growth across the curve (via the parametric coefficient telling me the significance of the partial effect of sex), I am wondering how I can compare the difference at different ages/x values. It was suggested I use this method because I have some repeat measures of individuals but not all repeats (we captured and measured whoever we saw on measurement day each year, so some animals are measured once in their life and some are represented in >5 age categories).

I've tried a few methods but all are giving me a different answer than my GAM is, and I'm wondering which is correct.

 My code is as follows: 

gamm(data = data, Weight ~ Sex + s(Age, by = Sex), random = list(SubjectID = ~1), method = "REML")

This is the output:

GAMM output

And here are the effects plots: Effects Plots

(Note that some of the graphs and outputs say "AgeReset10Up", this is just because I have a few age categories, I changed it to "Age" in a few graphs and in this description for ease of reading but they're the same thing).

The parametric coefficient and partial effects plot tells me that there is a significant effect of sex, saying that over the growth curve females weigh more than males, but I want to know at what ages this is actually true, so that I can say for example females are significantly bigger than males at age x but not age y. I tried to find this out in three ways, but all say that there are no differences.

First, when I make a ggplot of the raw data and the GAM line overlapping, the CIs overlap at all points, which seems to indicate there is no significant difference in weight at any age.
GGPlot

Second, when I run a linear regression
lmer(Weight ~ Sex*Age + (1|SubjectID), data = data), there is no point at which the age*sex interaction that is significant, see below. Running the LMER was suggested to me as a way to find difference in means at each age, kind of like running individual t-tests (the reference age is age 10 btw). 
LMER output

And third, I had used Gavin Simpson's awesome function and tutorial to find the differences in the smooths, and when I posted a question about it he kindly suggested that I modify that code by not zeroing out the intercept and parametric lines and that that would give me a difference in means when the CI excludes 0, like in the post. I ran that, and there is no point at which the CI excludes 0, suggesting the means are the same at every point.

How is it that there's an overall significant effect of sex, with females bigger than males in the GAM, but when examining the actual mean at each age using several different methods there's no difference? A friend who specializes in GAMs said that the GAM output saying there's a significant difference is right, and that the other methods must be off. So am I just misunderstanding something? Is there a better way to find means?

Best Answer

I believe this

... and there is no point at which the CI excludes 0, suggesting the means are the same at every point

is one source of confusion.

On the face of it, the data and smooths suggest that the difference of means is not exactly zero. Put another way, whatever difference in means there is between the two groups is small relative to the uncertainty in the estimates of those mean. You observed a difference but cannot say with any level of confidence that this observed difference is a real difference in the population of subjects you might have sampled.

Why are each of the model terms significant but the difference not significant? This comes from propagating the uncertainty in all of the things you estimated into the computation of the difference of means as a function of age.

Controlling for the smooth effect of Age by sex, you detect a small difference between the two groups in the mean response. The two smooth terms were assessed to be unlikely to have arise if the true functions were flat constant functions.

Computing a difference of means as a function of age requires you to bring together all these estimated effects and propagate their uncertainties into your estimates of the difference in the response between the two groups as a function of age.

You could try a more direct estimation of the thing you are interested in by using the ordered-factor method.

data <- transform(data, oSex = ordered(Sex))
m <- gam(Weight ~ oSex + s(Age) + s(Age, by = oSex) + s(Subject, bs = "re"),
         method = "REML"

where s(Age) is the smooth effect of Age for the reference level of Sex, while s(Age, by = oSex) is the smooth difference between the smooth effect of Age in the reference level and the other level of Sex.

This shouldn't change the outcome much though; I suspect you'll see a small effect of Sex overall once you condition on the smooth effects of Age in both groups, but you won't find large ("significant") differences when you ask the more specific question of "At what ages are the groups different?".

Related Question