Solved – GAM with categorical variables – interpretation

categorical-data, generalized-additive-model, interpretation, mgcv

I want to use a GAM to analyze my experimental data. In my experiment, participants play a game for 40 experimental years. In total I have 6 different conditions in a between-subjects design: each participant experiences only one condition.

I want to analyze a variable which is the output of participants' annual decisions. What I want to show is that:

  1. Year has an impact on this output variable for all the conditions.
  2. The path of this output variable over 40 experimental years is significantly different across 6 conditions.

To this end, I run the following model and get this result:

[Image: model specification and summary output]
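For context, a factor-by-variable GAM of this kind is typically specified in mgcv roughly as follows (the data frame name `dat` is an assumption; `pr`, `Year`, and `Abbr` are as described in the question):

```r
library(mgcv)

# `dat` is an assumed data frame with columns pr, Year, and the
# 6-level factor Abbr; k = 10 and method = "REML" are common choices,
# not necessarily what was used in the question.
m <- gam(pr ~ Abbr + s(Year, by = Abbr, k = 10),
         data = dat, method = "REML")
summary(m)  # parametric coefficients + approximate smooth significance
```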

Abbr is my categorical variable, which has 6 levels.
My question is mostly about interpretation. The "Approximate significance of smooth terms" part of the table tells me that Year has a significant effect on pr in all conditions. Is that correct?

What does the "Parametric coefficients" part tell me?
How can I compare these 6 conditions and tell that the paths are significantly different?

And this is the plot of the GAM:
[Image: plots of the estimated smooths]


Thank you for the great reply! It is much clearer now in terms of the smooths.

What I am trying to do is test whether the smooths differ significantly across my 6 categories. To this end, I followed your blog post and ended up with an ordered factor in the following model and results, where OC is my ordered categorical variable:
[Image: ordered-factor model and summary output]
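For reference, the ordered-factor (difference smooth) setup described here usually looks something like this in mgcv (the data frame name `dat` is an assumption):

```r
library(mgcv)

# Convert the 6-level factor to an ordered factor with treatment
# contrasts; the by-smooths then estimate differences from the
# reference level (F) rather than separate smooths per level.
dat$OC <- ordered(dat$Abbr)
contrasts(dat$OC) <- "contr.treatment"

m_ord <- gam(pr ~ OC +               # constant group shifts
                  s(Year) +          # smooth for the reference level F
                  s(Year, by = OC),  # difference smooths vs. F
             data = dat, method = "REML")
summary(m_ord)  # p-values on the by-smooths test "difference = 0"
```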

Now, I can say that there is a significant difference between the smooths of F-G1P1 and F-G5P1.
And when I plot, I get the following:
[Image: reference smooth and difference smooths]
The first plot gives me the smooth for my category F (reference category), while others are the difference smooths. How should I interpret them?

Now, I would like to extend this analysis to a pairwise comparison between my 6 categories. I tried to follow this post but didn't succeed. Do all my categories need to have the same number of observations for these comparisons?
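Groups do not need equal numbers of observations. One common approach (a sketch, not necessarily the exact method in the linked post) computes a pairwise difference smooth from the fitted factor-by model via the linear-predictor matrix, evaluated on a common grid of Year:

```r
library(mgcv)

# Assumes a fitted factor-by model such as
#   m <- gam(pr ~ Abbr + s(Year, by = Abbr), data = dat, method = "REML")
# Level names "F" and "G1P1" are taken from the question; adjust as needed.
newd <- expand.grid(Year = seq(1, 40, length.out = 200),
                    Abbr = c("F", "G1P1"))
Xp <- predict(m, newd, type = "lpmatrix")

# Rows of the prediction matrix for each level on the same Year grid;
# subtracting them yields the difference between the two group curves.
Xdiff <- Xp[newd$Abbr == "F", ] - Xp[newd$Abbr == "G1P1", ]

est <- drop(Xdiff %*% coef(m))                     # estimated difference
se  <- sqrt(rowSums((Xdiff %*% vcov(m)) * Xdiff))  # pointwise std. errors

# Where est +/- 2 * se excludes zero, the two curves differ at that Year.
```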

Best Answer

In a factor by variable smooth, as with other simple smooths, the bases for the smooths are subject to identifiability constraints. If you naively computed a basis of the required dimension, then given the defaults for s() you would get two basis functions in the null space of the smoothness penalty:

  1. a flat, horizontal function, and
  2. a linear function.

Both are perfectly smooth and hence not penalised by the smoothness penalty. The flat function is the same thing as the model intercept. The identifiability issue arises because you could add any value to the estimated coefficient for the intercept (constant) term, subtract the same value from the coefficient for the flat, horizontal basis function, and get the same fit from a different model. As there is an infinite set of numbers you could add to the intercept, you have an infinity of models.

This is not good, so to alleviate the issue an identifiability constraint is applied. There are several such constraints, but the one that leads to good confidence-interval coverage properties is the sum-to-zero constraint: over the range of the covariate, the smooth is constrained to sum to zero. This centres the smooth about zero, and as a result the flat function is removed from the basis of the smooth.
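The sum-to-zero constraint is easy to see in a toy fit; this self-contained sketch (simulated data, not the question's) shows the estimated smooth, evaluated at the observed covariate values, summing to numerically zero:

```r
library(mgcv)

set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)

m0 <- gam(y ~ s(x), data = toy, method = "REML")

# The smooth's contributions sum to ~0 over the data, so the overall
# mean of y is carried entirely by the model intercept.
sum(predict(m0, type = "terms")[, "s(x)"])  # effectively zero
```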

Now, in the case of factor by variables, because each smooth is centred about zero, the smooth itself contains no easy way to account for differences between the levels in terms of the mean response; say samples from condition F had, on average, larger values of pr than those from condition G1. We'd want the spline for F to be shifted up by some constant amount relative to G1. That's what the parametric terms do, and they come from the + as.factor(Abbr) term in the model formula. Each parametric coefficient represents the deviation of the indicated group from the mean of the reference group (in your case the level not listed, F). If you didn't include this term in the model, the smooths might become more wiggly as they tried to account for the mean shifts of the groups, which is not something you want.

The other main type of smooth you might use for this kind of model is the random factor smooth basis, bs = "fs". This basis/smooth includes an intercept for each level of the grouping factor and as such doesn't need the parametric terms.
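A minimal sketch of that alternative, using the question's variable names (the data frame name `dat` is an assumption):

```r
library(mgcv)

# Factor-smooth interaction: one wiggly curve per level of Abbr,
# sharing a common smoothness parameter, with per-level intercepts
# absorbed into the smooth itself (so no separate parametric Abbr term).
m_fs <- gam(pr ~ s(Year, Abbr, bs = "fs", k = 10),
            data = dat, method = "REML")
```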

The approximate significance of the smooths is a test of the null hypothesis that the indicated smooth is a flat, zero function. Put another way, it is the smooth equivalent of a t or Wald Z test of the null hypothesis that a coefficient in a linear model or GLM is equal to zero (i.e. has no effect). There is strong evidence against this null for each of your smooths, which is reflected in the strong non-linearity of the estimated smooths and in the fact that the confidence intervals for the smooths do not include 0 over most of the range of Year.