Regression – Discrepancy Between Model Selection Based on REML Score vs. Explained Deviance in GAMs

deviance, generalized-additive-model, model-selection, regression, reml

Will be grateful for insights into the issue below!

I have two explanatory GAMs below (in this example implemented with mgcv), where the effects of x1 and x2 are of interest. x1 is air temperature, x2 is the humidex (a composite of air temperature and humidity), time is the day number in a 14-year time series, z is a (possible) confounder (namely rainfall), and count_y is the daily number of deaths.

For the sake of completeness (though not, I think, relevant to the question), the models include a distributed lag term (set up as a 7-column matrix, as in the code by Simon Wood in the gamair package).

The models are as follows:

m1 <- gam(count_y ~ s(time) + te(x1, lag) + te(z, lag), 
                    data = dat1, family = nb, method = 'REML', select = TRUE)
m2 <- gam(count_y ~ s(time) + te(x2, lag) + te(z, lag), 
                    data = dat2, family = nb, method = 'REML', select = TRUE)
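For reference, the lag matrix can be constructed along these lines (a sketch following the distributed-lag example in the gamair package; the input vector name air_temp and the 7-day lag width are assumptions, and the lag index matrix simply repeats 0:6 on every row):

```r
## Build a matrix whose i-th column holds the series lagged by (i - 1) days,
## padded with NA at the start (as in Simon Wood's gamair lag example).
lagard <- function(x, n.lag = 7) {
  n <- length(x)
  X <- matrix(NA, n, n.lag)
  for (i in 1:n.lag) X[i:n, i] <- x[i:n - i + 1]  # column i = lag (i - 1)
  X
}

## Hypothetical setup: air_temp is the raw daily temperature series.
dat1$x1  <- lagard(air_temp)                           # lagged covariate matrix
dat1$lag <- matrix(0:6, nrow(dat1), 7, byrow = TRUE)   # lag index, one row per day
```

With this setup, te(x1, lag) sums the smooth of temperature over the 7 lag columns via mgcv's linear-functional ("summation convention") handling of matrix arguments.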

Note that n for m1 = 2178 and n for m2 = 3256.

The reason for the difference in sample size is that each model is only run for the hotter season in each year, as defined by the x1 and x2 variables; i.e. the hotter season for x1 differs from that for x2, although there is considerable overlap between them.

Partial output from summary() for m1:

Approximate significance of smooth terms:
                                   edf Ref.df Chi.sq  p-value    
s(time)                          5.715     79 32.076 2.83e-06 ***
te(x1,lag)                       1.642     45  2.493    0.176    
te(z,lag)                        4.937     44 27.771 7.00e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0258   Deviance explained = 2.91%
-REML =   3615  Scale est. = 1         n = 2178

And for m2:

Approximate significance of smooth terms:
                                      edf Ref.df  Chi.sq p-value    
s(time)                          30.82300    199 199.916  <2e-16 ***
te(x2,lag)                        2.40194     44   4.662  0.0782 .  
te(z,lag)                         0.01763     44   0.014  0.4551    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0606   Deviance explained = 6.26%
-REML = 5506.6  Scale est. = 1         n = 3256

As can be seen, the p-value for x2 is closer to significance than that for x1. The deviance explained favours model m2. However, the (negative) REML score for m2 is higher than for m1, therefore favouring model m1 (I think?).

I have two questions:

  1. Is it (in principle, and ignoring the deviance and REML scores for a moment) valid to favour m2 based on the lower p-value of x2?
  2. How come REML and explained deviance seem to favour different models, and which one is more appropriate in this case?

Best Answer

Q1

This is not a valid interpretation of those statistical tests: you are inappropriately inverting the statistical hypotheses being tested.

The tests for the smooths in summary.gam() are based on a $\text{H}_{0}$ that $\sum_{k=1}^{7} f_j(\texttt{x}_{ji}, \texttt{lag}_{ik}) = \boldsymbol{0}$ (I think I have that right).

An interpretation of a p value is as a weight of evidence in the data/model against $\text{H}_{0}$. It is not a weight of evidence in support of the alternative hypothesis, so these values shouldn't be used to justify favouring one model term over another. This is the same problem as interpreting a p value in any other statistical test as saying something about the alternative hypothesis being true: in invoking the test in the first place we have to assume that $\text{H}_{0}$ is true, so we can't simultaneously treat the p value as indicating anything about $\text{H}_{1}$.

Q2

The REML score reported in the output is the negative of the restricted log-likelihood of the data given the model parameters. As such, the REML score depends on the data and because your two models are fitted to different data the comparison by their REML scores is meaningless, in the same way that comparing these two models by AIC would be meaningless.
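If you do want a likelihood-based comparison, one option (a sketch, not a prescription) is to refit both models to the same observations, e.g. the days common to both seasonal subsets, and compare by AIC; fitting with method = 'ML' is often preferred over 'REML' when comparing models that differ in their model terms. Here date is an assumed common day identifier; the other names are taken from the question:

```r
## Hypothetical: restrict to days present in both seasonal subsets so the
## response values (and n) are identical across the two fits.
common <- merge(dat1, dat2[, c("date", "x2")], by = "date")

m1c <- gam(count_y ~ s(time) + te(x1, lag) + te(z, lag),
           data = common, family = nb, method = 'ML', select = TRUE)
m2c <- gam(count_y ~ s(time) + te(x2, lag) + te(z, lag),
           data = common, family = nb, method = 'ML', select = TRUE)

AIC(m1c, m2c)  # comparable now: same response, same n
```

Whether such a restriction makes scientific sense for your seasonal definitions is a separate judgement; it only makes the scores comparable.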
