Time Series – Importance of Hessian in GAM Output for Significance of Smooths and AIC Comparison

generalized-additive-model, hessian, mgcv, mortality, time-series

Would really appreciate some help as I don't know how to proceed given my GAM results. I haven't had much luck getting answers to my GAM questions (not sure if there is something wrong with the way I am asking questions), and don't know anyone working with GAMs.

I have run two GAMs (in R/mgcv) which I would like to compare with AIC:

stime1 <- gam(deaths~s(time, k = 700, bs='tp') + heap + 
                     te(temp_max, lag, k=c(15,6), bs='tp')+ 
                     te(precip_daily_total, lag, k=c(7,6), bs='tp'), 
                     data = dat, family = nb, method = 'REML', select = TRUE)

stime2 <- gam(deaths~s(time, k = 700, bs='tp') + heap + 
                     te(temp_mean, lag, k=c(17,6), bs='tp')+ 
                     te(precip_daily_total, lag, k=c(7,6), bs='tp'), 
                     data = dat, family = nb, method = 'REML', select = TRUE)
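For the actual comparison I was planning something simple along these lines (just a sketch, assuming both models fit without warnings):

AIC(stime1, stime2)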

The GAMs use daily time series data (over 14 years, no missing data) to predict the number of deaths from temperature and precipitation. The models include a distributed lag term (6 days) because deaths may occur over a period of time following (high) exposure.

Lag was set up as 7-column matrices of the lagged weather variables, following the example in Simon Wood's 2017 book (p. 349). Full code can be found at the bottom of the post. The data file can be accessed on GitHub.

For stime2, the output from gam.check does not mention anything about the Hessian, whereas the output for stime1 does (both outputs are shown below). So I am assuming that the Hessian was not calculated for stime2.

I am new to GAMs, though I have been learning a great deal at the conceptual level from @gavinsimpson's webinar and @noamross' DataCamp course. I also have Simon Wood's book, but am really struggling with the mathematical notation in it. As yet, I only understand a little about how GAMs work "under the hood". However, my understanding is that a Hessian matrix contains the partial second derivatives of the smooth functions. Since the second derivative is used to penalise wiggliness, it sounds quite serious to me when the Hessian is not computed.

What's happening here? Does this matter for comparison of models with AIC and does it matter for the interpretation of the significance of smooths?

Output from gam.check for stime1:

Method: REML   Optimizer: outer newton
full convergence after 12 iterations.
Gradient range [-0.002708667,0.001061138]
(score 8906.308 & scale 1).
Hessian positive definite, eigenvalue range [4.623881e-07,18.40785].
Model rank =  993 / 993 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                               k'    edf k-index p-value  
s(time)                    699.00  40.70    0.96   0.065 .
te(temp_max,lag)            89.00   3.99      NA      NA  
te(precip_daily_total,lag)  36.00   1.88      NA      NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Output from gam.check for stime2:

Method: REML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-2.069522e-05,4.700218e-05]
(score 8900.599 & scale 1).
eigenvalue range [-6.126244e-06,20.16501].
Model rank =  1005 / 1005 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                               k'    edf k-index p-value  
s(time)                    699.00  29.55    0.95    0.03 *
te(temp_mean,lag)          101.00   5.69      NA      NA  
te(precip_daily_total,lag)  36.00   1.99      NA      NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Full code:

rm(list = ls()) # start with clean environment

library(readr)
library(mgcv)

df <- read_rds("data_crossvalidated_post2.rds")

# Create matrices for lagged weather variables (6 day lags) based on example by Simon Wood 
# in his 2017 book ("Generalized additive models: an introduction with R", p. 349) and 
# gamair package documentation (https://cran.r-project.org/web/packages/gamair/gamair.pdf, p. 54)

lagard <- function(x, n.lag = 7) {
  n <- length(x); X <- matrix(NA, n, n.lag)
  # note: i:n-i+1 parses as (i:n) - i + 1 = 1:(n - i + 1), so column i holds x lagged by i - 1 days
  for (i in 1:n.lag) X[i:n, i] <- x[i:n-i+1]
  X
}
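
# Quick illustrative check of lagard() on a toy vector (not part of the analysis):
# column i holds the series lagged by i - 1 steps, padded with NA at the top
lagard(1:5, n.lag = 3)
#      [,1] [,2] [,3]
# [1,]    1   NA   NA
# [2,]    2    1   NA
# [3,]    3    2    1
# [4,]    4    3    2
# [5,]    5    4    3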

dat <- list(lag = matrix(0:6, nrow(df), 7, byrow = TRUE),
            deaths = df$deaths_total, doy = df$doy, year = df$year,
            month = df$month, weekday = df$weekday, week = df$week,
            monthday = df$monthday, time = df$time, heap = df$heap,
            heap_bin = df$heap_bin,
            precip_hourly_dailysum = df$precip_hourly_dailysum)
dat$temp_max <- lagard(df$temp_max)
dat$temp_min <- lagard(df$temp_min)
dat$temp_mean <- lagard(df$temp_mean)
dat$wbgt_max <- lagard(df$wbgt_max)
dat$wbgt_mean <- lagard(df$wbgt_mean)
dat$wbgt_min <- lagard(df$wbgt_min)
dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max)
dat$sh_mean <- lagard(df$sh_mean)
dat$solar_mean <- lagard(df$solar_mean)
dat$wind2m_mean <- lagard(df$wind2m_mean)
dat$sh_max <- lagard(df$sh_max)
dat$solar_max <- lagard(df$solar_max)
dat$wind2m_max <- lagard(df$wind2m_max)
dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean)
dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum)
dat$precip_hourly <- lagard(df$precip_hourly)
dat$precip_daily_total <- lagard(df$precip_daily_total)
dat$temp <- lagard(df$temp)
dat$sh <- lagard(df$sh)
dat$rh <- lagard(df$rh)
dat$solar <- lagard(df$solar)
dat$wind2m <- lagard(df$wind2m)

knots <- list(doy=c(0.5, 366.5)) # set knots for cyclic spline for doy

stime1 <- gam(deaths~s(time, k = 700, bs='tp') + heap + 
                     te(temp_max, lag, k=c(15,6), bs='tp')+ 
                     te(precip_daily_total, lag, k=c(7,6), bs='tp'), 
                     data = dat, family = nb, method = 'REML', select = TRUE)

stime2 <- gam(deaths~s(time, k = 700, bs='tp') + heap + 
                     te(temp_mean, lag, k=c(17,6), bs='tp')+ 
                     te(precip_daily_total, lag, k=c(7,6), bs='tp'), 
                     data = dat, family = nb, method = 'REML', select = TRUE)

Best Answer

It's not that the Hessian wasn't computed, but that it was determined not to be positive definite.

The Hessian is not the matrix of (partial) second derivatives of the estimated smooth functions themselves. It is the matrix of second derivatives of the log-likelihood, evaluated at the maximum likelihood estimates of the model coefficients. If we take the inverse of the negative of the Hessian we get the covariance matrix of the model coefficients, and the square roots of the diagonal of that covariance matrix give us the estimated standard errors of the individual model coefficients.
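As a rough illustration of that last step in mgcv (just a sketch; note that vcov() on a gam object returns mgcv's Bayesian posterior covariance matrix for the coefficients rather than a raw inverse Hessian, but the idea is the same):

Vb <- vcov(stime1)    # covariance matrix of the model coefficients
se <- sqrt(diag(Vb))  # standard errors of the individual coefficients
head(se)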

Short of direct access to Simon Wood's brain, one often has to read the mgcv source code to figure out what is happening.

Reading mgcv::gam.check we see:

        if (!is.null(b$outer.info)) {
            if (b$optimizer[2] %in% c("newton", "bfgs")) {
                boi <- b$outer.info
                cat("\n", boi$conv, " after ", boi$iter, " iteration", 
                  sep = "")
                if (boi$iter == 1) 
                  cat(".")
                else cat("s.")
                cat("\nGradient range [", min(boi$grad), ",", 
                  max(boi$grad), "]", sep = "")
                cat("\n(score ", b$gcv.ubre, " & scale ", b$sig2, 
                  ").", sep = "")
                ev <- eigen(boi$hess)$values
                if (min(ev) > 0)  ## <---------------------------- HERE
                  cat("\nHessian positive definite, ")
                else cat("\n")
                cat("eigenvalue range [", min(ev), ",", max(ev), 
                  "].\n", sep = "")
            }
            else {
                cat("\n")
                print(b$outer.info)
            }
        }

This is the branch of the code you appear to be in (GAM estimated by the outer iteration algorithm, which is the default). Look at the line indicated by ## <------------------ HERE. (Note that the line above it clearly indicates that the Hessian is computed; otherwise there would be other errors shown, not least from eigen(), before the entire output was displayed.)

This is a check on the eigenvalues of the Hessian evaluated at the MLE of the model parameters. If the minimum eigenvalue is > 0, the message (it's not actually a message in R terms) "Hessian positive definite" is printed. If you look at the range of the eigenvalues printed in the output, the smallest eigenvalue in the first model is slightly positive, so the check for positive definiteness passes and the message is printed. In the second model, the smallest eigenvalue is slightly negative, so the check fails and the message is not printed, indicating that, as far as the code is concerned, the Hessian is not positive definite.
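You can inspect those eigenvalues yourself from the fitted objects, following what gam.check() does in the code quoted above (a sketch using the outer.info component that code relies on):

ev1 <- eigen(stime1$outer.info$hess)$values
ev2 <- eigen(stime2$outer.info$hess)$values
range(ev1)     # smallest value slightly positive: check passes
range(ev2)     # smallest value slightly negative: check fails
min(ev2) > 0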

By the definition of positive definiteness, such a matrix must have strictly positive eigenvalues. Such matrices are invertible (so we can get the covariance matrix). Matrices that are not positive definite are not always invertible. As the Hessian should be invertible, the fact that the one for your second model may not be is an indication that something is not right with the model.
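Written out, this is just the eigenvalue characterisation of positive definiteness for a symmetric matrix $H$:

$$
H = Q \Lambda Q^{\mathsf T}, \qquad
H \text{ positive definite} \iff \min_i \lambda_i > 0
\;\Longrightarrow\; \det H = \prod_i \lambda_i > 0
\;\Longrightarrow\; H^{-1} = Q \Lambda^{-1} Q^{\mathsf T} \text{ exists.}
$$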

A good starting point for figuring out what is wrong is to simplify the model, as convergence and related problems usually result from asking too much of a given data set. That things go off the rails when you increase k to 17 would be my guess as to the source of the problem here (model complexity, though not restricted to the distributed lag term for temp_mean).
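As a concrete check of whether basis size is the issue, I would refit with deliberately smaller bases and see whether the positive-definiteness message returns. Something along these lines (the particular values of k below are purely illustrative, not a recommendation):

stime2_small <- gam(deaths ~ s(time, k = 200, bs = 'tp') + heap +
                      te(temp_mean, lag, k = c(10, 6), bs = 'tp') +
                      te(precip_daily_total, lag, k = c(7, 6), bs = 'tp'),
                    data = dat, family = nb, method = 'REML', select = TRUE)
gam.check(stime2_small)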

Unless you have many decades of daily data (or equivalently long periods for time units finer than days), it seems implausible in the extreme that you need to start with a basis of 700 (!) functions to estimate the s(time) term. If you were increasing k here just to push the p-value for the s(time) term in the gam.check() output above 0.05, I think that is the wrong strategy, especially as the EDF of the term is much smaller than k'; I suspect all this is indicating is short-scale temporal autocorrelation, which you are unlikely to fix with a smooth of time.

If you have many decades (or the equivalent, many days, months, years) of data, there typically will be sub-units of time you can decompose variation into. For example, if the data are daily over many years, it would be prudent to model the seasonal variation in the response separately from the between-year trend. Hence modelling:

gam(y ~ te(year, doy, bs = c("cr", "cc")),
    family = nb, method = "REML", data = dat,
    knots = list(doy = c(0.5, 366.5)), ....)

would decompose time into seasonal (day of year, doy, though you could use month here if you don't have data on most days of the year) and longer-term trends. The knots are set to allow each day to have its own estimate while the values at either end of the year are constrained to be similar; for monthly data adjust this to something like month = c(0, 12.5).
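Applied to your setup (your dat list already contains year and doy), this might look something like the sketch below, keeping your distributed lag terms; the k values are again illustrative only:

stime3 <- gam(deaths ~ te(year, doy, bs = c("cr", "cc")) + heap +
                te(temp_mean, lag, k = c(10, 6), bs = 'tp') +
                te(precip_daily_total, lag, k = c(7, 6), bs = 'tp'),
              data = dat, family = nb, method = 'REML',
              knots = list(doy = c(0.5, 366.5)), select = TRUE)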

If you had hourly data and expected within-day variation in deaths, you could throw in an s(hod) term with bs = "cc", given that you would then have data at 2300, 0000, and 0100, which should be in some way similar to one another.
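A sketch of what that could look like; hod and dat_hourly are hypothetical here since your data are daily, and the hod knots are just chosen so that hour 23 and hour 0 are treated as neighbours:

m_hourly <- gam(deaths ~ te(year, doy, bs = c("cr", "cc")) + s(hod, bs = "cc"),
                knots = list(doy = c(0.5, 366.5), hod = c(-0.5, 23.5)),
                family = nb, method = "REML", data = dat_hourly)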

For the effects of temp_mean and precip_daily_total, do we expect the relationships with the response to be so non-linear as to need 15 or 17 degrees of freedom? I would have thought that deaths would increase slowly as temperature rose, but that the number of deaths per unit change in temperature would increase more quickly once the temperature got above some threshold. I'm not sure what I would expect for precipitation; extreme rain doesn't necessarily have a physiological effect on the human body, up to a point.