Splines in R – How to Address Scale Issues Using lme4::glmer

Tags: generalized-linear-model, glmm, lme4-nlme, r, splines

Can anyone suggest how to parameterize a Poisson random-intercept model that includes a natural cubic spline term? I've been using glmer for a while and am happy with how I'm specifying the main fixed effects and random intercept, but I get scale warnings related to my spline basis.

I'm modelling counts of 'incidents' as the dependent variable, predicted by counts of observations in demographic categories (let's just use age for this post), with time period entering as a natural cubic spline of months with knots every 6 months, and a random intercept for organisations/clusters. I'm assuming that, since I'm using a log link, I should log-transform my count predictors. A simplified version is:

library(lme4)
library(splines)  # for ns()

mod <- glmer(incidents ~ (1 | org_code)
                  + log(age17)
                  + log(age29)
                  + log(age49)
                  + log(age69)
                  + log(age70)
                  + ns(re_month, knots = seq(6, 54, by = 6))
                  , data = sub
                  , family = poisson(link = log))

I'm afraid I'm unable to share my data, but my log(count) predictors are in the range 4 to 12 on the log scale, and my ns() spline basis columns are in the range -1 to 1.
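One way to verify these ranges directly is to inspect the column ranges of the fixed-effect model matrix of the fitted model, e.g.:

## column ranges of the fixed-effect model matrix of `mod`
X <- getME(mod, "X")
t(apply(X, 2, range))  # one row per column: min and max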

I've followed Ben Bolker's troubleshooting article:
http://rpubs.com/bbolker/lme4trouble1
and don't have singularity problems, mismatched scaled and absolute gradients, etc. I don't think rescaling is appropriate, as I've already transformed to the log scale. Different optimisers give similar results, but I'm still getting:

In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
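For reference, lme4's allFit() is a convenient way to run the comparison across optimisers; a sketch, assuming the fitted model mod from above:

library(lme4)
aa <- allFit(mod)  # refit with every available optimizer
ss <- summary(aa)
ss$fixef           # fixed-effect estimates by optimizer
ss$llik            # log-likelihoods by optimizer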

My questions are:

  • Am I going about this all wrong?

  • Is there another way to parametrize the spline? I've looked at
    gamm4, but I'd rather do it within glmer, as it develops from the
    glm models I started with, and switching is a bit of a leap.

  • Is it reasonable to ignore the warning if I'm otherwise satisfied
    with the convergence? Is it 'safe' to use the AIC and parameter
    estimates from this model?

Edit (at the request of @IWS):

Here is some additional info, as I'm afraid I can't supply the data. Below are the summary stats for the fixed effects. The random intercept is for repeated measures (60 per cluster in most cases) across 138 clusters, coded as a factor.

          vars    n    mean      sd median trimmed     mad  min   max range skew kurtosis    se
incidents    1 8114  666.69  362.51  600.0  625.77  306.90    2  2784  2782 1.35     2.71  4.02
age17        3 8114 1245.34 1050.87  985.0 1056.90  553.01   40  8997  8957 2.99    12.68 11.67
age29        4 8114 2260.12 1268.73 1928.0 2095.78  984.45  259  9367  9108 1.39     2.51 14.08
age49        5 8114 4405.80 2502.30 3719.0 4054.05 1848.80  638 15533 14895 1.31     1.60 27.78
age69        6 8114 7303.33 3772.67 6302.5 6881.60 3438.89 1678 19142 17464 0.90     0.15 41.88
re_month     7 8114   30.51   17.33   31.0   30.52   22.24    1    60    59 0.00    -1.21  0.19

re_month is an integer index of the month from the start of a five-year period, e.g. 1 = Apr-2011, 2 = May-2011, etc., that is used to construct the spline with knots at 6-month intervals. The ns()1…10 variables below are generated by the call to ns() (a reproduction sketch follows the listing).

(Intercept)   1, 1, 1, 1, 1, 1
log(age17)    5.796058, 6.086775, 5.948035, 5.940171, 5.872118, 5.717028
log(age29)    6.555357, 6.432940, 6.327937, 6.340359, 6.595781, 6.597146
log(age49)    7.189168, 7.183871, 7.262629, 7.117206, 7.180070, 7.176255
log(age69)    7.738924, 7.874359, 7.910591, 7.930566, 7.976595, 7.754053
log(age70)    8.761237, 8.800265, 8.674539, 8.700348, 8.646290, 8.620832
ns()1         0.6355301, 0.5711692, 0.4779412, 0.3700073, 0.2615287, ...
ns()2         0.2615741, 0.3703704, 0.4791667, 0.5740741, 0.6412037, ...
ns()3         0.0007716049, 0.0061728395, 0.0208333333, 0.0493827160,...
ns()4         0, 0, 0, 0, 0, 0
ns()5         0, 0, 0, 0, 0, 0
ns()6         0, 0, 0, 0, 0, 0
ns()7         0, 0, 0, 0, 0, 0
ns()8         -0.0260514200, -0.0133383270, -0.0056271067, -0.0016672...
ns()9         0.0781542599, 0.0400149811, 0.0168813201, 0.0050018726,...
ns()10        -0.0521028399, -0.0266766540, -0.0112542134, -0.0033345...
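The basis itself can be reproduced from the knot sequence alone; a quick sketch, assuming months 1–60 as above:

library(splines)
B <- ns(1:60, knots = seq(6, 54, by = 6))
dim(B)    # 60 x 10: nine interior knots give ten basis columns
range(B)  # values roughly in [-1, 1], consistent with the listing above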

Best Answer

This is on the line between a statistical and a computational question, but I'll take a shot at it.

tl;dr: if you have achieved similar results with different optimizers, then you can trust that the numerics of your fit are OK and you don't need to worry about these warnings. But you should check for overdispersion.

  • (agreeing with @IWS) log-transforming your predictors is not necessarily required - this would be done for scientific reasons (e.g. if you had a priori reason to believe that the underlying relationship was a power-law, $y=a x^b$, then you would want to fit a linear relationship $\log(y) = \log(a) + b \log(x)$; if you thought an exponential relationship $y=\exp(a+bx)$ was more reasonable, then you wouldn't log-transform), or (phenomenologically) to improve the linearity of the relationship. If your predictors as well as your response are counts, though, it does seem reasonable to log-transform.
  • if you want to try centering the ageXX parameters on the log scale (which might help a bit with interpretation, although since you have no interactions in your model it will only affect the intercept parameter), you can use log(ageXX/mean(ageXX)). I agree that scaling after logging makes little sense.
  • your spline model seems fine. The major advantage of gamm4 is that you wouldn't have to pick the number of knots yourself (a rough sketch of the switch appears after this list). It might not be that hard to switch, but I don't think it's necessary. If your data are non-uniformly distributed along re_month, you might want to try specifying just the number of knots (via the df argument) and let ns() pick the knot locations itself.
  • you should really check for overdispersion (a quick check is sketched below), and (if necessary) use either (1) an observation-level random effect, (2) lme4::glmer.nb, or (3) glmmTMB
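If you did want to try the gamm4 route, a rough sketch of an equivalent fit (assuming the same data frame sub as in the question) might look like:

library(gamm4)
mod2 <- gamm4(incidents ~ log(age17) + log(age29) + log(age49)
                        + log(age69) + log(age70) + s(re_month),
              random = ~ (1 | org_code),
              data = sub, family = poisson(link = log))
summary(mod2$gam)  # smooth-term summary
summary(mod2$mer)  # underlying mixed-model fit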
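For the overdispersion check, a standard quick diagnostic is the ratio of the Pearson chi-square to the residual degrees of freedom (adapted from the GLMM FAQ; assumes the fitted model mod):

overdisp_fun <- function(model) {
  rdf   <- df.residual(model)
  rp    <- residuals(model, type = "pearson")
  chisq <- sum(rp^2)
  c(chisq = chisq, ratio = chisq / rdf, rdf = rdf,
    p = pchisq(chisq, df = rdf, lower.tail = FALSE))
}
overdisp_fun(mod)  # ratio substantially > 1 suggests overdispersion
## if so, an observation-level random effect is one fix:
## sub$obs <- factor(seq_len(nrow(sub))), then add + (1 | obs)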

The "large eigenvalue" error is triggered if the ratio of the largest to the smallest eigenvalue of the model Hessian (all eigenvalues should be positive) is too large. If you want to investigate further which model components are causing your problem, you can try something like this example:

 library(lme4)
 gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
               data = cbpp, family = binomial)
 evd <- eigen(gm1@optinfo$derivs$Hessian,symmetric=TRUE)
 ## identify params; this works with a single scalar random effect,
 ## would be more complicated with fancier RE model
 cnames <- c("theta",names(fixef(gm1)))
 dimnames(evd$vectors) <- list(cnames,cnames)
 print(evd,digits=3)
 ## eigen() decomposition
 ## $values
 ## [1] 68.01 51.20 28.51 17.95  9.56

 ## $vectors
 ##               theta (Intercept) period2 period3 period4
 ## theta       0.93012      0.3581  0.0278  0.0513 -0.0567
 ## (Intercept) 0.36178     -0.8559 -0.1011 -0.2506  0.2520
 ## period2     0.04846     -0.3050  0.7562  0.4692 -0.3356
 ## period3     0.03953     -0.2022 -0.6434  0.6410 -0.3643
 ## period4     0.00932     -0.0723 -0.0560 -0.5510 -0.8294

Here you would look at the eigenvector corresponding to the largest eigenvalue and hope that something jumps out at you.
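Concretely, that means inspecting the first column of evd$vectors:

## loadings on the eigenvector for the largest eigenvalue;
## large absolute values flag the parameters driving the ill-conditioning
sort(abs(evd$vectors[, 1]), decreasing = TRUE)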