Can anyone suggest how I parameterize a Poisson random-intercept model, with a natural cubic spline function? I've been using glmer
for a while and am happy with how I'm specifying the main fixed effects and random intercept, but I get scale warnings related to my spline basis.
I'm modelling counts of 'incidents' as dependent variable, predicted by counts of observations in demographic categories (lets just use age for this post), with time period as a natural cubic spline of months with knots every 6 months, and a random intercept for organisations/clusters. I'm assuming that, as I'm using a log link function, I should log-transform my count predictors, and a simplified version is:
mod <- glmer(incidents ~ (1|org_code)
+ log(age17)
+ log(age29)
+ log(age49)
+ log(age69)
+ log(age70)
+ ns(re_month, knots=seq(6, 54, 6))
, data=sub
, family=poisson(link=log))
I'm afraid I'm unable to share my data, but my 'log(count)' variables are in the range 4 to 12 on log scale, and my ns()
spline basis columns are in the range -1 to 1.
I've followed Ben Bolker's trouble shooting article:
http://rpubs.com/bbolker/lme4trouble1
and don't have singularity problems, mismatched scaled and absolute gradients etc. I don't think scale
is appropriate, as I've already transformed to log scale. Different optimiser give similar results, but still getting:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
My questions are:
-
Am I going about this all wrong?
-
Is there another way to parametrize the spline? I've looked at
gamm4
, but I'd rather do it withinglmer
, as it develops from
glm
models to start with, and it's a bit of leap. -
Is it reasonable to ignore the warning if I'm otherwise satisfied
with the convergence? Is it 'safe' to use the AIC and parameter
estimates from this?
Edit at the request of @IWS
Here is some additional info, as I'm afraid I can't supply the data. Here is are the summary stats for the fixed effects. The random-intercept is for repeated measure (60 in most cases) in 138 clusters. Coded as a factor.
vars n mean sd median trimmed mad min max range skew kurtosis se
incidents 1 8114 666.69 362.51 600.0 625.77 306.90 2 2784 2782 1.35 2.71 4.02
age17 3 8114 1245.34 1050.87 985.0 1056.90 553.01 40 8997 8957 2.99 12.68 11.67
age29 4 8114 2260.12 1268.73 1928.0 2095.78 984.45 259 9367 9108 1.39 2.51 14.08
age49 5 8114 4405.80 2502.30 3719.0 4054.05 1848.80 638 15533 14895 1.31 1.60 27.78
age69 6 8114 7303.33 3772.67 6302.5 6881.60 3438.89 1678 19142 17464 0.90 0.15 41.88
re_month 7 8114 30.51 17.33 31.0 30.52 22.24 1 60 59 0.00 -1.21 0.19
re_month is an integer of month from the start of a five year period e.g. 1 = Apr-2011, 2 = May-2011 etc. that is used to construct the spline with knots at 6-month intervals. The ns()1…10 variables below are generated by the call to ns().
(Intercept) 1, 1, 1, 1, 1, 1
log(age17) 5.796058, 6.086775, 5.948035, 5.940171, 5.872118, 5.717028
log(age29) 6.555357, 6.432940, 6.327937, 6.340359, 6.595781, 6.597146
log(age49) 7.189168, 7.183871, 7.262629, 7.117206, 7.180070, 7.176255
log(age69) 7.738924, 7.874359, 7.910591, 7.930566, 7.976595, 7.754053
log(age70) 8.761237, 8.800265, 8.674539, 8.700348, 8.646290, 8.620832
ns()1 0.6355301, 0.5711692, 0.4779412, 0.3700073, 0.2615287, ...
ns()2 0.2615741, 0.3703704, 0.4791667, 0.5740741, 0.6412037, ...
ns()3 0.0007716049, 0.0061728395, 0.0208333333, 0.0493827160,...
ns()4 0, 0, 0, 0, 0, 0
ns()5 0, 0, 0, 0, 0, 0
ns()6 0, 0, 0, 0, 0, 0
ns()7 0, 0, 0, 0, 0, 0
ns()8 -0.0260514200, -0.0133383270, -0.0056271067, -0.0016672...
ns()9 0.0781542599, 0.0400149811, 0.0168813201, 0.0050018726,...
ns()10 -0.0521028399, -0.0266766540, -0.0112542134, -0.0033345...
Best Answer
This is on the line between a statistical and a computational question, but I'll take a shot at it.
tl;dr the bottom line is that if you have achieved similar results with different optimizers, then you can trust that the numerics of your fit are OK, and you don't need to worry about these warnings. But you should check for overdispersion.
ageXX
parameters on the log scale (which might help a bit for interpretation, although since you have no interactions in your model it will only affect the intercept parameter), you can uselog(ageXX/mean(ageX))
. I agree that scaling after logging makes little sense.gamm4
is that you wouldn't have to pick the number of knots yourself. It might not be that hard to switch, but I don't think it's necessary. If your data are non-uniformly distributed alongre_month
you might want to try specifying just the number of knots and letns()
pick the knot locations itself.lme4::glmer.nb
or (3)glmmTMB
The "large eigenvalue" error is triggered if the ratio of the largest to the smallest eigenvalue of the model Hessian (all eigenvalues should be positive) is too large. If you want to investigate further which model components are causing your problem, you can try something like this example:
Here you would look at the first eigenvector and hope that something jumped out at you.