For a current piece of work I’m trying to model the probability of tree death for beech trees in a woodland in the UK. I have records of whether trees were alive or dead for 3 different census periods along with data on their diameter and growth rate. Each tree has an ID number so it can be identified at each time interval. However, the census intervals vary so that for the time between one survey and another is either 4, 12 or 18 years. Obviously the longer the census period the greater the probability a tree will have died by the time it is next surveyed. I had problems making a realistic reproducible example so you can find the data here.
The variables in the dataset are:
- ID – Unique ID for tree
- Block – the ID for the 20x20m plot in which the tree was located
- Dead – Status of tree, either dead (1) or alive (0)
- GR – Annual growth rate from previous survey
- DBH – diameter of tree at breast height
- SL – Length of time between censuses in years
Once a tree is recorded as dead it disappears from subsequent surveys.
Ideally I would like to be able to estimate the annual probability of mortality of a tree using information on diameter and growth rate. Having searched around for quite a while I have seen that logistic exposure models appear able to account for differences in census periods by using an altered version of logit link for binomial models as detailed by Ben Bolker here. This was originally used by Shaffer to determine the daily probability of bird nest survival where the age (and therefore exposure) of the nest differed. I've not seen it used outside of the context of models of nest survival but it seems like I should be able to use it to model survival/mortality where the exposure differs.
require(MASS)
logexp <- function(exposure = 1)
{
linkfun <- function(mu) qlogis(mu^(1/exposure))
## FIXME: is there some trick we can play here to allow
## evaluation in the context of the 'data' argument?
linkinv <- function(eta) plogis(eta)^exposure
logit_mu_eta <- function(eta) {
ifelse(abs(eta)>30,.Machine$double.eps,
exp(eta)/(1+exp(eta))^2)
## OR .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
}
mu.eta <- function(eta) {
exposure * plogis(eta)^(exposure-1) *
logit_mu_eta(eta)
}
valideta <- function(eta) TRUE
link <- paste("logexp(", deparse(substitute(exposure)), ")",
sep="")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta,
name = link),
class = "link-glm")
}
At the moment my model looks like this, but I will incorporate more variables as I go along:
require(lme4)
Dead<-read.csv("Stack_dead.csv",)
M1<-glmer(Dead~DBH+(1|ID),data=Dead,family=binomial(logexp(Dead$SL)))
#I use (1|ID) here to account for the repeated measurements of the same individuals
summary(M1)
plot(Dead$DBH,plogis(predict(M1,re.form=NA)))
Primarily I want to know:
- Does the statistical technique I am using to control for the difference in time between census seem sensible? If it isn't, can you think of a better way to deal with the problem?
- If the answer to the first question is yes, is using the inverse logit (plogis) the correct way to get predictions expressed as probabilities?
Thanks in advance for any help!
Best Answer
An alternative and slightly easier approach is to use a conditional log-log link (
cloglog
), which estimates the log-hazard rather than the log-odds of outcomes such as mortality. Copying from rpubs:A very common situation in ecology (and elsewhere) is a survival/binary-outcome model where individuals (each measured once) differ in their exposure. The classical approach to this problem is to use a complementary log-log link. The complementary log-log or "cloglog" function is $C(\mu)=\log(−\log(1−\mu))$; its inverse is $\mu=C^{−1}(\eta)=1−\exp(−\exp(\eta))$. Thus if we expect mortality $\mu_0$ over a period $\Delta t=1$ and the linear predictor $\eta=C^{−1}(\mu_0)$ then $$ C^{−1}(\eta+\log\Delta t)=(1−\exp(−\exp(\eta) \cdot \Delta t)) $$ Some algebra shows that this is equal to $1−(1−\mu_0)^{\Delta t}$, which is what we want.
The function $\exp(−\exp(x))$ is called the Gompertz function (it is also the CDF of the extreme value distribution), so fitting a model with this inverse-link function (i.e. fitting a
cloglog
link to the survival, rather than the mortality, probability) is also called a gompit (or extreme value) regression.To use this approach in R, specify
family=binomial(link="cloglog")
and add a term of the formoffset(log(exposure))
to the formula (alternatively, some modeling functions take offset as a separate argument). For example,where
exposure
is the length of time for which a given individual is exposed to the possibility of dying/failing (e.g., census interval or time between observations or total observation time).You may also want to consider checking the model where
log(exposure)
is included as a covariate rather than an offset - this makes the log-hazard have a $\beta_t \log(t)$ term, or equivalently makes the hazard proportional to $t^{\beta_t}$ rather than to $t$ (I believe this makes the survival distribution Weibull rather than exponential, but I haven't checked that conclusion carefully).Advantages of using this approach rather than Schaffer's power-logistic method:
The only disadvantage I can think of off the top of my head is that people in the nest survival world are more used to Schaffer's method. For a large enough data set you might be able to tell which link actually fits the data better (e.g. fit with both approaches and compare AIC values), but in general I doubt there's very much difference.