Binary Outcome Modelling – Adjusting for Varying Census Intervals

Tags: lme4-nlme, logistic, mortality, r

For a current piece of work I’m trying to model the probability of tree death for beech trees in a woodland in the UK. I have records of whether trees were alive or dead for 3 different census periods, along with data on their diameter and growth rate. Each tree has an ID number so it can be identified at each time interval. However, the census intervals vary, so the time between one survey and the next is either 4, 12 or 18 years. Obviously, the longer the census interval, the greater the probability that a tree will have died by the time it is next surveyed. I had problems making a realistic reproducible example, so you can find the data here.

The variables in the dataset are:

  1. ID – Unique ID for tree
  2. Block – the ID for the 20x20m plot in which the tree was located
  3. Dead – Status of tree, either dead (1) or alive (0)
  4. GR – Annual growth rate from previous survey
  5. DBH – diameter of tree at breast height
  6. SL – Length of time between censuses in years

Once a tree is recorded as dead it disappears from subsequent surveys.

Ideally I would like to be able to estimate the annual probability of mortality of a tree using information on diameter and growth rate. Having searched around for quite a while, I have seen that logistic exposure models appear able to account for differences in census periods by using an altered version of the logit link for binomial models, as detailed by Ben Bolker here. This was originally used by Shaffer to determine the daily probability of bird nest survival, where the age (and therefore exposure) of the nest differed. I've not seen it used outside the context of nest survival models, but it seems like I should be able to use it to model survival/mortality where the exposure differs.

require(MASS)
## Custom "logistic exposure" link (Shaffer 2004), as written by Ben Bolker
logexp <- function(exposure = 1)
{
  ## link function: logit of mu^(1/exposure)
  linkfun <- function(mu) qlogis(mu^(1/exposure))
  ## FIXME: is there some trick we can play here to allow
  ##   evaluation in the context of the 'data' argument?
  ## inverse link: plogis(eta) raised to the power of the exposure
  linkinv <- function(eta) plogis(eta)^exposure
  logit_mu_eta <- function(eta) {
    ifelse(abs(eta) > 30, .Machine$double.eps,
           exp(eta)/(1 + exp(eta))^2)
    ## OR .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
  }
  ## derivative d(mu)/d(eta), required for the IRLS fitting algorithm
  mu.eta <- function(eta) {
    exposure * plogis(eta)^(exposure - 1) *
      logit_mu_eta(eta)
  }
  valideta <- function(eta) TRUE
  link <- paste("logexp(", deparse(substitute(exposure)), ")",
                sep = "")
  structure(list(linkfun = linkfun, linkinv = linkinv,
                 mu.eta = mu.eta, valideta = valideta,
                 name = link),
            class = "link-glm")
}

At the moment my model looks like this, but I will incorporate more variables as I go along:

require(lme4)
Dead <- read.csv("Stack_dead.csv")

# I use (1|ID) here to account for the repeated measurements of the same individuals
M1 <- glmer(Dead ~ DBH + (1 | ID), data = Dead,
            family = binomial(logexp(Dead$SL)))
summary(M1)

plot(Dead$DBH, plogis(predict(M1, re.form = NA)))

Primarily I want to know:

  1. Does the statistical technique I am using to control for the difference in time between censuses seem sensible? If it isn't, can you think of a better way to deal with the problem?
  2. If the answer to the first question is yes, is using the inverse logit (plogis) the correct way to get predictions expressed as probabilities?

Thanks in advance for any help!

Best Answer

An alternative and slightly easier approach is to use a complementary log-log (cloglog) link, which estimates the log-hazard rather than the log-odds of outcomes such as mortality. Copying from rpubs:

A very common situation in ecology (and elsewhere) is a survival/binary-outcome model where individuals (each measured once) differ in their exposure. The classical approach to this problem is to use a complementary log-log link. The complementary log-log or "cloglog" function is $C(\mu)=\log(-\log(1-\mu))$; its inverse is $\mu=C^{-1}(\eta)=1-\exp(-\exp(\eta))$. Thus if we expect mortality $\mu_0$ over a period $\Delta t=1$ and the linear predictor is $\eta=C(\mu_0)$, then $$ C^{-1}(\eta+\log\Delta t)=1-\exp(-\exp(\eta) \cdot \Delta t). $$ Some algebra shows that this is equal to $1-(1-\mu_0)^{\Delta t}$, which is what we want.
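
As a quick numerical check of that algebra (not from the rpubs document; the example values for the annual mortality and the interval length are arbitrary):

mu0     <- 0.05                           # assumed annual mortality probability
delta_t <- 12                             # assumed census interval in years
eta <- log(-log(1 - mu0))                 # cloglog link: eta = C(mu0)
lhs <- 1 - exp(-exp(eta + log(delta_t)))  # inverse cloglog at eta + log(delta_t)
rhs <- 1 - (1 - mu0)^delta_t              # probability of dying within delta_t years
all.equal(lhs, rhs)                       # TRUE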

The function $\exp(-\exp(x))$ is called the Gompertz function (it is also the CDF of the extreme value distribution), so fitting a model with this inverse-link function (i.e. fitting a cloglog link to the survival, rather than the mortality, probability) is also called a gompit (or extreme value) regression.

To use this approach in R, specify family=binomial(link="cloglog") and add a term of the form offset(log(exposure)) to the formula (alternatively, some modeling functions take offset as a separate argument). For example,

glm(surv~x1+x2+offset(log(exposure)),
    family=binomial(link="cloglog"),
    data=my_surv_data)

where exposure is the length of time for which a given individual is exposed to the possibility of dying/failing (e.g., census interval or time between observations or total observation time).
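
Applied to the data set in the question, a minimal sketch might look like the following (the data frame name Dead and the column names Dead, DBH, SL and ID are taken from the question; adjust as needed):

require(lme4)

## cloglog link with log(census interval) as an offset; (1|ID) keeps the
## random intercept for repeated measurements of the same tree
M2 <- glmer(Dead ~ DBH + offset(log(SL)) + (1 | ID),
            family = binomial(link = "cloglog"),
            data   = Dead)
summary(M2)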

You may also want to consider checking the model where log(exposure) is included as a covariate rather than an offset - this makes the log-hazard have a $\beta_t \log(t)$ term, or equivalently makes the hazard proportional to $t^{\beta_t}$ rather than to $t$ (I believe this makes the survival distribution Weibull rather than exponential, but I haven't checked that conclusion carefully).
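
A minimal sketch of that check, reusing the variable names from the question (an offset is just a covariate whose coefficient is fixed at 1, so an estimated coefficient on log(SL) close to 1 supports the simpler offset model):

## log(SL) as an estimated covariate instead of a fixed offset
M3 <- glmer(Dead ~ DBH + log(SL) + (1 | ID),
            family = binomial(link = "cloglog"),
            data   = Dead)
summary(M3)  # is the coefficient on log(SL) close to 1?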

Advantages of using this approach rather than Shaffer's power-logistic method:

  • because the exposure time is incorporated in the offset rather than in the definition of the link function itself, R handles this a bit more gracefully (for example, it will be easier to generate predictions with different exposure times from the original data set; see the sketch after this list).
  • it is slightly older and more widely used in statistics; Googling "cloglog logistic regression" or searching for cloglog on CrossValidated will bring you to more resources.
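
For example, a sketch of how annual mortality probabilities could be obtained from the cloglog model (M2 from the earlier sketch; setting SL to 1 makes the offset log(1) = 0, so the predictions are on a one-year exposure scale):

newdat <- data.frame(DBH = sort(unique(Dead$DBH)), SL = 1)
p_annual <- predict(M2, newdata = newdat, re.form = NA, type = "response")
plot(newdat$DBH, p_annual,
     xlab = "DBH", ylab = "Predicted annual mortality probability")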

The only disadvantage I can think of off the top of my head is that people in the nest survival world are more used to Shaffer's method. For a large enough data set you might be able to tell which link actually fits the data better (e.g. fit with both approaches and compare AIC values), but in general I doubt there's very much difference.
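
A sketch of that comparison, assuming M1 is the power-logistic (logexp) fit from the question and M2 is the cloglog fit sketched above:

AIC(M1, M2)  # same data and response, so the AIC values are directly comparable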
