Solved – Am I interpreting the mixed effects Cox regression models properly? (two-level nested survival data)

cox-modelmixed modelrsurvival

I: Data structure issue (two-level nested hierarchy in survival data)

[This analysis was done in R v4.0.1, using the Coxme package for mixed effects Cox regression models]

I have collected data on ants trying to survive in different thermal conditions, and I hypothesize that the surface area of these ants as well as the chemical compounds they possess determine their survivability. These ants were assayed in groups of 20 (called “Samples”) replicated for each treatment, and these samples came from a few different nests. The response variable (survival, as a Surv(Time,Event) object) is measured for each individual, but each individual’s body size is averaged at the level of “Sample”, and their chemical compound data is averaged at the level of “nest”. This hierarchical structure to the data looks like this:

> head(df, n=10)

##    NEST SAMPLE TIME EVENT       SA     X1675   X1700    X1760
## 1   ABJ   ABJ1   16     1 3.374059 0.3677422 2.46877 1.230814
## 2   ABJ   ABJ1   16     1 3.374059 0.3677422 2.46877 1.230814
## 3   ABJ   ABJ1   16     1 3.374059 0.3677422 2.46877 1.230814
## 4   ABJ   ABJ1   20     1 3.374059 0.3677422 2.46877 1.230814
## 5   ABJ   ABJ1   20     1 3.374059 0.3677422 2.46877 1.230814
## 6   ABJ   ABJ2   16     1 3.184234 0.3677422 2.46877 1.230814
## 7   ABJ   ABJ2   16     1 3.184234 0.3677422 2.46877 1.230814
## 8   ABJ   ABJ2   16     1 3.184234 0.3677422 2.46877 1.230814
## 9   ABJ   ABJ2   16     1 3.184234 0.3677422 2.46877 1.230814
## 10  ABJ   ABJ2   20     1 3.184234 0.3677422 2.46877 1.230814

Here we can see one NEST (ABJ) that multiple SAMPLES were taken from (ABJ1,ABJ2,etc.), and while surface area (SA) is unique to each sample (3.37, 3.18, etc.), the chemical data (X1675, X1700, etc.) is unique to each nest. Because of this structure, I don’t think I can just do a usual Cox Proportional Hazard model (like coxph()).

II: Comparing models, considering random effects

To deal with this nested hierarchy, I tried the following approach of mixed effects model comparison to see if adding my surface area and chemical data as predictor variables would change the AIC scores/log likelihood(loglik) of models allowing each NEST and SAMPLE to have a random intercept. I may not be explaining this well, but here is what that comparison looked like (method based on this YouTube video, and the models are using coxme()):

> anova(m1,m2,m3)

## Analysis of Deviance Table
##  Cox model: response is  Surv(Time, Event)
##  Model 1: ~1 + (1 | nest/Sample)
##  Model 2: ~1 + sa + (1 | nest/Sample)
##  Model 3: ~1 + sa + (1 | nest/Sample) + X1675 + X1880 + X2600 + X2700 +      X2800 + X2900 + X3100
##    loglik   Chisq Df P(>|Chi|)    
## 1 -9946.9                         
## 2 -9946.0  1.7814  1     0.182    
## 3 -9930.6 30.8123  7 6.733e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1 consideres only random effects for each SAMPLE nested within NEST. Model 2 adds on the sample-level metric of surface area, and we see no change in loglik. Model 3 adds on the nest-level chemical data, and we see a significant change to loglik. So, my interpretation of this is that the chemical data, despite being a nest-level metric that is identical across many individuals, still adds meaningful “explanatory power” to the differences we see in survivability. The problem is, I would be satisfied here if this model considered all predictor variables I want to include. There are 7 more chemical compounds I want to include, but coxme() can’t handle too many predictor variables while considering random intercepts, and I’ve read that, as a general rule of thumb, random effects + many predictor variables = less accurate coefficient estimates. Also, while these model comparisons tell us vaguely what predictor variables explain survivability, the hazard ratios that can be generated from Cox Proportional Hazard models (with no random effects) offer the best metrics for judging how each variable influences survivability.

III: The problem (am I following the rules?)

This is where I am least certain about my model formulation + choice. I figured, if a model containing random effects + chemical data is better than a model of just random effects, then the random effects are not critical to consider. Therefore, we can thus examine a model of just the chemical data, and focus on the hazard ratios to estimate how each chemical compound influences ant survivability. So, my questions, stated finally, would be: is it safe to ignore random effects because of my model comparisons so far, or do I need to stick to a model incorporating random effects, thus limiting the amount of predictor variables I want to include?

I can provide more information where needed. Thanks for the help!

Best Answer

Because of this structure, I don’t think I can just do a usual Cox Proportional Hazard model (like coxph()).

That's not necessarily true, and if it is true it might not be for the reason you think. The fact that, for technical reasons, many individuals have the same values for areas and concentrations doesn't by itself rule out a simple coxph() analysis. The problem is that survival among individuals in one nest or sample might differ systematically from survival in another nest or sample under otherwise identical conditions. You need to take such correlations among associated individuals into account.

If you only have "a few" different nests, you might be able to treat nests as fixed effects. Then you could use the standard coxph() function, with a cluster variable handling the correlations within samples. For anything more elaborate, coxme() is a reasonable choice.

I’ve read that, as a general rule of thumb, random effects + many predictor variables = less accurate coefficient estimates.

This depends on the details. If there are more than a few groups, modeling with random effects typically poses less of a problem in this respect than does modeling with fixed effects. With random effects you are mainly modeling the variance among the effects rather than the individual effects themselves, reducing the number of parameter values you have to estimate from the data.

The number of events is typically the major limiting factor for reliably fitting a survival model. The usual rule of thumb is to have about 15 events per coefficient that you are estimating in the model to avoid overfitting.

You have surface area and 14 compounds* that you would like to evaluate as fixed effects, along with some variances among random effects. So if you have something like 300 or more events, you probably would be OK with random effects for samples and nests, and fixed effects for all your compounds and surface area. Yes, the variance of coefficient estimates does increase as you try to estimate more coefficients and thus reduce the residual degrees of freedom, but if you have a large enough event/coefficient ratio to avoid overfitting that shouldn't be a big problem.

I figured, if a model containing random effects + chemical data is better than a model of just random effects, then the random effects are not critical to consider. Therefore, we can thus examine a model of just the chemical data ...

As Robert Long's comment implies, that would not be "following the rules." One way or another, you need to take into account the lack of independence among observations that you have with this type of structure. One can discuss whether that is best done with nests treated as fixed versus random effects, or whether random effects in a mixed model versus clusters in a standard Cox model are the best ways to handle samples. But you have to take the correlations among observations into account somehow.

*A couple of cautions about the chemical-concentration analysis. First, it's not clear when the concentrations are measured. For survival analysis those concentrations should be those present at the event times. If, for example, concentrations are measured at the end of the experiment, then it might be that the deaths are leading to the concentration changes rather than vice-versa. Second, if you have fewer nests than the 14 compounds you want to evaluate, then including all the compounds along with the nests as predictors will lead to a linear dependence that might pose problems with random-effect modeling of nests and almost certainly will be an issue if nests are treated as fixed effects. If you've been having problems with coxme() when trying to include all 14 compounds, that (not some inherent problem with the number of fixed-effect predictors in a mixed-effects model) might be why. You might consider a ridge-regression type of penalization on the concentrations to get around that.

Related Solutions

Solved – Three level mixed-effects (crossed and nested) model in stata

It is difficult to expect that you can fit a nice normal distribution to 9 points, and that's what you are trying to achieve with the doctors and patients. Now, fitting a normal distribution with 2 points for stage is just impossible. I think Stata may not even be able to identify the variance parameter for stage here. You have to model that as a fixed effect. I would go as far as to say that you would want to treat everything as fixed effects, and just run

anova mortality_est doctorsid patient stage

To me, is an example where the complicated model is unnecessary (the perfect is the enemy of the good), and whoever is forcing you to fit the multilevel model to this data must be pretty evil (reviewer, adviser, instructor...).

If your ultimate goal is some sort of doctors' ICC or interrater agreement something, then I could see the point of the exercise, but again with just 9 observations, your variance component estimate will be woefully unstable, and effectively your CI for the ICC would look like (-0.1,0.8) or something like that. In the best case scenario of homoskedastic normal data, it is a (scaled) chi-square with 7 degrees of freedom (I think), so it has a mean of 7, a variance of 14, a standard deviation of 3.74, skewness of 1.07, and an excess kurtosis of 1.71; few statisticians would be happy to bet their publications on a $\chi^2_7$. In the worst case with doctor-patient interactions and heteroskedasticity, you may end up with something closer to chi-square with 1 or 2 d.f., and that isn't something you can really work with as a variance estimate.

Survival Analysis – Why Use Semi-Parametric Models (Cox Proportional Hazards) Instead of Fully Parametric Models?

If you know the parametric distribution that your data follows then using a maximum likelihood approach and the distribution makes sense. The real advantage of Cox Proportional Hazards regression is that you can still fit survival models without knowing (or assuming) the distribution. You give an example using the normal distribution, but most survival times (and other types of data that Cox PH regression is used for) do not come close to following a normal distribution. Some may follow a log-normal, or a Weibull, or other parametric distribution, and if you are willing to make that assumption then the maximum likelihood parametric approach is great. But in many real world cases we do not know what the appropriate distribution is (or even a close enough approximation). With censoring and covariates we cannot do a simple histogram and say "that looks like a ... distribution to me". So it is very useful to have a technique that works well without needing a specific distribution.

Why use the hazard instead of the distribution function? Consider the following statement: "People in group A are twice as likely to die at age 80 as people in group B". Now that could be true because people in group B tend to live longer than those in group A, or it could be because people in group B tend to live shorter lives and most of them are dead long before age 80, giving a very small probability of them dying at 80 while enough people in group A live to 80 that a fair number of them will die at that age giving a much higher probability of death at that age. So the same statement could mean being in group A is better or worse than being in group B. What makes more sense is to say, of those people (in each group) that lived to 80, what proportion will die before they turn 81. That is the hazard (and the hazard is a function of the distribution function/survival function/etc.). The hazard is easier to work with in the semi-parametric model and can then give you information about the distribution.