Discrete-Time Hazard Model – Analyzing Censoring in Time-to-Drop-Out of Clinical Trials

discrete-time survival

I am fitting a discrete-time hazard model in which the outcome is time to dropout from treatment during a 12-week clinical trial examining the effect of an agonist drug on days of illicit drug use. I am examining the influence of nine factors on whether and when participants drop out of treatment. The outcome is a discrete numeric variable, weeks in the trial (min = 1, max = 12).

Following chapter 12 of Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence by Singer and Willett, I am fitting a discrete-time hazard model with a complementary log-log link function. It is essentially a logistic regression model with a cloglog link instead of the usual logit. I chose this link because the outcome is discrete only because of how it was measured, not because time is truly discrete (participants could actually have dropped out on any day between week markers, but we only know the week in which they dropped out).

Based on what I read in the book, I created a person-period dataset that looks like this (I have left out all but one of the predictor columns, as they wouldn't fit on the display):

[Image: excerpt of the person-period dataset]

You can see participant 5 dropped out three weeks into the trial, and hence has three rows, with the event occurring on the third, whereas participant 6 stayed the entire 12 weeks and hence has twelve rows with no event.
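For readers who have not built one before, here is a minimal sketch of how a person-period dataset of this shape could be constructed. It assumes a person-level data frame with one row per participant; the names `personDF`, `weeks`, `dropped`, `ppDF` and `period` are made up for illustration and are not the columns of my actual data.

```r
# Hypothetical person-level data: one row per participant.
# weeks   = last week observed in the trial (1-12)
# dropped = 1 if the participant dropped out in that week, 0 if censored
personDF <- data.frame(id      = c(5, 6),
                       weeks   = c(3, 12),
                       dropped = c(1, 0))

# Repeat each participant's row once for every week they were at risk
idx  <- rep(seq_len(nrow(personDF)), personDF$weeks)
ppDF <- personDF[idx, ]
ppDF$period <- sequence(personDF$weeks)   # 1, 2, ..., weeks within each id

# event is 1 only in the final row of a participant who dropped out
ppDF$event <- as.integer(ppDF$period == ppDF$weeks & ppDF$dropped == 1)

# One indicator column per time period: D1 ... D12
for (j in 1:12) {
  ppDF[[paste0("D", j)]] <- as.integer(ppDF$period == j)
}
rownames(ppDF) <- NULL
```

In this toy version participant 5 contributes three rows with the event on the third, and participant 6 contributes twelve rows with no event, matching the description above.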

I ran the model in R with syntax that looks like this. For non-R users (along with my apologies for parochialism): the model uses reference-level coding and includes all 12 indicator columns, one for each discrete time period, as well as the nine predictors (the -1 is R syntax for 'remove the intercept').

modLogLog <- glm(formula = event ~ D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9 + D10 + D11 + D12 + gender + group + isi_tot + sf_pain + cpq_tot + qcq_tot + dass_tot + grams_per_day + durationRegUse_dec - 1, 
                 family=binomial(link = "cloglog"), 
                 data = treatDF_PP)

The output of the model looks like this

[Image: model output]

That was a lot of preamble, I know. But my question relates to the coefficient for D12. For one, the hazard it implies is tiny (1.82e-08 after reversing the complementary log-log transformation via 1-exp(-exp(foo))), and it is conspicuously non-significant (p=0.98), which is especially noticeable next to the significant coefficients for the other time periods.
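To make the back-transformation concrete, here is a small sketch of the inverse complementary log-log function. The helper name `inv_cloglog` and the values in `eta` are illustrative assumptions, not numbers from the fitted model. With no intercept, back-transforming a period coefficient on its own gives the hazard for that week when all other predictors are zero.

```r
# Inverse of the complementary log-log link: hazard = 1 - exp(-exp(eta)).
# -expm1(-exp(eta)) is the same quantity, written to stay accurate
# when the hazard is very close to zero.
inv_cloglog <- function(eta) -expm1(-exp(eta))

# Hypothetical linear-predictor values (NOT taken from the model output)
eta <- c(-18, -3, 0)
hazard <- inv_cloglog(eta)
print(hazard)
```

A strongly negative coefficient maps to a hazard that is practically zero, which is the pattern described for D12.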

At first I was concerned that I had made some drastic mistake creating the person-period dataset, but I checked and everything looked correct. But this morning at about 4am it came to me: the coefficient might be caused by all participants who stayed in treatment to week 12 being right-censored at week 12. That would mean that, during the week 12 "window", the hazard of dropout is effectively zero (i.e. the dropout 'event' cannot occur in this time period, so there are no 1's in the event column for anyone who has a 1 in the D12 column).
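One way to check this directly is to count events and rows at risk in each period. The sketch below uses a small made-up data frame, `toyPP`, as a stand-in for the real person-period data; only the column names `period` and `event` are assumed.

```r
# Hypothetical stand-in for the person-period data: two participants who
# drop out early and two who complete all 12 weeks without an event.
toyPP <- data.frame(
  period = c(1, 2, 3,   1, 2,   1:12,        1:12),
  event  = c(0, 0, 1,   0, 1,   rep(0, 12),  rep(0, 12))
)
toyPP$period <- factor(toyPP$period, levels = 1:12)

# Number of events observed in each period
eventsByPeriod <- xtabs(event ~ period, data = toyPP)

# Number of person-period rows (participants at risk) in each period
atRisk <- table(toyPP$period)

print(rbind(events = eventsByPeriod, atRisk = atRisk))

# On the real data the equivalent check would be something like:
#   with(treatDF_PP, table(D12, event))
```

If the row for period 12 shows participants at risk but zero events, the model has nothing to estimate a hazard from in that period.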

Which brings me to my two-part question

Was my 4am epiphany right? Is the out-of-place coefficient due to no events occurring during that time period?

and, more importantly

Should I remove the D12 column as a predictor from my model because no events can occur in that week?

Insights much appreciated

Best Answer

Your "epiphany" seems to be correct.

A coefficient that is large in magnitude with an enormous standard error suggests that you are at (or close to) perfect separation, with combinations of predictors that (almost) perfectly determine the outcomes. If your data are structured so that there can be no events during the last time period, then predictions for that last time period are trivial: no "events" as you define them. There is no reason to include that last time period in this situation.
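As a rough illustration on simulated data (not your data), the sketch below fits a cloglog model with and without a final period in which no events can occur. All names (`simPP`, `periodF`, `fitFull`, `fit11`) and the hazards used in the simulation are assumptions made up for the example.

```r
set.seed(42)
n      <- 300
group  <- rbinom(n, 1, 0.5)
hazard <- ifelse(group == 1, 0.05, 0.10)      # assumed weekly dropout hazards
dropWeek <- rgeom(n, hazard) + 1              # week in which dropout would occur
weeks    <- pmin(dropWeek, 12)
dropped  <- as.integer(dropWeek <= 11)        # anyone reaching week 12 is censored

# Person-period data: one row per participant per week at risk
idx   <- rep(seq_len(n), weeks)
simPP <- data.frame(id = idx, group = group[idx], period = sequence(weeks))
simPP$event   <- as.integer(simPP$period == weeks[idx] & dropped[idx] == 1)
simPP$periodF <- factor(simPP$period, levels = 1:12)

# Full model: period 12 has rows at risk but no events
fitFull <- suppressWarnings(
  glm(event ~ periodF + group - 1,
      family = binomial(link = "cloglog"), data = simPP)
)

# Reduced model: drop the period-12 rows and, with them, that indicator
simPP11 <- droplevels(subset(simPP, period < 12))
fit11 <- glm(event ~ periodF + group - 1,
             family = binomial(link = "cloglog"), data = simPP11)

# Period-12 row of the full model: estimate far from zero, huge standard error
print(summary(fitFull)$coefficients["periodF12", ])

# Remaining coefficients side by side
print(cbind(full = coef(fitFull)[names(coef(fit11))], reduced = coef(fit11)))
```

In a setup like this the other coefficients should be essentially unchanged by dropping the last period, because the event-free rows carry almost no information about them. With your variables, the analogous refit would remove `D12` from the formula and restrict the data to rows where `D12 == 0`.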

The difference from the usual discrete-time survival model is that the events you are modeling would be treated as right censoring (loss to follow-up) in other survival models. In those other models, unlike yours, there can be events during the last time period.