Difference Between Discrete Time Proportional Hazards and Logistic Regression

generalized linear model, logistic, modeling, predictive-models, survival

My data consists of one row per person, per month that person was "exposed" to an event. So the month is the discrete time and the row corresponds to one "person-month".

There are a few independent variables, most of which are time-independent: Age at enrolment, Region, Type of individual. The time period (month) is obviously also a variable. The dependent variable is whether or not the event occurred in that month. If the event doesn't occur in a 48-month window from enrolment, every row has a zero.

Clarification based on a comment below: each individual has only as many rows as the number of months until the event. Individuals who reach month 48 without the event occurring have exactly 48 rows. Also, the event could technically happen after month 48, but the probability of that happening is really, really small, so the data is censored for those folks, but not in a meaningful way. I can essentially treat the 48th month as a hard limit.

The event can only occur once. (i.e. the individual couldn't experience the event then start over, so there can't be two clusters of the same individual.)
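
To make the layout concrete, here is a minimal sketch of how one subject-level record would expand into person-month rows. The function name, the field names, and the two example subjects are made up for illustration; only the 48-month limit and the "one row per month until the event" rule come from the description above.

```python
MAX_MONTH = 48  # observation window from enrolment, as described above

def to_person_months(person_id, months_observed, event_occurred, covariates):
    """Expand one subject into one row per month at risk (hypothetical helper)."""
    rows = []
    last = min(months_observed, MAX_MONTH)
    for month in range(1, last + 1):
        row = {"id": person_id, "month": month, **covariates}
        # The outcome is 1 only in the final row, and only if the event occurred.
        row["event"] = int(event_occurred and month == last)
        rows.append(row)
    return rows

# Subject A has the event in month 5; subject B reaches month 48 without it.
rows_a = to_person_months("A", 5, True, {"age": 40, "region": "north"})
rows_b = to_person_months("B", 48, False, {"age": 55, "region": "south"})

print(len(rows_a), [r["event"] for r in rows_a])
print(len(rows_b), sum(r["event"] for r in rows_b))
```

Each subject contributes at most one row with event = 1, and it is always that subject's last row.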

I am modeling this with logistic regression and getting good results. But, do I need to be worried about the fact that the rows are not completely independent? And is this what the difference is between straight-up logistic regression and discrete time proportional hazards modeling using the logit?

I read about using a generalized linear mixed model instead of a standard GLM, but can't find a clear description of why that would make a difference. (or exactly what time as a random effect really means.) I also found descriptions where a standard logistic regression model was used.

Also, in my specific case, I'm not sure there is any real theoretical dependence row-to-row, because none of the factors are affected by the individual.

So, any insight would be appreciated.

Best Answer

[I'm rebooting my answer based on our previous discussion and new insights]

I am modeling this with logistic regression and getting good results. But, do I need to be worried about the fact that the rows are not completely independent? And is this what the difference is between straight-up logistic regression and discrete time proportional hazards modeling using the logit?

You don't have to worry about the independence of the rows here, because this analysis has an alternate statistical interpretation which doesn't have this issue. Instead of standard logistic regression, you can think of what you're doing as using a logistic regression solver to fit a discrete time logit PH model.
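
For reference, the model in question can be written as follows. This is the usual textbook formulation of a discrete time logit hazard model; the symbols (h, T, alpha_t, beta, x_i) are notation introduced here and do not appear in the question.

```latex
h_i(t) = \Pr(T_i = t \mid T_i \ge t,\ x_i),
\qquad
\operatorname{logit} h_i(t) = \log\frac{h_i(t)}{1 - h_i(t)} = \alpha_t + x_i^\top \beta,
\qquad t = 1, \dots, 48 .
```

Here h_i(t) is the hazard: the probability that individual i has the event in month t, given no event before month t. The alpha_t terms are the baseline (one intercept per month, or some smoother function of month), and beta shifts the log-odds of the hazard by the same amount in every month.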

This is possible because the likelihood function for logistic regression is identical to that for the PH model if the data is input the way you specified (where each subject has rows until the event occurs or we hit 48 months). So you can use the logistic solver to fit the PH model and get perfectly valid fit coefficients and standard errors. The only difference between the two approaches arises when you need to interpret the results or predict something.
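
The identity can be checked numerically. The sketch below assumes a single scalar covariate and made-up parameter values (alpha, beta), purely for illustration. It computes one subject's log-likelihood twice, once as a survival model and once as a sum of Bernoulli terms over that subject's person-month rows, and then shows the extra step needed for prediction: turning monthly hazards into a survival curve.

```python
import math

def hazard(month, x, alpha, beta):
    """Discrete-time logit hazard: P(event in `month` | no event before)."""
    eta = alpha[month - 1] + beta * x
    return 1.0 / (1.0 + math.exp(-eta))

def survival_loglik(t, event, x, alpha, beta):
    """Log-likelihood of one subject, written as a survival model."""
    # Survive months 1..t-1, then either fail in month t or survive it too.
    ll = sum(math.log(1.0 - hazard(j, x, alpha, beta)) for j in range(1, t))
    h_t = hazard(t, x, alpha, beta)
    return ll + (math.log(h_t) if event else math.log(1.0 - h_t))

def logistic_loglik(t, event, x, alpha, beta):
    """Bernoulli log-likelihood summed over the subject's person-month rows."""
    ll = 0.0
    for j in range(1, t + 1):
        y = 1 if (event and j == t) else 0
        p = hazard(j, x, alpha, beta)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll

def survival_curve(x, alpha, beta):
    """S(t) = product over j <= t of (1 - h(j)), for t = 1..len(alpha)."""
    curve, s = [], 1.0
    for j in range(1, len(alpha) + 1):
        s *= 1.0 - hazard(j, x, alpha, beta)
        curve.append(s)
    return curve

# Made-up parameter values: 48 monthly intercepts and one slope.
alpha = [-3.0 + 0.02 * j for j in range(48)]
beta = 0.5

a = survival_loglik(7, True, 1.0, alpha, beta)   # event in month 7
b = logistic_loglik(7, True, 1.0, alpha, beta)
print(abs(a - b) < 1e-9)

curve = survival_curve(1.0, alpha, beta)
print(1.0 - curve[-1])  # probability of the event within 48 months
```

The fitted coefficients describe the per-month hazard, not the probability of ever having the event. A quantity such as "probability of the event within 48 months" has to be built up from the monthly hazards, as in survival_curve.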

A similar phenomenon occurs in continuous time, where a Poisson GLM solver can be used to fit a continuous time proportional hazards model. See here for more info.
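
A sketch of why the Poisson trick works, under the assumption that the hazard is constant over the interval a subject is observed (the piecewise exponential setup, which may not be the exact formulation the answer has in mind). The symbols d_i (event indicator, 0 or 1), t_i (time at risk) and lambda_i (hazard) are notation introduced here.

```latex
\ell_{\text{surv}}(\lambda_i) = d_i \log \lambda_i - \lambda_i t_i ,
\qquad
\ell_{\text{Pois}}(\lambda_i) = d_i \log(\lambda_i t_i) - \lambda_i t_i - \log d_i !
  = \ell_{\text{surv}}(\lambda_i) + d_i \log t_i - \log d_i ! .
```

The second expression treats d_i as a Poisson count with mean lambda_i t_i, which in a GLM means a log link with log t_i as an offset. The two log-likelihoods differ only by terms that do not involve lambda_i, so they give the same estimates and the same standard errors.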

I read about using a generalized linear mixed model instead of a standard GLM, but can't find a clear description of why that would make a difference. (or exactly what time as a random effect really means.) I also found descriptions where a standard logistic regression model was used.

Mixed models would not be useful for your situation. You might use them if you had a more complex longitudinal dataset and wanted to account for the correlation between multiple events for one individual. In that case the individual's identity (or membership in some group) would be the random effect; time would not be. In any case, you only have one event per individual, so you don't need mixed models.
