Solved – parametric survival regression and discrete time survival regression

discrete time, nonparametric, panel data, stata, survival

I would greatly appreciate it if you could tell me how to choose among parametric distributions (gamma, Weibull, lognormal, loglogistic, etc.) for panel (time-series cross-sectional) survival analysis or discrete-time survival analysis in Stata 14.

I read these materials but they are about continuous time survival analysis:

http://spia.uga.edu/faculty_pages/rbakker/pols8501/OxfordTwoNotes.pdf

Then I tried to carry out the LR test explained on page 22 of the second note in order to obtain a p-value. However, I am not sure how to proceed.

Survival Distribution                    AIC         BIC         Log-Likelihood    df
Exponential-Proportional Hazard:         433.663     471.1031    -209.83151        7
Exponential-Accelerated Failure Time:    433.663     471.1031    -209.83151        7

Lognormal-Proportional Hazard:           377.6502    420.4389    -180.82508        8
Loglogistic-Proportional Hazard:         377.874     420.6627    -180.93701        8

Gamma-Proportional Hazard:      cannot compute an improvement -- discontinuous region encountered
Weibull-Proportional Hazard:    cannot compute an improvement -- discontinuous region encountered

Weibull-Accelerated Failure Time:        205.8869    248.6756    -94.943472        8
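
For reference, the AIC column can be reproduced from the reported log-likelihoods and df, and a likelihood-ratio test can be formed for a nested pair. The sketch below is in Python rather than Stata and uses only the numbers in the table. It assumes the exponential and Weibull fits use the same data and the same likelihood scale (the exponential is the Weibull with its shape parameter fixed at 1, hence one extra parameter). Lognormal and loglogistic are not nested within the Weibull, so for those only AIC/BIC comparisons apply.

```python
import math

# Log-likelihoods and parameter counts copied from the table above.
fits = {
    "exponential": {"loglik": -209.83151, "df": 7},
    "lognormal": {"loglik": -180.82508, "df": 8},
    "loglogistic": {"loglik": -180.93701, "df": 8},
    "weibull_aft": {"loglik": -94.943472, "df": 8},
}


def aic(loglik, df):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * df - 2 * loglik


def lr_test_1df(loglik_restricted, loglik_full):
    """LR statistic and p-value when the full model has one extra parameter."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


aic_values = {name: aic(f["loglik"], f["df"]) for name, f in fits.items()}
stat, p_value = lr_test_1df(
    fits["exponential"]["loglik"], fits["weibull_aft"]["loglik"]
)

for name, value in aic_values.items():
    print(f"{name:12s} AIC = {value:.4f}")
print(f"LR statistic (exponential vs Weibull) = {stat:.3f}, p = {p_value:.3g}")
```

A small p-value here would only say that the Weibull shape parameter differs from 1; it does not say that the Weibull is the right distribution overall.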

Besides, I could only test the PH assumption for the Cox model, which is not a panel-data model.

What's more, I couldn't do what is described on pages 24 and 25 of the second Oxford note. When I use the "predict" command, it gives me an array of continuous values even though my dependent variable is discrete.

My data set is as follows. ID identifies the companies in my sample, EVENT indicates whether the company went bankrupt, and x1 to x5 are my independent variables.

ID  TIME    EVENT    x1      x2      x3      x4      x5
1   1        0      1.28    0.02    0.87    1.22    0.06
1   2        0      1.27    0.01    0.82    1.00    -0.01
1   3        0      1.05    -0.06   0.92    0.73    0.02
1   4        0      1.11    -0.02   0.86    0.81    0.08
1   5        1      1.22    -0.06   0.89    0.48    0.01
2   1        0      1.06    0.11    0.81    0.84    0.20
2   2        0      1.06    0.08    0.88    0.69    0.14
2   3        0      0.97    0.08    0.91    0.81    0.17
2   4        0      1.06    0.13    0.82    0.88    0.23
2   5        0      1.12    0.15    0.76    1.08    0.28
2   6        0      1.60    0.26    0.55    1.31    0.37
2   7        0      1.58    0.26    0.56    1.16    0.35
2   8        0      1.54    0.24    0.59    1.08    0.33
2   9        0      1.72    0.22    0.55    0.84    0.29
2   10       0      1.72    0.21    0.53    0.79    0.29
2   11       0      1.63    0.19    0.55    0.73    0.27
2   12       0      2.17    0.32    0.44    0.95    0.43
3   1        0      0.87    -0.03   0.79    0.61    0.00
3   2        1      0.83    -0.14   0.95    0.57    -0.02

Best regards,

Best Answer

First, with at most one bankruptcy event per company, you don't have data for a panel survival model as described on the page linked in your question. Quoting from that page:

Obviously, in survival data, we have repeated observations on the same person because we observed them over a period of time, from onset of risk until failure or the calling off of the data collection effort. Sometimes the multiple observations on a person are explicit; the data themselves contain multiple observations for some or all the individuals. That happens when covariates change over time. Other times, the multiple observations on the individuals are implicit; there is only one physical observation for each, but still that observation records a span of time.

Those kinds of repeated observations have nothing to do with panel data. Panel data arises, for instance, when individuals are from different countries and it was believed that country affects survival. In that case, in a panel-data model, there would be a random effect or, if you prefer, an unobserved latent effect for each country.

We can, however, write models in which the random effect occurs at the individual level if we have repeated failure events for them. (Emphasis added)

For example, if you were studying how quickly different people caught a common cold, which can happen often, you might include as a random effect the different tendencies for individuals to catch a cold, given the covariate values. But with only one event per individual you can't do that. What you have are data arranged in a standard format for survival analysis with time-dependent covariates. From the structure of your data you don't seem to have any random effects.
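
As a rough illustration of that point, the sketch below (Python, standard library only) checks that no company has more than one event and rewrites the person-period rows in the (start, stop, event) layout commonly used for time-dependent covariates. The rows are the ID, TIME and EVENT columns from the question; treating each row as covering the interval (TIME - 1, TIME] is an assumption made for the example, not something stated in the question.

```python
from collections import Counter

# (ID, TIME, EVENT) rows for the three companies shown in the question.
rows = (
    [(1, 1, 0), (1, 2, 0), (1, 3, 0), (1, 4, 0), (1, 5, 1)]
    + [(2, t, 0) for t in range(1, 13)]
    + [(3, 1, 0), (3, 2, 1)]
)


def events_per_company(rows):
    """Count bankruptcy events for each company ID."""
    counts = Counter()
    for company, _time, event in rows:
        counts[company] += event
    return dict(counts)


def to_start_stop(rows):
    """Rewrite each row as (ID, start, stop, event), assuming it covers (TIME - 1, TIME]."""
    return [(company, time - 1, time, event) for company, time, event in rows]


counts = events_per_company(rows)
start_stop = to_start_stop(rows)

# With at most one event per company there are no repeated failures,
# so nothing identifies a company-level random effect.
print("events per company:", counts)
print("first rows in start-stop form:", start_stop[:5])
```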

Second, unless you have a really strong reason to suspect that your survival data take a particular parametric form, you will typically be better off trying the semi-parametric approach of Cox proportional hazards regression rather than a parametric model. With Cox regression you don't need to know how the underlying hazard changes over time, removing an important assumption that you would have to make with a parametric model. Yes, you only have data for one time point per year, but that's no more incompatible with a Cox model than it would be with parametric models.

Third, you do have to be careful in the organization of the data for your model. The covariate values for any time point should represent their status just before the event. So it would be wrong to use year-end covariate values on the same row as bankruptcy events that could have occurred earlier in the year. Depending on the nature of your data you might need to reorganize the data so that covariates best represent predictors of events at the noted times.
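
A minimal sketch of one such reorganization, in Python with only the standard library: each row keeps its ID, TIME and EVENT but takes the covariate value observed one period earlier, and rows with no earlier observation are dropped. The data are the x1 values for companies 1 and 3 from the question; the one-period lag is an assumption chosen for illustration, and the right lag depends on when the covariates are actually measured.

```python
# (ID, TIME, EVENT, x1) for companies 1 and 3, copied from the question.
rows = [
    (1, 1, 0, 1.28), (1, 2, 0, 1.27), (1, 3, 0, 1.05), (1, 4, 0, 1.11), (1, 5, 1, 1.22),
    (3, 1, 0, 0.87), (3, 2, 1, 0.83),
]


def lag_covariate(rows, lag=1):
    """Pair each (ID, TIME, EVENT) with the covariate observed `lag` periods earlier."""
    by_key = {(company, time): x for company, time, _event, x in rows}
    lagged = []
    for company, time, event, _x in rows:
        previous = by_key.get((company, time - lag))
        if previous is not None:  # drop rows without an earlier observation
            lagged.append((company, time, event, previous))
    return lagged


lagged_rows = lag_covariate(rows)
for row in lagged_rows:
    print(row)
```

The lookup is keyed on (ID, TIME), so a lag never reaches across companies.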

Fourth, and perhaps most important, you might want to consider whether and how a survival model is appropriate for your study. For example, the Cox model assumes a basic shared shape of the hazard as a function of time starting from time 0. In clinical studies, time 0 might be the time a patient received a particular treatment. In your case, the implications of assuming a shared hazard-function shape could be quite different in two scenarios: (a) time 0 represents the formation of each company, or (b) time 0 represents, say, the calendar year 2000. You need to think carefully about how such assumptions correspond to what you know about the subject matter.

Update to OP's further questions

Could you please let me know if it is possible to use "cloglog" with both of your methods?

You cannot get an interval-censored model (i.e., a cloglog link function) with static_glm. However, you can use the get_survival_case_weights_and_data function in the same package, as I show in the "Comparing methods for time varying logistic models" vignette, and then use whatever classifier you want, such as glm with a cloglog link function.
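
To make the link function concrete, here is a small Python sketch (standard library only, not the R package mentioned above) of the discrete-time hazard under a cloglog link and the Bernoulli log-likelihood over person-period rows. The intercept and slope values are hypothetical, chosen only so the code runs; the (EVENT, x1) pairs come from companies 1 and 3 in the question.

```python
import math


def cloglog_hazard(eta):
    """Inverse complementary log-log link: h = 1 - exp(-exp(eta))."""
    return -math.expm1(-math.exp(eta))


def logit_hazard(eta):
    """Inverse logit link, shown for comparison."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(rows, intercept, slope, link=cloglog_hazard):
    """Bernoulli log-likelihood over person-period rows of (event, x)."""
    total = 0.0
    for event, x in rows:
        h = link(intercept + slope * x)
        total += math.log(h) if event else math.log1p(-h)
    return total


# (EVENT, x1) pairs for companies 1 and 3 from the question.
rows = [(0, 1.28), (0, 1.27), (0, 1.05), (0, 1.11), (1, 1.22), (0, 0.87), (1, 0.83)]

# Hypothetical coefficients, for illustration only.
ll_cloglog = log_likelihood(rows, intercept=-1.0, slope=-0.5)
ll_logit = log_likelihood(rows, intercept=-1.0, slope=-0.5, link=logit_hazard)
print(f"cloglog log-likelihood: {ll_cloglog:.4f}")
print(f"logit log-likelihood:   {ll_logit:.4f}")
```

Under the cloglog link, log(1 - h) equals -exp(eta), so a one-unit shift in the linear predictor multiplies the interval's cumulative hazard by a constant factor. This is why the link corresponds to a proportional hazards model observed in grouped time.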

Is it allowed to use your suggestions if some companies enter the study at time 4, some others at time 7, etc.?

This is called delayed entry. It should not be a problem in a discrete-time default model if your time scale is the calendar date/year.
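
A minimal Python sketch of why this works on a calendar time scale: each company contributes person-period rows only for the periods in which it is observed, so it is in the risk set only from its entry time onward. The three spells below are hypothetical values made up for illustration.

```python
# (company, entry_time, exit_time, event_at_exit); hypothetical values.
spells = [
    ("A", 1, 5, 1),  # observed from time 1, bankrupt at time 5
    ("B", 4, 9, 0),  # enters at time 4, still alive at time 9 (censored)
    ("C", 7, 8, 1),  # enters at time 7, bankrupt at time 8
]


def expand_to_person_period(spells):
    """One row per company per observed period; event is 1 only in the exit period."""
    rows = []
    for company, entry, exit_time, event in spells:
        for t in range(entry, exit_time + 1):
            rows.append((company, t, int(event == 1 and t == exit_time)))
    return rows


def risk_set_sizes(rows):
    """Number of companies under observation in each calendar period."""
    sizes = {}
    for _company, t, _event in rows:
        sizes[t] = sizes.get(t, 0) + 1
    return sizes


rows = expand_to_person_period(spells)
sizes = risk_set_sizes(rows)
for t in sorted(sizes):
    print(f"time {t}: {sizes[t]} at risk")
```

Company B adds nothing to the likelihood before time 4, which is the discrete-time handling of delayed entry.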

Really, I want to predict bankruptcy using survival analysis, so my covariates should be lagged, for example by one year.

Yes, you need to lag your covariates.

When I tried logistic regression in Python (sklearn), the solver "sag" had better performance. Is it allowed to use this solver with your suggestions? Thanks a lot.

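"sag" is a solver (stochastic average gradient) rather than a model; scikit-learn's logistic regression is L2-penalized by default, so you would be fitting a penalized logistic model. It should not be a problem if you set up your data correctly.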
