I'm reluctant to code dummy variables myself, as you seem to have done, when R can do it for me; it's too easy to make an unexpected error. So rule out a coding problem first.
In your case, with categorical variables A and B, set them as factor variables with 3 levels each, and make sure the baseline level for each (A1 or B1) is set as the reference level of the factor (e.g., relevel(A, ref = "A1")). Then your survival analysis takes the general form:
coxph(Surv(daysSurvival, status) ~ A + B + ..., data = yourData, ...)
The baseline model returned will be for the A1 and B1 reference values, and the coefficients for the other levels of A and B represent the additional hazard with respect to the A1/B1 cases.
At that point, if your model doesn't fit the data well, you will know that your model rather than the coding is inadequate. Most importantly, use the tools available for coxph models to check the validity of the proportional-hazards assumption.
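For example, here's a minimal sketch of that whole workflow, with a simulated stand-in for your data (the column names daysSurvival, status, A, and B are assumptions matching your description):

library(survival)

set.seed(1)
## simulated stand-in for yourData; adjust names to match your actual columns
yourData <- data.frame(
  daysSurvival = rexp(60, rate = 0.01),
  status       = rbinom(60, 1, 0.7),
  A            = sample(c("A1", "A2", "A3"), 60, replace = TRUE),
  B            = sample(c("B1", "B2", "B3"), 60, replace = TRUE)
)

## let R build the dummy variables: factors with A1 and B1 as reference levels
yourData$A <- relevel(factor(yourData$A), ref = "A1")
yourData$B <- relevel(factor(yourData$B), ref = "B1")

fit <- coxph(Surv(daysSurvival, status) ~ A + B, data = yourData)
summary(fit)   # coefficients for A2, A3, B2, B3 relative to the A1/B1 baseline

## proportional-hazards check
zp <- cox.zph(fit)
zp             # per-term and global tests
plot(zp)       # smoothed scaled Schoenfeld residuals versus time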
Trying to make predictions from parametric models with time-varying covariates is not the easiest way for a beginner to start with survival analysis. A few thoughts.
First, think about why you are modeling.
A) If your main interest is in prediction, there might be no need to force your data into a particular standard parametric functional form like the Weibull. In particular, a Weibull model assumes proportional hazards (PH), so a more general PH model (a semi-parametric Cox model, or a parametric PH model with a flexible baseline survival function) might work better. Alternatively, you could try a non-PH accelerated failure time (AFT) model; only the Weibull family (with its limiting exponential case) satisfies both the PH and AFT assumptions. See these course notes for a succinct introduction to parametric survival modeling.
B) If your main interest is interpretable inference, try to use a model in which the coefficients have reasonable interpretations. In your Weibull model, using the parameterization favored by the flexsurv package, covariates modeled as part of the scale parameter (the default, like for b in your first model) have a nice interpretation in terms of speeding up or slowing down the passage of time. You can model covariates as part of the shape parameter (like shape(a) in your first model), but the interpretation is not so straightforward. With an AFT model, the shape parameter adjusts the width of the associated error distribution of log-survival times around what you would get from just the linear predictor;* a fixed shape is easy to understand in that context. I have a hard time thinking about what allowing covariates to affect the shape would mean in an AFT context.
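To make the two parameterizations concrete, here's a rough flexsurv sketch on toy simulated data (time, status, a, and b are placeholder names for your variables); the anc argument is flexsurv's way of putting covariates on ancillary parameters, equivalent to a shape(a) term in the model formula:

library(flexsurv)   # also loads the survival package

set.seed(1)
dat1 <- data.frame(time = rexp(50), status = rbinom(50, 1, 0.8),
                   a = rnorm(50), b = rnorm(50))

## default: the covariate acts on the Weibull scale parameter
## (speeding up or slowing down the passage of time)
fit_scale <- flexsurvreg(Surv(time, status) ~ b, data = dat1, dist = "weibull")

## covariate on the shape parameter via the ancillary-parameter argument
fit_shape <- flexsurvreg(Surv(time, status) ~ b, anc = list(shape = ~ a),
                         data = dat1, dist = "weibull")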
Second, if only one event of one type per individual is possible, then you don't need to take their id values into account, as explained in the main survival package time-dependence vignette on page 4. If you do have multiple events possible per subject, then you apparently shouldn't use flexsurvreg(). The main flexsurv vignette says (page 3): "The individual survival times are also [assumed] independent, so that flexsurv does not currently support shared frailty, clustered or random effects models."
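For reference, a minimal sketch of the counting-process layout those vignettes describe, with made-up numbers: one row per (tstart, tstop] interval per subject, a covariate b that can change between a subject's intervals, and no id term in the model formula:

library(survival)

dcp <- data.frame(
  id     = c(1, 1, 2, 3, 3, 4, 5, 6),
  tstart = c(0, 30, 0, 0, 20, 0, 0, 0),
  tstop  = c(30, 90, 60, 20, 75, 40, 55, 80),
  status = c(0, 1, 1, 0, 1, 0, 1, 0),   # event indicator for each interval
  b      = c(0.2, 0.9, 0.5, 0.1, 0.6, 0.3, 0.8, 0.4)
)

## at most one event per subject, so id itself is not needed in the model
fit_cp <- coxph(Surv(tstart, tstop, status) ~ b, data = dcp)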
Third, things that work in one software package might not work in another. You might have avoided the confusion with differing Weibull parameterizations, but you got caught by your use of a cluster() term in flexsurvreg(). Although it's not supposed to deal with clustered data, in version 2.0 it seems to accept them in a strange way. In other R survival software, cluster() terms affect the coefficient covariance matrix while leaving the coefficient point estimates unaltered. When I added a cluster() term to a flexsurvreg() model, I got a coefficient estimate for cluster(id) and altered values for the other coefficients! I'm not sure how to interpret that. You should omit your cluster() term if you use flexsurv. (That also solves one of your problems.)
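To see the intended behavior of cluster() in the survival package, as opposed to what flexsurvreg() did, here's a quick check using the toy counting-process data dcp sketched above:

library(survival)

## a cluster(id) term changes only the variance (robust, sandwich-type standard
## errors); the coefficient point estimates are unchanged
fit_naive  <- coxph(Surv(tstart, tstop, status) ~ b, data = dcp)
fit_robust <- coxph(Surv(tstart, tstop, status) ~ b + cluster(id), data = dcp)

all.equal(coef(fit_naive), coef(fit_robust))   # TRUE
## summary(fit_robust) adds a robust se; the point estimates match fit_naive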
Fourth, a single plot or a simple test value like AIC isn't sufficient to validate a survival model. For a parametric model, you need to verify that you chose the correct general form; for a PH model, you need to document how well the PH assumption holds. You need to check that you have included an appropriate set of predictors, possibly including interactions among them. For continuous predictors, you need to document that you used an appropriate functional form. For the main survival package there are several types of "residuals" available to help with such checks when event times are censored, but (as you probably know) the standard survreg() function doesn't accept the counting-process data format used for time-varying covariates.
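As one example of such a check, martingale residuals plotted against a continuous predictor can flag a poorly chosen functional form; a sketch with the toy data dcp from above (a real analysis would have far more rows):

library(survival)

fit_cp <- coxph(Surv(tstart, tstop, status) ~ b, data = dcp)
mres   <- residuals(fit_cp, type = "martingale")   # one residual per data row here

plot(dcp$b, mres, xlab = "b", ylab = "martingale residual")
lines(lowess(dcp$b, mres))   # an approximately flat trend supports the linear form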
Harrell's course notes and book show ways to validate both PH and AFT models. To apply those techniques to parametric models from flexsurv or the eha package, which can accept counting-process data, you will have to do some coding. The only "residual" reported directly at this time by either package seems to be the "response" residual in version 2.0 of flexsurv, the difference between predicted* and observed event times. That type of residual can be misleading if used improperly. More useful for many AFT models are the scaled residuals,* provided you also keep the information about censoring versus events at the observation times. That information is in the models, but for now you'll have to pull it out yourself.
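Here's a rough sketch of pulling scaled residuals out of a flexsurv Weibull fit yourself, continuing with fit_scale and dat1 from the earlier sketch (a single covariate b acting on the scale parameter). It relies on flexsurv's default log links for the shape and scale parameters, so treat it as an outline to adapt rather than a recipe:

library(flexsurv)

cf    <- coef(fit_scale)                  # shape and scale are on the log scale
sigma <- exp(-cf["shape"])                # AFT sigma = 1 / Weibull shape
eta   <- cf["scale"] + cf["b"] * dat1$b   # linear predictor for log(time)

scaled_res <- (log(dat1$time) - eta) / sigma
## rows with status == 0 give censored residuals; uncensored residuals should
## roughly follow a standard minimum extreme value (Gumbel) distribution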
Fifth, be very, very careful about trying to make predictions involving time-varying covariates. Although your model was built from data including time-varying covariates, you only specified a single constant set of values for a and b, assumed to hold from time = 0 to time = 0.5, in your example. That's OK. But if you start specifying sets of time-varying covariate values for predictions, you can get into serious logical problems. That's allowed by coxph() models in R, but the author of the Python lifelines package made a deliberate choice not to allow such predictions. It's easy for predictions based on time-dependent covariates to suffer from survivorship bias.
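For that constant-covariate case, prediction can go through summary() on the fitted model; for example, with the hypothetical fit_shape from the earlier sketch (covariates a and b held fixed):

library(flexsurv)

newd <- data.frame(a = 0, b = 0)   # one constant covariate setting
summary(fit_shape, newdata = newd, t = seq(0, 0.5, by = 0.1), type = "survival")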
*AFT models can be written as
$$\log T = X' \beta + \sigma W $$
where $X' \beta$ is the linear predictor based on covariates $X$ and coefficients $\beta$, $W$ is an error term following a specific probability distribution associated with the type of AFT model, and $\sigma$ is a shape parameter. An estimated scaled residual is $(\log T - X' \hat\beta)/\hat\sigma$, which should have the same (censored) distribution as $W$.
If $W$ has a symmetric, zero-mean distribution, then the linear predictor estimates the mean of $\log T$. If $W$ isn't symmetric (as with the minimum extreme value distribution assumed by a Weibull model), then you have to be careful about just what is being "predicted" by a software package. It isn't necessarily the mean (log) survival time. Or it might be. Read the manual carefully.
Best Answer
Several thoughts:
First, even if you could generate a smooth baseline survival curve beyond 130 months, it would be unreliable. That's implicit in the large confidence bands around the survival curve at those times.
Second, the plateau in survival beyond 130 months is at about 85% survival! Furthermore, there are many censoring times along that plateau. It seems that you might need a "cure" model instead, in which a substantial fraction of cases effectively has "infinite" survival. My initial reaction is that your data suggest no hazard of an event beyond about 130 months.
Third, your inability to fit simple accelerated failure time models might be due to the second thought. Such models assume that survival approaches zero at late times. Your data don't.
Fourth, the proportional-hazards assumption makes it relatively straightforward to adjust a baseline hazard for covariate values. See this page among many others. For your approach to a smooth baseline curve, you could fit the estimated cumulative hazard $\hat H_0(t)$ for a defined baseline set of covariate values,* adjust the cumulative hazard for differences of covariate values from baseline, then use the relationship $S(t) = \exp(-H(t))$ to get the survival curve.
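A bare-bones sketch of that adjustment, omitting the smoothing step and illustrated with the survival package's lung data rather than your own:

library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

H0 <- basehaz(fit, centered = FALSE)   # cumulative hazard at age = 0, sex = 0
x_new <- c(age = 60, sex = 1)          # covariate values of interest
lp    <- sum(coef(fit) * x_new)        # log hazard ratio relative to that baseline
S_new <- exp(-H0$hazard * exp(lp))     # S(t) = exp(-H(t))
plot(H0$time, S_new, type = "s", xlab = "time", ylab = "survival")

(For routine use, survfit(fit, newdata = ...) does this for you; the point here is just the arithmetic.)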
Fifth, it's possible to fit an arbitrary smooth baseline hazard as part of the modeling process. For example, the flexsurv package allows for Royston-Parmar flexible spline modeling of baseline hazards; the scale = "hazard" parameter setting does that in a proportional-hazards setting. I suspect that's a better approach than yours of first fitting a Cox model and then smoothing the baseline hazard, but I have no experience with it.
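For example, a minimal sketch with the packaged lung data (your own formula, data, and knot count would differ):

library(flexsurv)

## Royston-Parmar spline for the baseline log cumulative hazard, on the
## proportional-hazards scale, with 2 interior knots
fit_rp <- flexsurvspline(Surv(time, status) ~ age + sex, data = lung,
                         k = 2, scale = "hazard")
plot(fit_rp)   # fitted survival curves overlaid on the nonparametric estimate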
Sixth, even after you've dealt with those issues, you still have the fundamental problem of trying to make predictions based on time-varying covariates. This page and its links show the potentially circular reasoning that can be involved; such predictions might not make sense in some settings.
Seventh, none of this gets beyond the problems raised in the first thought.
*The baseline hazard reported for survival models is sometimes for a set of "average" values that might make no sense in practice when there are categorical covariates. Best to choose your own or to make sure that the software makes interpretable choices.
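With a coxph() fit, for example, you can pass your chosen covariate values to survfit() rather than accepting the default curve (sketched again with the lung data):

library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
base_covs <- data.frame(age = 60, sex = 1)   # your own interpretable baseline
plot(survfit(fit, newdata = base_covs), xlab = "time", ylab = "survival")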