You need to make days_ind
a time-dependent variable. The way you have it coded right now, everybody whose observation (whether event or censoring time) was after 110 days will have experienced a different hazard throughout their entire followup then those whose observation is before 110 days. What you want to have is for the hazard to "jump" at 110 days.
It is not completely straightforward to set up this analysis in the survival
package. You have to split the follow-up period of each person into two periods: up to 110 days and after that. Anybody surviving beyond 110 days would have two observations: one right-censored at 110, and the other left-truncated at 110 and having the actual event on the right side. Fortunately, there is a function to do exactly that: survSplit
.
Here is a quick example with a built-in dataset:
> library(survival)
> aml$id <- 1:nrow(aml) # add a subject ID variable
> aml2 <- survSplit(aml,cut=10,end="time",start="start", event="status", episode="period")
>
> subset(aml, id<=3)
time status x id
1 9 1 Maintained 1
2 13 1 Maintained 2
3 13 0 Maintained 3
> subset(aml2, id<=3)
time status x id start period
1 9 1 Maintained 1 0 0
2 10 0 Maintained 2 0 0
3 10 0 Maintained 3 0 0
25 13 1 Maintained 2 10 1
26 13 0 Maintained 3 10 1
You can see that there are now two observations for id's 2 and 3. The period
variable corresponds to your days_ind
.
From here you can build the model you want, but you have to code the effects carefully, because the effect of period
cannot be estimated, since this it refers to different times.
> fit <- coxph(Surv(start, time, status) ~
I((x=="Maintained")&(period==0)) + I((x=="Maintained")&(period==1)), data=aml2)
> fit
Call:
coxph(formula = Surv(start, time, status) ~ I((x == "Maintained") &
(period == 0)) + I((x == "Maintained") & (period == 1)),
data = aml2)
coef exp(coef) se(coef) z p
I((x == "Maintained") & (period == 0))TRUE -1.498 0.224 1.120 -1.34 0.18
I((x == "Maintained") & (period == 1))TRUE -0.722 0.486 0.591 -1.22 0.22
Likelihood ratio test=3.79 on 2 df, p=0.150 n= 41, number of events= 18
Here the two coefficients measure the effect of Maintained vs Non-maintained before 10 days and after 10 days, respectively.
You could also consider using the cmprsk
package. It is designed for analysis of competing risks, but there is nothing stopping you from using it for only one outcome. The benefit is that it has an easier way of defining time-dependent covariates (though a really awkward syntax overall):
> library(cmprsk)
> fit1 <- with(aml, crr(time, status, cov1=I(x=="Maintained"), cov2=I(x=="Maintained"),
+ tf=function(t)I(t<=10)))
> fit1
convergence: TRUE
coefficients:
I(x == "Maintained")1 I(x == "Maintained")1*tf1
-0.7213 -0.7387
standard errors:
[1] 0.5259 1.1500
two-sided p-values:
I(x == "Maintained")1 I(x == "Maintained")1*tf1
0.17 0.52
Note that with the different coding, the meaning of the coefficients is not exactly the same as above.
It is not clear to me how you generate your event times (which, in your case, might be $<0$) and event indicators:
time = rnorm(n,10,2)
S_prob = S(time)
event = ifelse(runif(1)>S_prob,1,0)
So here is a generic method, followed by some R code.
Generating survival times to simulate Cox proportional hazards models
To generate event times from the proportional hazards model, we can use the inverse probability method (Bender et al., 2005): if $V$ is uniform on $(0, 1)$ and if $S(\cdot \,|\, \mathbf{x})$ is the conditional survival function derived from the proportional hazards model, i.e.
$$
S(t \,|\, \mathbf{x}) = \exp \left( -H_0(t) \exp(\mathbf{x}^\prime \mathbf{\beta}) \vphantom{\Big(} \right)
$$
then it is a fact that the random variable
$$
T = S^{-1}(V \,|\, \mathbf{x}) = H_0^{-1} \left( - \frac{\log(V)}{\exp(\mathbf{x}^\prime \mathbf{\beta})} \right)
$$
has survival function $S(\cdot \,|\, \mathbf{x})$. This result is known as ``the inverse probability integral transformation''. Therefore, to generate a survival time $T \sim S(\cdot \,|\, \mathbf{x})$ given the covariate vector, it suffices to draw $v$ from $V \sim \mathrm{U}(0, 1)$ and to make the inverse transformation $t = S^{-1}(v \,|\, \mathbf{x})$.
Example [Weibull baseline hazard]
Let $h_0(t) = \lambda \rho t^{\rho - 1}$ with shape $\rho > 0$ and scale $\lambda > 0$. Then $H_0(t) = \lambda t^\rho$ and $H^{-1}_0(t) = (\frac{t}{\lambda})^{\frac{1}{\rho}}$. Following the inverse probability method, a realisation of $T \sim S(\cdot \,|\, \mathbf{x})$ is obtained by computing
$$
t = \left( - \frac{\log(v)}{\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})} \right)^{\frac{1}{\rho}}
$$
with $v$ a uniform variate on $(0, 1)$. Using results on transformations of random variables, one may notice that $T$ has a conditional Weibull distribution (given $\mathbf{x}$) with shape $\rho$ and scale $\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})$.
R code
The following R function generates a data set with a single binary covariate $x$ (e.g. a treatment indicator). The baseline hazard has a Weibull form. Censoring times are randomly drawn from an exponential distribution.
# baseline hazard: Weibull
# N = sample size
# lambda = scale parameter in h0()
# rho = shape parameter in h0()
# beta = fixed effect parameter
# rateC = rate parameter of the exponential distribution of C
simulWeib <- function(N, lambda, rho, beta, rateC)
{
# covariate --> N Bernoulli trials
x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))
# Weibull latent event times
v <- runif(n=N)
Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)
# censoring times
C <- rexp(n=N, rate=rateC)
# follow-up times and event indicators
time <- pmin(Tlat, C)
status <- as.numeric(Tlat <= C)
# data set
data.frame(id=1:N,
time=time,
status=status,
x=x)
}
Test
Here is some quick simulation with $\beta = -0.6$:
set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
fit <- coxph(Surv(time, status) ~ x, data=dat)
betaHat[k] <- fit$coef
}
> mean(betaHat)
[1] -0.6085473
Best Answer
Since the hazard depends on the covariates, so does the survival function. The model assumes that the hazard function of an individual with covariate vector $x$ is $$ h(t;x) = h_0(t) e^{\beta'x}. $$ Hence, the cumulative hazard of this individual is $$ H(t;x) = \int_0^t h(u;x) du=\int_0^t h_0(u) e^{\beta'x} du = H_0(t)e^{\beta'x}, $$ where we may define $H_0(t)=\int_0^t h_0(u) du$ as the baseline cumulative hazard. The survival function for an individual with covariate vector $x$ is in turn $$ S(t;x) = e^{-H(t;x)}=e^{-H_0 e^{\beta'x}}=S_0(t)^{e^{\beta'x}} $$ where we define $S_0(t) = e^{-H_0(t)}$ as the baseline survival function.
Given estimates $\hat\beta$ and $\hat S_0(t)$ of the regression coefficients and the baseline survival function, an estimate the survival function for an individual with covariate vector $x$ is given by $\hat S(t;x)=\hat S_0(t)^{e^{\hat\beta'x}}$.
To compute this in R you specify the value of your covariates in the
newdata
argument. For example if you want the survival function for individuals of age=70, doIf you, as you do, omit the
newdata
argument, its default value equals the average values of the covariates in the sample (see?survfit.coxph
). So what is shown in your graph is an estimate of $S_0(t)^{e^{\beta'\bar x}}$.