In R's survival package, there is an optional weights argument you can supply when you fit a Kaplan-Meier curve. I can't find any documentation about what this does or what exactly a weighted Kaplan-Meier estimate is.
I'm aware there are modifications to Kaplan-Meier estimates for ties. For this question, any clarification about the basic estimator is fine:
$$ \hat{S}(t) = \prod_{i: \space t_i \leq t} \bigg(1 - \frac{d_i}{n_i} \bigg) $$
How do weights change the Kaplan-Meier estimate above? I don't see how a weight for every subject, $i$, could enter the estimator above.
If this were a model with iid observations and a well defined likelihood function, I would get parameter estimates by maximizing the log likelihood,
$$\theta^* = \arg\max\limits_\theta \sum_i w_i \log(\mathcal{L}(\theta|x_i)),$$
with a weight $w_i$ for every observation. What is the analog for survival models, where terms in the likelihood are multiplied over the risk set?
Is there an interpretation, in terms of $d_i$ and $n_i$, of how adding a weight $w_i$ changes the Kaplan-Meier estimates?
Best Answer
Weights in a survival model give you flexibility in terms of data formatting, or a way to try to adjust estimates for sampling that wasn't representative. Therneau and Grambsch say, in Section 7.3 of Modeling Survival Data: Extending the Cox Model (Springer, 2000):
For a Kaplan-Meier estimate, you could just weight both the deaths and the numbers at risk at each event time $i$ by the individual case weights. Think, for example, about the first example in the quote above: for a case weight of 2, you just double-count the weighted case in the denominator so long as it is at risk, and give it a count of 2 in the numerator at its event time. I'm not sure that's how it's implemented in the `survival` package; you could check by examining the C source code for `Csurvfitkm`, which does the main calculations. Perhaps Thomas Lumley, who used to maintain the package, could discuss further.

For the Cox partial likelihood solution, it's essentially what you propose. For the unweighted situation, differentiating the log partial likelihood with respect to the parameter-value vector $\theta$ gives a score vector (Equation 3.4 of Therneau and Grambsch):
$$ U(\theta) = \sum_{i=1}^n \int_0^{\infty} \left[X_i(s) - \bar x(\theta,s)\right] dN_i(s) = \sum_{i=1}^n U_i(\theta)$$
where $X_i$ represents the covariate values for case $i$ and $\bar x$ is a risk-weighted mean of $X$ over observations at risk.* The maximum partial likelihood estimator $\hat \theta$ solves:
$$\sum_{i=1}^n U_i(\hat \theta) = 0. $$
With case weights $w_i$, you instead solve:
$$\sum_{i=1}^n w_i U_i(\hat \theta) = 0, $$
while also case-weighting the contributions to $\bar x(\theta,s)$* in the score vector. Handling of variances and the information matrix is similar. See Section 7.3 of Therneau and Grambsch.
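As a rough illustration of the weighted score equation, here is a minimal Python sketch for a single time-fixed covariate. The function name `weighted_score` and the toy data are hypothetical, and this is not how `coxph` is implemented: tied event times are handled naively, with each tied event using the full risk set (Breslow-style) rather than the Efron approximation. The point is only that a case weight of 2 gives the same score as entering that case twice.

```python
import math


def weighted_score(theta, times, events, x, weights=None):
    """Case-weighted Cox score U(theta) for one time-fixed covariate.

    Hypothetical helper for illustration; ties use the full risk set.
    """
    if weights is None:
        weights = [1.0] * len(times)
    total = 0.0
    for t_i, e_i, x_i, w_i in zip(times, events, x, weights):
        if not e_i:
            continue  # censored cases contribute only through risk sets
        num = den = 0.0
        for t_j, x_j, w_j in zip(times, x, weights):
            if t_j >= t_i:  # Y_j(t_i) = 1: case j is still at risk
                wr_j = w_j * math.exp(theta * x_j)  # w_j * r_j
                num += wr_j * x_j
                den += wr_j
        # w_i * [X_i - xbar(theta, t_i)], with xbar itself case-weighted
        total += w_i * (x_i - num / den)
    return total


# Hypothetical toy data: the second case carries a case weight of 2.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
x = [1.0, 0.0, 1.0, 0.0, 1.0]
weights = [1, 2, 1, 1, 1]
u_weighted = weighted_score(0.3, times, events, x, weights)

# The same data with the weight-2 case entered as two separate rows.
u_duplicated = weighted_score(
    0.3,
    [2, 3, 3, 3, 5, 8],
    [1, 1, 1, 0, 1, 0],
    [1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
)
```

Under these assumptions `u_weighted` and `u_duplicated` agree up to floating-point rounding, which is the "double-counting" reading of an integer case weight.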
Note that the `coxph` default Efron approximation for tied event times is implemented via temporary case weights even in an unweighted Cox regression; see Section 5.1 of the main R survival vignette.

Case weights do affect some other calculations in `coxph()`. For example, non-integer case weights (as you might have in inverse propensity score weighting) lead to calculation of a robust variance estimate; see Section 2.7 of that vignette.

*The risk score for case $i$ in a regression without case weights is $r_i(\theta,s) =\exp[\theta' X_i(s)]$. Then the risk-weighted covariate average is:
$$\bar x(\theta,s) = \frac{\sum Y_i(s) r_i(\theta,s)X_i(s)}{\sum Y_i(s) r_i(\theta,s)}, $$
where $Y_i(s)$ is the at-risk indicator for time $s$. In the case-weighted regression, $r_i$ becomes $w_i r_i$.
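Returning to the Kaplan-Meier estimate discussed at the start of this answer, the weighting described there amounts to replacing $d_i$ and $n_i$ in the question's formula with weighted sums over subjects $k$:

$$ \hat{S}_w(t) = \prod_{i: \space t_i \leq t} \bigg(1 - \frac{d_i^w}{n_i^w} \bigg), \qquad d_i^w = \sum_{k \text{ with event at } t_i} w_k, \qquad n_i^w = \sum_k w_k Y_k(t_i). $$

The Python sketch below implements just that scheme. The function name `weighted_km` and the toy data are hypothetical, and I have not checked this against what `survfit` actually does internally. It only shows that a case weight of 2 reproduces the curve you get by entering the same case twice.

```python
from collections import defaultdict


def weighted_km(times, events, weights=None):
    """Return [(t, S(t))] at each distinct event time.

    times   : follow-up time for each subject
    events  : 1 if the subject had the event at that time, 0 if censored
    weights : case weights (defaults to 1 for every subject)
    """
    if weights is None:
        weights = [1.0] * len(times)

    # Weighted number of events d_i^w at each distinct event time.
    d = defaultdict(float)
    for t, e, w in zip(times, events, weights):
        if e:
            d[t] += w

    surv = 1.0
    curve = []
    for t in sorted(d):
        # Weighted number at risk n_i^w: everyone with follow-up time >= t.
        n = sum(w for s, w in zip(times, weights) if s >= t)
        surv *= 1.0 - d[t] / n
        curve.append((t, surv))
    return curve


# Hypothetical toy data: five subjects, the second has a case weight of 2.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
weights = [1, 2, 1, 1, 1]
weighted = weighted_km(times, events, weights)

# The same data with the weight-2 subject entered as two separate rows.
duplicated = weighted_km([2, 3, 3, 3, 5, 8], [1, 1, 1, 0, 1, 0])
```

With non-integer weights (for example, inverse probability weights) the same arithmetic applies, although the usual variance formulas no longer hold without adjustment.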