What does the “weights” argument do when fitting Kaplan-Meier curves with the survival package?


In R's survival package, there is an optional weights argument you can supply when you fit a Kaplan-Meier curve. I can't find any documentation about what this does or what exactly a weighted Kaplan-Meier estimate is.

I'm aware there are modifications to Kaplan-Meier estimates for ties. For this question, any clarification about the basic estimator is fine:
$$ \hat{S}(t) = \prod_{i:\, t_i \leq t} \bigg(1 - \frac{d_i}{n_i} \bigg) $$

How do weights change the Kaplan-Meier estimate above? I don't see how a weight for every subject, $i$, could enter the model above.

If this were a model with iid observations and a well defined likelihood function, I would get parameter estimates by maximizing the log likelihood,
$$\theta^* = \arg\max\limits_\theta \sum_i w_i \log(\mathcal{L}(\theta|x_i)),$$
with a weight $w_i$ for every observation. What is the analog for survival models, where terms in the likelihood are multiplied over the risk set?

Is there an interpretation in terms of $d_i$ and $n_i$ of how adding a weight, $w_i$, changes the Kaplan-Meier estimates?

Best Answer

Weights in a survival model give you flexibility in data formatting, or a way to try to adjust estimates for sampling that wasn't representative. Therneau and Grambsch say, in Section 7.3 of Modeling Survival Data: Extending the Cox Model (Springer, 2000):

Two distinct uses for case weights (among many uses) need to be distinguished. The first is frequency weights; a weight of 3 means that 3 data points were actually observed, had the same values for all variables, and have been collapsed into a single observation to save space. The program should then treat an observation with a weight of k as if it had appeared k times in the input data set. The second is sampling weights. For instance, if 10% of the high-risk subjects for a condition were included in a study but only 1% of those with low or moderate risk, we would want to weight the observations inversely as the sampling fractions to reflect this design, giving case weights 10 times greater to the low/moderate-risk individuals than to the high-risk ones.
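The arithmetic behind the sampling-weight example in the quote can be sketched in a few lines of Python (used here only for illustration; the group labels are hypothetical names, and only the two sampling fractions come from the quote). Weighting "inversely as the sampling fractions" gives weights of 1/0.10 and 1/0.01, whose ratio is the factor of 10 the authors mention.

```python
# Sampling fractions from the quoted example; the dictionary keys are made-up labels.
sampling_fraction = {"high_risk": 0.10, "low_moderate_risk": 0.01}

# Inverse-probability-of-sampling case weights: each sampled subject "stands for"
# 1 / fraction subjects in the population.
case_weight = {group: 1.0 / frac for group, frac in sampling_fraction.items()}

# Relative weight of a low/moderate-risk subject versus a high-risk subject.
weight_ratio = case_weight["low_moderate_risk"] / case_weight["high_risk"]
print(case_weight, weight_ratio)
```

Only the relative sizes of such weights affect the weighted point estimates discussed next.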

For a Kaplan-Meier estimate, you could just weight both the deaths and the numbers at risk at each event time $i$ by the individual case weights, so that $d_i$ and $n_i$ become sums of case weights instead of head counts. Think, for example, about the first use in the quote above (frequency weights): for a case weight of 2, you just double-count the weighted case in the denominator so long as it is at risk, and give it a count of 2 in the numerator at its event time. I'm not sure that's how it's implemented in the survival package; you could check by examining the C source code for Csurvfitkm, which does the main calculations. Perhaps Thomas Lumley, who used to maintain the package, could discuss further.
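One way to write that idea down, as a sketch rather than a description of what the survival package actually does: with $T_j$ the observed time, $\delta_j$ the event indicator, and $w_j$ the case weight of subject $j$ (notation introduced here for illustration),

$$ d_i^{w} = \sum_{j:\, T_j = t_i} w_j \delta_j, \qquad n_i^{w} = \sum_{j:\, T_j \geq t_i} w_j, \qquad \hat{S}_w(t) = \prod_{i:\, t_i \leq t} \bigg(1 - \frac{d_i^{w}}{n_i^{w}} \bigg). $$

The Python below implements exactly this formula and nothing more (no variance, no confidence intervals). The function name `weighted_km` and the toy data are made up. It checks the frequency-weight interpretation: fitting collapsed rows with integer weights should give the same curve as fitting the expanded data without weights.

```python
from collections import defaultdict


def weighted_km(times, events, weights=None):
    """Weighted Kaplan-Meier sketch: returns [(t, S(t))] at each distinct event time."""
    if weights is None:
        weights = [1.0] * len(times)
    # Weighted event count d_i at each distinct event time.
    d = defaultdict(float)
    for t, e, w in zip(times, events, weights):
        if e:
            d[t] += w
    surv, curve = 1.0, []
    for t in sorted(d):
        # Weighted number at risk n_i: everyone whose observed time is >= t.
        n_at_risk = sum(w for u, w in zip(times, weights) if u >= t)
        surv *= 1.0 - d[t] / n_at_risk
        curve.append((t, surv))
    return curve


# Toy data (made up): observed time, event indicator (1 = event), frequency weight.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
weights = [1, 2, 1, 3, 1]

# Expand each row into `weight` identical rows, then fit without weights.
rep_times = [t for t, w in zip(times, weights) for _ in range(w)]
rep_events = [e for e, w in zip(events, weights) for _ in range(w)]

collapsed = weighted_km(times, events, weights)
expanded = weighted_km(rep_times, rep_events)
print(collapsed)
print(expanded)
```

The same function runs unchanged with non-integer (sampling) weights, but it only illustrates the point estimate; how to get a sensible variance in that case is a separate question.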

For the Cox partial likelihood solution, it's essentially what you propose. For the unweighted situation, differentiating the log partial likelihood with respect to the parameter-value vector $\theta$ gives a score vector (Equation 3.4 of Therneau and Grambsch):

$$ U(\theta) = \sum_{i=1}^n \int_0^{\infty} \left[X_i(s) - \bar x(\theta,s)\right] dN_i(s) = \sum_{i=1}^n U_i(\theta)$$

where $X_i$ represents the covariate values for case $i$ and $\bar x$ is a risk-weighted mean of $X$ over observations at risk.* The maximum partial likelihood estimator $\hat \theta$ solves:

$$\sum_{i=1}^n U_i(\hat \theta) = 0. $$

With case weights $w_i$, you instead solve:

$$\sum_{i=1}^n w_i U_i(\hat \theta) = 0, $$

while also case-weighting the contributions to $\bar x(\theta,s)$* in the score vector. Handling of variances and the information matrix is similar. See Section 7.3 of Therneau and Grambsch.

Note that the coxph default Efron approximation for tied event times is implemented via temporary case weights even in an unweighted Cox regression; see Section 5.1 of the main R survival vignette.

Case weights do affect some other calculations in coxph(). For example, non-integer case weights (as you might have in inverse propensity score weighting) lead to calculation of a robust variance estimate; see Section 2.7 of that vignette.


*The risk score for case $j$ in a regression without case weights is $r_j(\theta,s) =\exp[\theta' X_j(s)]$. Then the risk-weighted covariate average is:

$$\bar x(\theta,s) = \frac{\sum_j Y_j(s) r_j(\theta,s)X_j(s)}{\sum_j Y_j(s) r_j(\theta,s)}, $$

where $Y_j(s)$ is the at-risk indicator for time $s$ and the sums run over all cases. In the case-weighted regression, $r_j$ becomes $w_j r_j$.
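The case-weighted score can be made concrete with a small Python sketch (illustration only, not the `coxph` code). It assumes a single time-fixed covariate, so the integral against $dN_i(s)$ reduces to one term per observed event, and it handles tied event times in the simple Breslow style rather than with the Efron approximation that `coxph` uses by default. The function name `weighted_score` and the toy data are made up. It checks that integer case weights give the same score as physically replicating the rows.

```python
import math


def weighted_score(theta, times, events, x, weights=None):
    """Case-weighted Cox score U(theta) for one time-fixed covariate (Breslow ties)."""
    if weights is None:
        weights = [1.0] * len(times)
    total = 0.0
    for t_i, e_i, x_i, w_i in zip(times, events, x, weights):
        if not e_i:
            continue  # censored cases add no event term to the score
        # Risk-weighted covariate mean over cases still at risk at t_i,
        # with each risk score exp(theta * x_j) multiplied by its case weight w_j.
        num = den = 0.0
        for t_j, x_j, w_j in zip(times, x, weights):
            if t_j >= t_i:
                r = w_j * math.exp(theta * x_j)
                num += r * x_j
                den += r
        # The event's own contribution is multiplied by its case weight w_i.
        total += w_i * (x_i - num / den)
    return total


# Toy data (made up): time, event indicator, covariate, integer frequency weight.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
x = [1.0, 0.0, 1.0, 0.0, 1.0]
weights = [1, 2, 1, 3, 1]


def replicate(values):
    """Repeat each value as many times as its frequency weight."""
    return [v for v, w in zip(values, weights) for _ in range(w)]


theta = 0.3
u_weighted = weighted_score(theta, times, events, x, weights)
u_replicated = weighted_score(theta, replicate(times), replicate(events), replicate(x))
print(u_weighted, u_replicated)
```

Solving `weighted_score(theta, ...) == 0` for `theta` (for example with a one-dimensional root finder) would give the case-weighted estimate under these simplifying assumptions.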
