Solved – How to compare median survival between groups

multiple-comparisons, survival

I'm looking into median survival using Kaplan-Meier in different states for a type of cancer. There are quite big differences between the states. How can I compare the median survival between all the states and determine which ones are significantly different from the mean median survival across the country?

Best Answer

One thing to keep in mind with the Kaplan-Meier survival curve is that it is basically descriptive and not inferential. It is just a function of the data, with an incredibly flexible model behind it. This is a strength because there are virtually no assumptions that might be broken, but a weakness because it is hard to generalise, and because it fits "noise" as well as "signal". If you want to make an inference, then you basically have to introduce something that is unknown that you wish to know.

Now one way to compare the median survival times is to make the following assumptions:

  1. I have an estimate of the median survival time $t_{i}$ for each state $i$, given by the Kaplan-Meier curve.
  2. I expect the true median survival time, $T_{i}$, to be equal to this estimate: $E(T_{i}|t_{i})=t_{i}$
  3. I am 100% certain that the true median survival time is positive. $Pr(T_{i}>0)=1$
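Assumption 1 needs a median estimate per state. As a minimal sketch (the function name `km_median` and the input layout are my own illustration, not taken from any particular library), the Kaplan-Meier median can be read off as the first event time at which the estimated survival curve drops to 0.5 or below:

```python
import numpy as np

def km_median(durations, events):
    """Kaplan-Meier median: first event time where S(t) <= 0.5.

    durations: follow-up time for each subject
    events:    1 if the event was observed, 0 if the subject was censored
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Only distinct times with at least one observed event change the curve.
    for t in np.unique(durations[events]):
        at_risk = np.sum(durations >= t)             # still under observation at t
        deaths = np.sum((durations == t) & events)   # events exactly at t
        surv *= 1.0 - deaths / at_risk
        if surv <= 0.5:
            return float(t)
    return float("inf")  # curve never reaches 0.5 (heavy censoring)
```

It returns infinity when the curve never reaches 0.5, which can happen with heavy censoring; such a state has no usable $t_{i}$ for the calculations that follow.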

Now the "most conservative" way to use these assumptions is the principle of maximum entropy, so you get:

$$p(T_{i}|t_{i})= K exp(-\lambda T_{i})$$

Where $K$ and $\lambda$ are chosen such that the PDF is normalised, and the expected value is $t_{i}$. Now we have:

$$1=\int_{0}^{\infty}p(T_{i}|t_{i})dT_{i} =K \int_{0}^{\infty}exp(-\lambda T_{i})dT_{i} $$ $$=K \left[-\frac{exp(-\lambda T_{i})}{\lambda}\right]_{T_{i}=0}^{T_{i}=\infty}=\frac{K}{\lambda}\implies K=\lambda $$ and now we have $E(T_{i})=\frac{1}{\lambda}\implies \lambda=t_{i}^{-1}$
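The expected value quoted here comes from one integration by parts, using $K=\lambda$:

$$E(T_{i})=\int_{0}^{\infty}T_{i}\,\lambda exp(-\lambda T_{i})dT_{i}=\left[-T_{i}exp(-\lambda T_{i})\right]_{T_{i}=0}^{T_{i}=\infty}+\int_{0}^{\infty}exp(-\lambda T_{i})dT_{i}=0+\frac{1}{\lambda}$$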

And so you have a set of probability distributions for each state.

$$p(T_{i}|t_{i})= \frac{1}{t_{i}} exp\left(-\frac{T_{i}}{t_{i}}\right)\;\;\;\;\;(i=1,\dots,N)$$

Which give a joint probability distribution of:

$$p(T_{1},T_{2},\dots,T_{N}|t_{1},t_{2},\dots,t_{N})= \prod_{i=1}^{N}\frac{1}{t_{i}} exp\left(-\frac{T_{i}}{t_{i}}\right)$$
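A small sketch of this joint density in Python, working on the log scale to avoid underflow when $N$ is large. The function name `log_joint_density` is just for illustration; `t` holds the Kaplan-Meier medians and `T` a candidate set of true medians.

```python
import numpy as np

def log_joint_density(T, t):
    """Log of prod_i (1/t_i) exp(-T_i/t_i): independent exponentials with means t_i."""
    T = np.asarray(T, dtype=float)
    t = np.asarray(t, dtype=float)
    if np.any(T < 0):
        return -np.inf  # assumption 3: no probability mass on negative medians
    return float(np.sum(-np.log(t) - T / t))
```

Evaluating it at `T = t` gives $-\sum_{i}\log t_{i}-N$, the log density of the "perfect fit" point.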

Now it sounds like you want to test the hypothesis $H_{0}:T_{1}=T_{2}=\dots=T_{N}=\overline{t}$, where $\overline{t}=\frac{1}{N}\sum_{i=1}^{N}t_{i}$ is the mean median survival time. The severe alternative hypothesis to test against is the "every state is a unique and beautiful snowflake" hypothesis $H_{A}:T_{1}=t_{1},\dots,T_{N}=t_{N}$ because this is the most likely alternative, and thus represents the information lost in moving to the simpler hypothesis (a "minimax" test). The measure of the evidence against the simpler hypothesis is given by the odds ratio:

$$O(H_{A}|H_{0})=\frac{p(T_{1}=t_{1},T_{2}=t_{2},\dots,T_{N}=t_{N}|t_{1},t_{2},\dots,t_{N})}{ p(T_{1}=\overline{t},T_{2}=\overline{t},\dots,T_{N}=\overline{t}|t_{1},t_{2},\dots,t_{N})}$$ $$=\frac{ \left[\prod_{i=1}^{N}\frac{1}{t_{i}}\right] exp\left(-\sum_{i=1}^{N}\frac{t_{i}}{t_{i}}\right) }{ \left[\prod_{i=1}^{N}\frac{1}{t_{i}}\right] exp\left(-\sum_{i=1}^{N}\frac{\overline{t}}{t_{i}}\right) } =exp\left(N\left[\frac{\overline{t}}{t_{harm}}-1\right]\right)$$

Where

$$t_{harm}=\left[\frac{1}{N}\sum_{i=1}^{N}t_{i}^{-1}\right]^{-1}\leq \overline{t}$$

is the harmonic mean. Note that the odds will always favour the perfect fit, but not by much if the median survival times are reasonably close. Further, this gives you a direct way to state the evidence of this particular hypothesis test:

assumptions 1-3 give maximum odds of $O(H_{A}|H_{0}):1$ against equal median survival times across all states
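To put numbers on this, here is a minimal sketch of the odds calculation. The function name and the example medians are made up for illustration; only the formula comes from the derivation above.

```python
import numpy as np

def odds_against_equal_medians(t):
    """Odds O(H_A | H_0) = exp(N * (t_bar / t_harm - 1)) for estimated medians t."""
    t = np.asarray(t, dtype=float)
    n = t.size
    t_bar = t.mean()                # arithmetic mean of the medians
    t_harm = n / np.sum(1.0 / t)    # harmonic mean of the medians
    return float(np.exp(n * (t_bar / t_harm - 1.0)))

# Hypothetical medians (in months) for four states
medians = [10.0, 12.0, 11.0, 30.0]
print(odds_against_equal_medians(medians))
```

Identical medians give odds of exactly 1, and the odds grow as the arithmetic mean pulls away from the harmonic mean.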

Combine this with a decision rule, loss function, utility function, etc. which says how advantageous it is to accept the simpler hypothesis, and you've got your conclusion!

There is no limit to the number of hypotheses you can test and give similar odds for. Just change $H_{0}$ to specify a different set of possible "true values". You could do "significance testing" by choosing the hypothesis as:

$$H_{S,i}:T_{i}=t_{i},\;\;T_{j}=T=\overline{t}_{(i)}\;\;(j\neq i),\;\;\;\;\;\overline{t}_{(i)}=\frac{1}{N-1}\sum_{j\neq i}t_{j}$$

So this hypothesis is verbally "state $i$ has a different median survival time, but all other states are the same". You can then redo the odds ratio calculation I did above, although you should be careful about what the alternative hypothesis is. Any one of the choices below is "reasonable" in the sense that it might be a question you are interested in answering (and they will generally have different answers):

  • my $H_{A}$ defined above - how much worse is $H_{S,i}$ compared to the perfect fit?
  • my $H_{0}$ defined above - how much better is $H_{S,i}$ compared to the average fit?
  • a different $H_{S,k}$ - how much is state $k$ "more different" compared to state $i$?
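As a rough sketch of the first comparison in the list, the odds of the perfect fit $H_{A}$ against $H_{S,i}$ reuse the earlier calculation with state $i$ left out, because state $i$ contributes the same factor to both hypotheses and cancels. The function name is hypothetical.

```python
import numpy as np

def log_odds_perfect_vs_single_outlier(t, i):
    """log O(H_A | H_{S,i}): state i is free, every other state sits at the leave-one-out mean."""
    t = np.asarray(t, dtype=float)
    others = np.delete(t, i)     # medians of all states except i
    t_bar_i = others.mean()      # leave-one-out mean of the medians
    # State i cancels between numerator and denominator, so only the others remain.
    return float(np.sum(t_bar_i / others - 1.0))
```

A value near zero means $H_{S,i}$ loses little relative to the perfect fit. The difference between the values for states $i$ and $k$ is the log odds of $H_{S,k}$ against $H_{S,i}$, which addresses the third comparison in the list.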

Now one thing which has been overlooked here is correlation between states: this structure assumes that knowing the median survival time in one state tells you nothing about the median survival time in another state. While this may seem "bad", it is not too difficult to improve on, and the above calculations are good initial results which are easy to calculate.

Adding connections between states will change the probability models, and you will effectively see some "pooling" of the median survival times. One way to incorporate correlations into the analysis is to separate the true survival times into two components, a "common part" or "trend" and an "individual part":

$$T_{i}=T+U_{i}$$

And then constrain the individual part $U_{i}$ to have average zero over all units and unknown variance $\sigma^{2}$, which is integrated out using a prior describing what you know about the individual variability before observing the data (or the Jeffreys prior if you know nothing, and a half-Cauchy prior if Jeffreys causes problems).
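One consequence of the zero-average constraint, written out, is that the common part is just the average of the true medians:

$$\frac{1}{N}\sum_{i=1}^{N}T_{i}=T+\frac{1}{N}\sum_{i=1}^{N}U_{i}=T$$

so each $U_{i}=T_{i}-T$ measures how far state $i$ sits from the national level, which is the quantity the original question asks about.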