Solved – Survival Analysis on Rare Event Data predicts extremely high survival times


I am trying to fit a parametric survival model (an accelerated failure time model) to a data set in which most of the observations are censored. Here are the details about the data:

dataCrimeSpread <- read.table("data.csv",header=TRUE, sep=",")
attach(dataCrimeSpread)

psymbol <- death+1
table(psymbol)

psymbol
      1      2 
 569732   2918

So, there are 569732 censored observations and only 2918 observed events.

A plot of the data looks like:

Then, I try to fit a basic regression model to this (without any covariates) as:

library(survival)
test <- survreg( Surv(time, death) ~ 1, dist="exponential")
summary(test)

The fitted model is:

Call:
survreg(formula = Surv(time, death) ~ 1, dist = "exponential")
            Value Std. Error   z p
(Intercept)    11     0.0185 593 0

Scale fixed at 1 

Exponential distribution
Loglik(model)= -34951.6   Loglik(intercept only)= -34951.6
Number of Newton-Raphson Iterations: 9 
n= 572650

When I try to predict survival times based on this fitted model, I get results that are way higher than the training data.

pred <- predict(test, type="response")
testIndices <- sample(seq_len(nrow(dataCrimeSpread)), nrow(dataCrimeSpread))
predVals <- pred[testIndices]
predVals

Most of the predVals values are:

58566.95

which is significantly higher than the training values.

Best Answer

That sounds right.

Consider that in your data set, it seems that everything with t > 300 is censored. Also, about 99.5% of your data is censored. So a rough estimate of the 0.5th percentile is 300. Of course, the estimated mean survival time (which is what predict with type = 'response' gives you for an exponential fit: exp(intercept), with the median being log(2) times that) will be much larger than the estimated 0.5th percentile.

Whether you trust these estimates of the mean or median is another question: this is making extremely heavy assumptions about the parametric form of the data, which are completely untestable, given that you've seen nothing past the 0.5th percentile. On the other hand, there's no other way to get an estimated mean or median for this data without heavy, untestable assumptions.
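Here is a minimal sketch of why the prediction is so large, using simulated data rather than the asker's file. The sample size, the true mean of 50000, and the censoring time of 300 are assumptions chosen only to mimic the heavy censoring described above. For an exponential model with right censoring, the maximum-likelihood mean is the total observed time divided by the number of events, and that is the value predict(..., type = "response") returns.

```r
library(survival)

set.seed(1)
n         <- 20000     # assumed sample size (much smaller than the real data)
true_mean <- 50000     # assumed true mean survival time
cens_time <- 300       # assumed administrative censoring time

t_event <- rexp(n, rate = 1 / true_mean)
time    <- pmin(t_event, cens_time)          # observed follow-up time
death   <- as.integer(t_event <= cens_time)  # 1 = event observed, 0 = censored

fit <- survreg(Surv(time, death) ~ 1, dist = "exponential")

# Closed-form MLE of the exponential mean under right censoring
mle_mean <- sum(time) / sum(death)

# Model-based predictions for the first observation (all rows are identical)
pred_response <- unname(predict(fit, type = "response")[1])
pred_median   <- unname(predict(fit, type = "quantile", p = 0.5)[1])

c(events = sum(death), mle_mean = mle_mean,
  response = pred_response, median = pred_median)
```

With few events and a lot of accumulated follow-up time, sum(time) / sum(death) necessarily lands far beyond the largest observed time. The same arithmetic fits the output in the question: exp(11) is about 59874, and the intercept is displayed rounded, so a prediction of 58566.95 corresponds to an intercept of about 10.98.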

Related Solutions

R – Relationship between Gumbel and Weibull Distributions and Survival Analysis

The confusion comes from competing definitions of "Gumbel distribution" and competing parameterizations of the Weibull distribution.

(1) It might be best to avoid the term "Gumbel distribution" because it has different interpretations.

One is a maximum extreme value distribution, the definition used in Wikipedia. "This article uses the Gumbel distribution to model the distribution of the maximum value." (Emphasis in original.)

Another is a minimum extreme value distribution, the definition provided by Wolfram. "In this work, the term 'Gumbel distribution' is used to refer to the distribution corresponding to a minimum extreme value distribution." (Emphasis added.) That is used by Mathematica for its GumbelDistribution, which calls the Wikipedia maximum extreme value version the ExtremeValueDistribution.

It's the minimum extreme value version that provides the "standard result" for the association between Weibull and Gumbel distributions. As you used the maximum extreme value version, you got the result that you found.

(2) Continuing from point (1), to make this work you have to alter (a) the relationship between $\alpha$ and $\beta$ to get a mean of 0, and (b) the CDF to match the minimum extreme value Gumbel.

(a) The mean of the minimum extreme value version is $\alpha - \gamma \beta$, where $\gamma$ is Euler's gamma, with $\alpha$ and $\beta$ as represented in the question. That's different from $\alpha + \gamma \beta$ for the maximum extreme value version, as used in the question.

(b) The $q$th quantile (inverse CDF) of the minimum extreme value version is:

$$\alpha +\beta \log (-\log (1-q)).$$

The inverse CDF used in the question's code is for the maximum extreme value version.

I haven't yet done those replacements in the code, but I suspect that (absent other problems) all will then be OK.
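The two replacements in point (2) can be checked numerically in base R. This sketch assumes that the log of a Weibull variable (in the rweibull parameterization) follows the minimum extreme value distribution with $\alpha$ = log(scale) and $\beta$ = 1/shape; the shape and scale values are arbitrary illustrations, not values from the question.

```r
shape <- 2      # arbitrary illustrative Weibull shape
scale <- 10     # arbitrary illustrative Weibull scale

alpha <- log(scale)   # location of log(T)
beta  <- 1 / shape    # scale of log(T)

q <- c(0.05, 0.25, 0.5, 0.75, 0.95)

# (b) quantiles of log(T) from the minimum extreme value formula
q_min_gumbel <- alpha + beta * log(-log(1 - q))
q_log_weib   <- log(qweibull(q, shape = shape, scale = scale))

# (a) mean of log(T): alpha - gamma * beta, with Euler's constant gamma
euler_gamma <- -digamma(1)
set.seed(42)
sim_mean    <- mean(log(rweibull(1e5, shape = shape, scale = scale)))
theory_mean <- alpha - euler_gamma * beta

rbind(q_min_gumbel, q_log_weib)
c(simulated = sim_mean, theoretical = theory_mean)
```

The two quantile rows should agree to floating-point precision, while the simulated mean should only match the theoretical one up to Monte Carlo error.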

(3) The question of "exactly what distribution specification R is using when fitting a Weibull distribution" is not well specified.

R packages can differ in parameterizations, and the same function might use different parameterizations depending on the arguments in the function call. This page provides some examples. Notably, as the manual page for the survreg() function in the survival package explains:

There are multiple ways to parameterize a Weibull distribution. The survreg function embeds it in a general location-scale family, which is a different parameterization than the rweibull function, and often leads to confusion.

survreg's scale  =    1/(rweibull shape)
survreg's intercept = log(rweibull scale)

I don't see any way around these types of confusions, except to be extremely careful in reading specific definitions and manual pages.
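A minimal sketch of the survreg-to-rweibull mapping quoted from the manual page, on simulated and uncensored data. The shape, scale, and sample size are arbitrary assumptions for illustration.

```r
library(survival)

set.seed(123)
shape  <- 2       # rweibull shape used to simulate (assumed)
scale  <- 10      # rweibull scale used to simulate (assumed)
n      <- 5000
time   <- rweibull(n, shape = shape, scale = scale)
status <- rep(1, n)   # every event observed, no censoring

fit <- survreg(Surv(time, status) ~ 1, dist = "weibull")

# Convert survreg's location-scale parameters back to rweibull's
est_shape <- 1 / fit$scale
est_scale <- exp(unname(coef(fit)[1]))

c(shape = est_shape, scale = est_scale)
```

The recovered values should be close to the simulation inputs, up to sampling error.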

Solved – Calculating constant hazards in exponential survival distributions in R using survreg()

Yes, that should be the estimate of the constant hazard rate.

To my understanding, the model is of the form $\log T = \alpha + W$, so $\alpha$ should represent the log of the (population) mean survival time. For an exponential model at least, 1/mean.survival will be the hazard rate, so I believe you're correct. As a result, $\exp(-\hat{\alpha})$ should be the MLE of the constant hazard rate.
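For the intercept-only exponential model, the step from $\hat{\alpha}$ to the hazard can be written out. This is a sketch assuming right censoring, with observed times $t_i$ and event indicators $\delta_i$ (notation introduced here, not in the question) and $\lambda = e^{-\alpha}$ for the constant hazard:

$$\ell(\lambda) = \sum_i \left[ \delta_i \log \lambda - \lambda t_i \right], \qquad \frac{\partial \ell}{\partial \lambda} = \frac{\sum_i \delta_i}{\lambda} - \sum_i t_i = 0 \;\Rightarrow\; \hat{\lambda} = \frac{\sum_i \delta_i}{\sum_i t_i} = \exp(-\hat{\alpha}).$$

So the estimated hazard is the number of observed events divided by the total time at risk, and $\exp(\hat{\alpha}) = 1/\hat{\lambda}$ is the estimated mean survival time.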

Presumably those times are days, in which case that estimate would be the instantaneous hazard rate (on the per-day scale). [Edit: it turns out they're minutes, so substitute "minute" for "day" everywhere above.]
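A minimal sketch of pulling the hazard out of a fitted survreg object, on simulated data. The true rate, the censoring time, and the assumption that times are recorded in minutes are illustrative only. The Wald interval on the log scale is one common choice, not something the answer above prescribes.

```r
library(survival)

set.seed(2024)
n         <- 2000
true_rate <- 0.01                       # assumed hazard per minute
t_event   <- rexp(n, rate = true_rate)
cens_time <- 120                        # assumed censoring at 120 minutes
time      <- pmin(t_event, cens_time)
status    <- as.integer(t_event <= cens_time)

fit <- survreg(Surv(time, status) ~ 1, dist = "exponential")

alpha_hat <- unname(coef(fit)[1])       # log of the mean survival time
se_alpha  <- sqrt(fit$var[1, 1])        # standard error of the intercept

hazard_hat <- exp(-alpha_hat)           # MLE of the constant hazard (per minute)
# Wald interval for alpha mapped to the hazard scale (limits swap under exp(-x))
hazard_ci  <- exp(-(alpha_hat + c(1.96, -1.96) * se_alpha))

hazard_per_day <- hazard_hat * 60 * 24  # same hazard expressed per day

c(hazard = hazard_hat, lower = hazard_ci[1], upper = hazard_ci[2])
```

For an intercept-only exponential fit the standard error of the intercept is 1/sqrt(number of events), which is consistent with the survreg output at the top of this page: 1/sqrt(2918) is about 0.0185.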
