Solved – Survival Analysis on Rare Event Data predicts extremely high survival times

I am trying to fit a parametric survival model (an accelerated failure time model) to a data set in which most observations are censored. Here are the details of the data:

dataCrimeSpread <- read.table("data.csv",header=TRUE, sep=",")
attach(dataCrimeSpread)

psymbol <- death+1
table(psymbol)

psymbol
      1      2 
 569732   2918 

So, there are 569732 censored observations and only 2918 observed events.

A plot of the data looks like:

[Figure: Description of input data (most censored at 300)]

Then, I try to fit a basic regression model to this (without any covariates) as:

library(survival)
test <- survreg( Surv(time, death) ~ 1, dist="exponential")
summary(test)

The fitted model is:

Call:
survreg(formula = Surv(time, death) ~ 1, dist = "exponential")
            Value Std. Error   z p
(Intercept)    11     0.0185 593 0

Scale fixed at 1 

Exponential distribution
Loglik(model)= -34951.6   Loglik(intercept only)= -34951.6
Number of Newton-Raphson Iterations: 9 
n= 572650 

When I try to predict survival times from this fitted model, I get results that are far higher than anything in the training data.

pred <- predict(test, type="response")
# shuffle the row indices of the training data frame
testIndices <- sample(1:nrow(dataCrimeSpread), nrow(dataCrimeSpread))
predVals <- pred[testIndices]
predVals

Most of the predVals values are:

58566.95

which is significantly higher than the training values.

Best Answer

That sounds right.

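Consider that in your data set, it seems that everything with t > 300 is censored. Also, about 99.5% of your data is censored. So a rough estimate of the 0.5th percentile of the survival-time distribution is 300. The value that predict with type = 'response' gives you is exp(Intercept), which for an exponential model is the estimated mean survival time (the estimated median is log(2), about 0.69, times that). Either way, it will of course be much larger than the estimated 0.5th percentile.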
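To see where a number of that size comes from, here is a minimal simulation sketch. It does not use the original data: the sample size, the true mean of 58000, and the variable names are assumptions chosen only to mimic the question's setup (exponential event times, everything censored at 300). For an intercept-only exponential model, the maximum likelihood estimate of the mean has the closed form "total observed time divided by number of events", and the sketch checks that survreg reproduces it.

```r
library(survival)

set.seed(1)
n         <- 20000    # assumed; far fewer rows than the 572650 in the question
true_mean <- 58000    # assumed; picked to be close to the reported prediction
cutoff    <- 300      # censoring time, as in the question's plot

event_time <- rexp(n, rate = 1 / true_mean)
d <- data.frame(
  time  = pmin(event_time, cutoff),          # observed follow-up time
  death = as.integer(event_time <= cutoff)   # 1 = event observed, 0 = censored
)

fit <- survreg(Surv(time, death) ~ 1, data = d, dist = "exponential")

# Closed-form MLE of the exponential mean under right censoring
closed_form_mean <- sum(d$time) / sum(d$death)

# What type = "response" returns for this model: exp(intercept)
fitted_mean <- unname(exp(coef(fit)))

# Quantiles of the fitted distribution
median_hat <- predict(fit, type = "quantile", p = 0.5)[1]
q005_hat   <- predict(fit, type = "quantile", p = 0.005)[1]

print(c(closed_form_mean = closed_form_mean,
        fitted_mean      = fitted_mean,
        median           = unname(median_hat),
        q0.005           = unname(q005_hat)))
```

In this sketch the fitted mean is in the tens of thousands even though no observed time exceeds 300, while the fitted 0.5th percentile sits near the censoring time. That is the same pattern as in the question.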

Whether you trust these estimates is another question: they rest on extremely heavy assumptions about the parametric form of the distribution, which are completely untestable given that you've seen nothing past the 0.5th percentile. On the other hand, for this data there is no way to estimate the median without heavy, untestable assumptions.
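One way to see how much the answer depends on the assumed distribution is to compare a nonparametric estimate with two parametric fits. The sketch below again uses simulated data (the sample size, the rate, and all variable names are assumptions, not the original data). The Kaplan-Meier curve never drops anywhere near 0.5, so it cannot give a median at all, and the exponential and Weibull fits are free to disagree about the median because both extrapolate far beyond the observed range.

```r
library(survival)

set.seed(2)
n      <- 20000   # assumed sample size for illustration
cutoff <- 300     # censoring time, as in the question's plot

event_time <- rexp(n, rate = 1 / 58000)   # assumed data-generating process
d <- data.frame(
  time  = pmin(event_time, cutoff),
  death = as.integer(event_time <= cutoff)
)

# Nonparametric estimate: only covers the observed range (0, 300]
km <- survfit(Surv(time, death) ~ 1, data = d)
lowest_km_survival <- min(km$surv)
km_median <- unname(quantile(km, probs = 0.5, conf.int = FALSE))  # NA: curve never reaches 0.5

# Two parametric fits to the same data
fit_exp <- survreg(Surv(time, death) ~ 1, data = d, dist = "exponential")
fit_wei <- survreg(Surv(time, death) ~ 1, data = d, dist = "weibull")

median_exp <- unname(predict(fit_exp, type = "quantile", p = 0.5)[1])
median_wei <- unname(predict(fit_wei, type = "quantile", p = 0.5)[1])

print(c(lowest_km_survival = lowest_km_survival,
        km_median          = km_median,
        median_exponential = median_exp,
        median_weibull     = median_wei))
```

Any gap between the two parametric medians comes entirely from the choice of distribution, since the data cannot distinguish between them beyond t = 300.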
