Solved – How to interpret the output of predict() in R survival package

predictionrsurvival

Recently I'm using R survival package to try to predict the probability of people going to churn. I found some examples on stack overflow and also tried that to my own data. Here is my prediction model and output:

> Status_by_Time <- Surv(time = Duration, event = Gift_Status_ind)
> model.fit2 <- survreg(Status_by_Time ~ Age
                + Gender_ind 
                + Fundraiser_ind
                + Monthly.Recurring.Amount
                + Frequency_ind
                + Monthly.first.gift.amount
                + Monthly.last.gift.amount
                #+ Duration
                #+ Saved.
                + Upgrade.first.time
                + Upgrade.second.time,
                dist = "logistic"
)

> summary(model.fit2)
                            Value Std. Error      z         p
(Intercept)                81.525    1.46725  55.56  0.00e+00
Age                         0.156    0.01889   8.27  1.33e-16
Gender_ind2                 2.278    0.55955   4.07  4.68e-05
Gender_ind3                -9.514    1.09689  -8.67  4.18e-18
Fundraiser_ind2            -8.798    0.69303 -12.70  6.25e-37
Fundraiser_ind3             4.028    0.90970   4.43  9.52e-06
Monthly.Recurring.Amount   -1.211    0.04856 -24.95 2.39e-137
Frequency_ind2            257.319    0.00000    Inf  0.00e+00
Frequency_ind3              8.562    2.71423   3.15  1.61e-03
Frequency_ind4            -89.067    1.39379 -63.90  0.00e+00
Monthly.first.gift.amount  -2.538    0.03721 -68.22  0.00e+00
Monthly.last.gift.amount    1.827    0.04981  36.67 2.38e-294
Upgrade.first.time          6.467    0.82381   7.85  4.15e-15
Upgrade.second.time        10.849    2.72927   3.98  7.04e-05
Log(scale)                  2.869    0.00841 341.02  0.00e+00

Scale= 17.6 

Logistic distribution
Loglik(model)= -51841.8   Loglik(intercept only)= -55404
Chisq= 7124.45 on 13 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 8 
n= 18097 

> predicted.values <- predict(model.fit2, newdata = churn.df.trim, type = "quantile", p = (1:9)/10) # 10 times event???
> head(predicted.values)
            [,1]      [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]
 [1,]   2.219425 16.513993 26.01508 33.80343 40.95072 48.09800 55.88635 65.38744 79.68201
 [2,]  11.088257 25.382825 34.88392 42.67227 49.81955 56.96683 64.75518 74.25627 88.55084
 [3,] -11.996590  2.297977 11.79907 19.58742 26.73470 33.88198 41.67033 51.17143 65.46599
 [4,]   5.456971 19.751539 29.25263 37.04098 44.18826 51.33555 59.12390 68.62499 82.91955
 [5,]  19.690749 33.985316 43.48641 51.27476 58.42204 65.56932 73.35767 82.85876 97.15333
 [6,]  -8.187469  6.107099 15.60819 23.39654 30.54382 37.69111 45.47946 54.98055 69.27511

I reckon all these numbers are not probabilities. Is there some way to interpret these numbers or turn them into probabilities? Also if I use p = (1:9)/10 does that mean I'm calculating the probability for the next 9 or 10 period?

Much appreciate if someone could give me a straight forward explanation (none academic one).

Best Answer

You are fitting an Accelerated Failure Time (AFT) model, so all the interpretation should be made in light of this model. I must admit I am have never used this type of models, but perhaps this can help. The model is described as $$\log(T) = \mu_0 - X\beta + \epsilon$$ This means that the prediction is of the answer on the left-hand side. In ?predict.survreg, for type it says that:

This can be on the original scale of the data (response), the linear predictor ("linear", with "lp" as an allowed abbreviation), a predicted quantile on the original scale of the data ("quantile"), a quantile on the linear predictor scale ("uquantile"), or the matrix of terms for the linear predictor ("terms"). At this time "link" and linear predictor ("lp") are identical.

Meaning that with the specification that you made you get quantiles of the prediction on the time-scale. With $p=0.1$ for example, you get the $t$ for which $$P(T\leq t)=0.1$$

Then the question would be why you get negative predictions for some of the quantiles, which seems unreasonable. I think this has something to do with the logistic assumption.