Recently I'm using R survival package to try to predict the probability of people going to churn. I found some examples on stack overflow and also tried that to my own data. Here is my prediction model and output:
> Status_by_Time <- Surv(time = Duration, event = Gift_Status_ind)
> model.fit2 <- survreg(Status_by_Time ~ Age
+ Gender_ind
+ Fundraiser_ind
+ Monthly.Recurring.Amount
+ Frequency_ind
+ Monthly.first.gift.amount
+ Monthly.last.gift.amount
#+ Duration
#+ Saved.
+ Upgrade.first.time
+ Upgrade.second.time,
dist = "logistic"
)
> summary(model.fit2)
Value Std. Error z p
(Intercept) 81.525 1.46725 55.56 0.00e+00
Age 0.156 0.01889 8.27 1.33e-16
Gender_ind2 2.278 0.55955 4.07 4.68e-05
Gender_ind3 -9.514 1.09689 -8.67 4.18e-18
Fundraiser_ind2 -8.798 0.69303 -12.70 6.25e-37
Fundraiser_ind3 4.028 0.90970 4.43 9.52e-06
Monthly.Recurring.Amount -1.211 0.04856 -24.95 2.39e-137
Frequency_ind2 257.319 0.00000 Inf 0.00e+00
Frequency_ind3 8.562 2.71423 3.15 1.61e-03
Frequency_ind4 -89.067 1.39379 -63.90 0.00e+00
Monthly.first.gift.amount -2.538 0.03721 -68.22 0.00e+00
Monthly.last.gift.amount 1.827 0.04981 36.67 2.38e-294
Upgrade.first.time 6.467 0.82381 7.85 4.15e-15
Upgrade.second.time 10.849 2.72927 3.98 7.04e-05
Log(scale) 2.869 0.00841 341.02 0.00e+00
Scale= 17.6
Logistic distribution
Loglik(model)= -51841.8 Loglik(intercept only)= -55404
Chisq= 7124.45 on 13 degrees of freedom, p= 0
Number of Newton-Raphson Iterations: 8
n= 18097
> predicted.values <- predict(model.fit2, newdata = churn.df.trim, type = "quantile", p = (1:9)/10) # 10 times event???
> head(predicted.values)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 2.219425 16.513993 26.01508 33.80343 40.95072 48.09800 55.88635 65.38744 79.68201
[2,] 11.088257 25.382825 34.88392 42.67227 49.81955 56.96683 64.75518 74.25627 88.55084
[3,] -11.996590 2.297977 11.79907 19.58742 26.73470 33.88198 41.67033 51.17143 65.46599
[4,] 5.456971 19.751539 29.25263 37.04098 44.18826 51.33555 59.12390 68.62499 82.91955
[5,] 19.690749 33.985316 43.48641 51.27476 58.42204 65.56932 73.35767 82.85876 97.15333
[6,] -8.187469 6.107099 15.60819 23.39654 30.54382 37.69111 45.47946 54.98055 69.27511
I reckon all these numbers are not probabilities. Is there some way to interpret these numbers or turn them into probabilities? Also if I use p = (1:9)/10
does that mean I'm calculating the probability for the next 9 or 10 period?
Much appreciate if someone could give me a straight forward explanation (none academic one).
Best Answer
You are fitting an Accelerated Failure Time (AFT) model, so all the interpretation should be made in light of this model. I must admit I am have never used this type of models, but perhaps this can help. The model is described as $$\log(T) = \mu_0 - X\beta + \epsilon$$ This means that the prediction is of the answer on the left-hand side. In
?predict.survreg
, fortype
it says that:Meaning that with the specification that you made you get quantiles of the prediction on the time-scale. With $p=0.1$ for example, you get the $t$ for which $$P(T\leq t)=0.1$$
Then the question would be why you get negative predictions for some of the quantiles, which seems unreasonable. I think this has something to do with the logistic assumption.