Yes, I have checked that previous answers to "Ran out of iterations…" questions do not solve my problem.
I have fault data on Firefox: 899 faults and 1395 (estimated) censored faults. The censoring all happens on one of half a dozen start days and half a dozen end days (the initial/final release of a version).
library(survival)
# fault data: one row per (estimated) fault timeline
ff_usage=read.csv("http://www.coding-guidelines.com/R_code/ff_usage.csv", as.is=TRUE)
# counting-process survival object: (start, end] with event indicator
f_sur=Surv(ff_usage$start, ff_usage$end, event=ff_usage$event)
# Kaplan-Meier curve for the whole sample
plot(survfit(f_sur ~ 1))
# Cox model with robust (sandwich) variance clustered on fault_id
f_cox=coxph(f_sur ~ total_usage+cluster(fault_id), data=ff_usage)
The Kaplan-Meier curve looks about right.
total_usage is an estimate of the number of Firefox users up until the fault is reported. It is very time dependent, so each fault timeline is broken up into 7-day intervals, clustered on fault_id (the unsplit original is also available). The coefficient on total_usage (or its log) could be close to 1 (I am hoping for one or the other).
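For readers who want to reproduce the setup, here is a minimal sketch of that kind of 7-day split using survival::survSplit; the column names and values are made up for illustration, not the actual ff_usage data:

library(survival)
# Hypothetical example: split each fault's timeline into 7-day intervals
# so that a time-varying covariate (e.g. weekly usage) can be attached
# to every interval.
faults <- data.frame(fault_id = 1:3,
                     time     = c(20, 35, 50),   # days until report/censoring
                     event    = c(1, 0, 1))
week_cuts <- seq(7, max(faults$time), by = 7)
faults_split <- survSplit(Surv(time, event) ~ ., data = faults,
                          cut = week_cuts, episode = "week")
# The split data can then be modelled with cluster(fault_id) so the
# repeated rows per fault get a robust variance, e.g.
# coxph(Surv(tstart, time, event) ~ total_usage + cluster(fault_id),
#       data = faults_split)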
I have tried setting init and increasing iter.max; also strata(src_id) and subsetting on src_id. Most of the start/end times are estimated and fall at regular intervals, so I have tried adding some randomization, e.g., runif(n, -3, 3). No change.
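For concreteness, the attempts above look roughly like this; the init value, the iter.max setting, and jittering only the end times are illustrative choices, not exactly what was run:

# Sketch only: placeholder values
f_cox <- coxph(f_sur ~ total_usage + strata(src_id) + cluster(fault_id),
               data    = ff_usage,
               init    = 0.001,                          # starting coefficient value
               control = coxph.control(iter.max = 100))  # default is 20

# Jitter the regularly spaced, estimated end times by up to +/- 3 days
# (in practice the jittered end must still exceed the start time):
n <- nrow(ff_usage)
f_sur_jit <- Surv(ff_usage$start, ff_usage$end + runif(n, -3, 3),
                  event = ff_usage$event)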
All I ever see is:
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
Suggestions for things to try welcome.
Best Answer
This may be a case where, as the coxph() documentation page puts it, "the actual MLE estimate of a coefficient is infinity" so that "the associated coefficient grows at a steady pace and a race condition will exist in the fitting routine." In particular, close interrelations of the start/end times with the total_usage variable may be the problem here.

When I have problems with a continuous predictor variable like your total_usage in survival analysis, I examine a split of the continuous variable at the median. Look at survival curves from your data based on a split of total_usage at its median value of $5866.2$ (the coxph() for this simple analysis also didn't converge). It looks like almost all censoring times and events for the low total_usage cases are before something like time=700, while almost all events and censoring times for the high total_usage subset are after that time. Also, examining the data directly along that split (a sketch follows below) may provide some insight. My data sets are typically much smaller than this, but I have run into related problems in Cox analysis with "a dichotomous variable where one of the groups has no events," so that hazard ratios are ill-defined.
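A rough sketch of that median-split check on the linked data; the high_usage grouping variable is one I create here, it is not part of ff_usage:

# Split total_usage at its median and compare the two halves.
ff_usage$high_usage <- ff_usage$total_usage > median(ff_usage$total_usage)

# Kaplan-Meier curves for the low- and high-usage halves:
plot(survfit(f_sur ~ high_usage, data = ff_usage), lty = 1:2)
legend("topright", c("low total_usage", "high total_usage"), lty = 1:2)

# How events and censoring distribute across the split:
with(ff_usage, table(event, high_usage))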
Hope this helps point you in the right direction.