Solved – coxph ran out of iterations and did not converge

cox-model, r, survival

Yes, I have checked that previous answers to "Ran out of iterations…" questions do not solve my problem.

I have fault data on Firefox: 899 faults and 1395 (estimated) censored faults. All of the censoring happens on one of half a dozen start days or half a dozen end days (the initial/final release of a version).

library(survival)

ff_usage=read.csv("http://www.coding-guidelines.com/R_code/ff_usage.csv", as.is=TRUE)

f_sur=Surv(ff_usage$start, ff_usage$end, event=ff_usage$event)
plot(survfit(f_sur ~ 1))
f_cox=coxph(f_sur ~ total_usage+cluster(fault_id), data=ff_usage)

The Kaplan-Meier curve looks about right.

total_usage is an estimate of the number of Firefox users up until the fault is reported. This is strongly time-dependent, so each fault timeline is split into 7-day intervals clustered on fault_id (an unsplit original also exists).

The dependency on total_usage (or its log) could be close to 1 (I am hoping for one or the other).

I have tried setting init and increasing iter.max; also strata(src_id) and subsetting on src_id.
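For readers unfamiliar with these options, here is a minimal sketch of how init and iter.max are usually passed to coxph(). It uses the lung data bundled with the survival package rather than the Firefox data, and the starting value 0.01 and the limit of 50 are arbitrary illustrations, not the values tried above.

```r
library(survival)

# Bundled example data, used only to show the shape of the call.
fit <- coxph(Surv(time, status) ~ age, data = lung,
             init = 0.01,                             # starting value for the age coefficient
             control = coxph.control(iter.max = 50))  # the default iter.max is 20

fit$iter  # number of iterations actually used
```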

Most of the start/end times are estimated and fall at regular intervals. I have tried adding some randomization, e.g., runif(n, -3, 3), with no change.

All I ever see is:

Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights,  :
  Ran out of iterations and did not converge

Suggestions for things to try welcome.

Best Answer

This may be a case where, as the coxph() documentation page puts it, "the actual MLE estimate of a coefficient is infinity" so that "the associated coefficient grows at a steady pace and a race condition will exist in the fitting routine." In particular, close interrelations of the start / end times with the total_usage variable may be the problem here.
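To see what an infinite MLE looks like in practice, here is a small sketch on made-up data (not the Firefox data) in which one group has no events at all. The data frame toy and its columns are invented purely for illustration.

```r
library(survival)

# Synthetic data: every subject in group "a" has an event,
# every subject in group "b" is censored.
set.seed(1)
toy <- data.frame(
  time  = rexp(40),
  event = rep(c(1, 0), each = 20),
  grp   = rep(c("a", "b"), each = 20)
)

# coxph() is expected to warn here; the warning is suppressed so that the
# fitted object can be inspected. The coefficient for grp heads towards
# -Inf and its standard error becomes very large.
toy_fit <- suppressWarnings(coxph(Surv(time, event) ~ grp, data = toy))
coef(toy_fit)
```

A very large coefficient in absolute value, paired with a huge standard error, is the usual sign that the likelihood keeps improving as the coefficient grows, so raising iter.max would not be expected to help.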

When I have problems with a continuous predictor variable like your total_usage in survival analysis, I examine a split of the continuous variable at the median. Look at survival curves from your data based on a split of total_usage at its median value of $5866.2$ (the coxph() for this simple analysis also didn't converge):

plot(survfit(f_sur~(total_usage > 5866.2),data=ff_usage))

It looks like almost all events and censoring times for the low total_usage cases fall before roughly time=700, while almost all of those for the high total_usage subset fall after it. Also, examining:

summary(survfit(f_sur~(total_usage > 5866.2),data=ff_usage))

may provide some insight. My data sets are typically much smaller than this, but I have run into related problems in Cox analysis with "a dichotomous variable where one of the groups has no events," so that hazard ratios are ill-defined.
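The same check can be done with plain tables. The sketch below uses a toy stand-in for ff_usage: the column names mirror the question's data, but the values are made up. Swapping in the real data frame should give the corresponding counts.

```r
# Toy stand-in for ff_usage (values are invented for illustration).
toy <- data.frame(
  end         = c(100, 250, 400, 650, 720, 800, 950, 1100),
  event       = c(1, 0, 1, 1, 0, 0, 1, 0),
  total_usage = c(10, 50, 200, 900, 6000, 7500, 9000, 12000)
)

# Split at the median of total_usage, as in the survfit() calls above.
high <- toy$total_usage > median(toy$total_usage)

# Events versus censored rows on each side of the split.
split_counts <- table(high_usage = high, event = toy$event)
split_counts

# Range of end times on each side of the split. Little or no overlap
# suggests that total_usage largely tracks the time scale itself.
end_ranges <- tapply(toy$end, high, range)
end_ranges
```

A cell with zero events, or end-time ranges that barely overlap, would point to the kind of ill-defined hazard ratio described above.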

Hope this helps point you in the right direction.