We ran a Cox regression and ended up with huge HRs for some of our variables. One of them (an interaction term) gave us an HR = 2747.093. Our dataset consists of only 74 observations. What is wrong?
Solved – Extremely huge Hazard Ratios from Cox regression
cox-model, hazard
Related Solutions
The problem with the Cox model is that it predicts nothing on its own: the "intercept" (the baseline hazard function) in Cox models is never actually estimated. Logistic regression, by contrast, can be used to predict the risk or probability of the event, in this case whether or not a subject comes in to buy something in a given month.
The problem with the assumptions behind ordinary logistic regression is that each person-month observation is treated as independent, regardless of whether the observations come from the same person or the same month. This can be dangerous: some items are bought at two-month intervals, so consecutive person-month observations are negatively correlated. Alternatively, a customer can be retained or lost by good or bad experiences, so consecutive person-month observations are positively correlated.
I think a good start on this prediction problem is to take a forecasting approach, where previous information informs our predictions about the next month's business. A simple first step is to adjust for a lagged effect: an indicator of whether a subject arrived in the previous month, used as a predictor of whether they arrive this month, as in the sketch below.
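A minimal sketch of that lagged-indicator idea in R; the data frame and its column names (id, month, bought) are made up for illustration, not taken from the question:

```r
## Hypothetical person-month data: 50 customers observed for 12 months.
set.seed(1)
d <- data.frame(
  id     = rep(1:50, each = 12),
  month  = rep(1:12, times = 50),
  bought = rbinom(600, 1, 0.3)
)

## Lag the purchase indicator by one month within each customer.
## The first month has no history, so its lag is NA and drops out.
d <- d[order(d$id, d$month), ]
d$bought_lag <- ave(d$bought, d$id,
                    FUN = function(x) c(NA, head(x, -1)))

## Logistic regression of this month's purchase on last month's.
fit <- glm(bought ~ bought_lag, family = binomial, data = d)
summary(fit)
```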
(1) One HR for multi-valued categorical variables
In the cited paper on colorectal cancer, some variables were listed as multi-valued categorical but had only single hazard ratios (HRs) presented in the Cox proportional-hazards survival analysis. The variables in question were: T (tumor size), N (tumor spread to lymph nodes), TNM stage (a combined assessment of disease severity), and the modified Glasgow Prognostic Score (mGPS). These are ordered clinical classifications presented as integer values, following standard practice (although TNM stage is usually represented by Roman numerals, and in some types of cancer there are subtypes a, b, etc., within certain classifications).
A single HR can be obtained for each of these classifications if they are treated as numeric values in the Cox regression. If each extra step along an ordered categorical scale contributes the same multiplicative change in the hazard, then there is no harm in treating that categorical variable as numeric, as in the sketch below.
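A short sketch with R's survival package showing the difference; the data are simulated, and "stage" is a hypothetical stand-in for a classification like TNM stage:

```r
library(survival)
set.seed(1)

## Simulated data: hazard increases by the same factor per stage step.
n      <- 200
stage  <- sample(1:3, n, replace = TRUE)
time   <- rexp(n, rate = 0.1 * exp(0.5 * (stage - 1)))
status <- rbinom(n, 1, 0.8)

## Numeric coding: one HR per unit step, one degree of freedom,
## assuming each step multiplies the hazard by the same factor.
fit_num <- coxph(Surv(time, status) ~ stage)
exp(coef(fit_num))

## Factor coding: one HR per level above the reference, two degrees
## of freedom here, and no equal-step assumption.
fit_fac <- coxph(Surv(time, status) ~ factor(stage))
exp(coef(fit_fac))
```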
Analysis as numeric variables is almost certainly why only a single HR is presented for each of these classifications; what is uncertain from my reading of the paper is whether this was a deliberate or an accidental choice. In a cursory reading of the paper I also did not find results of diagnostics or validations of the models that might have helped evaluate whether this was a good choice.
As you note, in colon cancer there is generally a large jump in mortality between TNM stage II and stage III, with much less of a jump between stage I and stage II. That would undercut a numeric analysis of stage. Figures 1C and 3C in the paper, on overall survival, seem to support your contention, but the figures on cancer-specific survival (1A and 3A) seem reasonably well fit by an equal-step-per-class model.
Whether the numeric treatment of such an ordered classification is appropriate can best be addressed in light of how well it models the data. As models are all approximations to reality at best, it's possible that the errors introduced by such numeric treatment are balanced by other advantages, provided that the validity of the model is not adversely affected. For example, numeric treatment of such a variable has the advantage of only using one degree of freedom, rather than 2 or more if treated as a multi-level factor.
(2) Cox regression and log-rank tests
As @Scortchi put it simply in a comment: "Log-rank tests are score tests for the hazard ratios from a Cox regression model with a single categorical predictor." I find these lecture notes to be a short explanation directly on point, with useful extensions to different tests. The tests are asymptotically equivalent, although their results of course may differ in any particular data set.
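A quick check in R with the survival package's built-in lung data: with Breslow handling of ties, the score test reported by coxph reproduces the log-rank chi-square from survdiff.

```r
library(survival)

## Log-rank test for a single two-level predictor.
survdiff(Surv(time, status) ~ sex, data = lung)

## The score test from the corresponding Cox model gives the same
## chi-square statistic (Breslow ties match the classic log-rank test).
fit <- coxph(Surv(time, status) ~ sex, data = lung, ties = "breslow")
summary(fit)$sctest
```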
Historical note: I discovered item (1) by accident when I was analyzing a data set to produce the unadjusted univariate hazard ratios that are still typically demanded by clinical journals despite their dubious usefulness. Having a large number of both categorical and numeric predictor variables, I wrote a script to automate the process, taking advantage of the ability of R to deal well with both types of variables. I was initially shocked to obtain only a single HR for each of T, N, and TNM stage. Of course these variables had been entered as numbers and, in my poorly designed script, R interpreted them as numeric variables even though I had been thinking of them as ordered categories. What was perhaps more surprising, until I thought about it, was that these numerically-treated categorical variables bore reasonable relations to outcome. Those HRs for the numeric treatment of the data made it into the final publication (carefully annotated as having been analyzed numerically, of course, along with reports on diagnostics and model validation).
Best Answer
Most likely you have a situation where, for a particular level of a categorical variable, some people have an event, but for the other levels no one does. That makes the HR infinite; the extremely large HR you found is the software's attempt to tell you the MLE is infinity. In logistic regression this is called "complete separation"; I assume the term still applies in survival analysis, but regardless of the name, you get the idea.
The root problem when there is complete separation is that your model is overfitting the data. Evidently, for some level of your categorical predictor, there is a very low chance of an event, so you would need a much larger sample size to estimate the HR. It's analogous to a contingency table with some very small expected counts. The sketch below reproduces the symptom.
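A small simulation in R, with made-up data mimicking a 74-observation set, where all events fall in one level of a two-level predictor; coxph stops at a huge coefficient with an enormous standard error, typically with a warning that the coefficient may be infinite:

```r
library(survival)
set.seed(1)

## 74 observations split into two groups; no events at all in group A.
n      <- 74
group  <- factor(rep(c("A", "B"), length.out = n))
time   <- rexp(n, 0.1)
status <- ifelse(group == "B", rbinom(n, 1, 0.7), 0)

## The partial likelihood is monotone in the coefficient for group B,
## so the MLE is +Inf: note the extreme HR and standard error.
fit <- coxph(Surv(time, status) ~ group)
summary(fit)
```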