Survival Analysis – Why Hazard Rate from Aalen Model Depends on Number of Factor Levels

cox-model survival

As a follow-up to "How to understand Aalen additive regression model?", I applied this method to my own data (which are, unfortunately, confidential). I cannot reveal what the data represent, but I don't think it matters here.

img

img2

img3

You can see here the hazard rate(?) separated by country. I noticed that the country in the center of the bottom row (UK) has quite a high hazard rate, so I removed it from the data just to see what would happen. Another country then showed even larger rates, so I removed that one, too. After that, two other countries had higher values than ever before.
What is going on? This doesn't seem robust to me.
I realized that

edit: Before I forget: for the United Kingdom, there is only one subject. Other very "fragile" countries are Macedonia (with two subjects) and the Czech Republic (with four subjects). In total, there are about 305 subjects.
The syntax of the fit is:

library(survival)
# Drop rows with an unknown ("?") or empty country before fitting.
fit <- survfit(Surv(time, event) ~ country,
               data = df[df$country != "?" & df$country != "", ])

Best Answer

Aalen models are known to be unstable at the latest times, as there are few individuals left at risk for the regressions at each of the later event times. With a single UK individual whose event (if it happens at all) is at a very late time, this behavior isn't that surprising.
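To make the point about thin risk sets concrete, here is a minimal sketch in base R that counts how many subjects per group are still at risk at each event time. The data are simulated and purely hypothetical (`toy`, the group labels `A`, `B`, `C`, and the group sizes 30, 4, and 1 are illustrative assumptions, not the asker's data); the column names `time`, `event`, and `country` follow the question.

```r
# Hypothetical toy data: one large group, one small group, one single-subject group.
set.seed(1)
toy <- data.frame(
  country = rep(c("A", "B", "C"), times = c(30, 4, 1)),
  time    = round(rexp(35, rate = 0.1), 1),
  event   = rbinom(35, 1, 0.8)   # assumed coding: 1 = event, 0 = censored
)

# Distinct event times, in increasing order.
event_times <- sort(unique(toy$time[toy$event == 1]))

# Subjects still at risk (time >= t) in each country at each event time.
at_risk <- sapply(event_times, function(t) {
  table(factor(toy$country[toy$time >= t], levels = c("A", "B", "C")))
})
colnames(at_risk) <- event_times

# The last few event times show how few subjects remain for each regression.
print(at_risk[, tail(seq_len(ncol(at_risk)), 5), drop = FALSE])
```

Each column corresponds to one event time, that is, to one of the repeated regressions. Where a row drops to zero or one subject, the increment for that level rests on almost no data.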

You also are probably overfitting your data. You evidently have 40 individual levels of a single categorical predictor (the reference level is what's shown in the Intercept). The usual rule of thumb in survival analysis is to have 10-20 events per coefficient that you are estimating. You should thus have 400-800 events for this model. If not all individuals experienced events, you presumably have fewer than 300 events. Such overfitting also can lead to the behavior that you describe, with high variability in predictions resulting from small changes in data. My sense is that this is even more of a problem in Aalen models with their repeated regressions at each event time, although I have almost no practical experience with them.
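The arithmetic behind that rule of thumb can be written out as a short sketch. The counts come from this thread (40 levels, about 305 subjects); since the actual number of events was not reported, the subject count is used as an upper bound. The helper `events_per_level()` is a hypothetical convenience function, and it assumes `event` is coded 1 for an event and 0 for censoring.

```r
n_coef     <- 40    # one coefficient per level, counting the intercept (reference level)
n_subjects <- 305   # approximate total from the question; events cannot exceed this

# Events required under the 10-20 events-per-coefficient rule of thumb.
needed <- n_coef * c(low = 10, high = 20)

# Best case: every subject experienced the event.
max_events_per_coef <- n_subjects / n_coef

print(needed)               # 400 and 800
print(max_events_per_coef)  # 7.625

# Hypothetical helper: observed events per factor level in a data frame
# with columns `event` and `country`, as in the question.
events_per_level <- function(d) {
  sort(tapply(d$event, d$country, sum))
}
```

Even in the best case the data supply fewer than 8 events per coefficient, below the lower end of the rule. Running `events_per_level(df)` on the real data would show which levels contribute almost nothing.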
