I have read about what censoring is and how it needs to be accounted for in survival analysis, but I would like to hear a less mathematical, more intuitive definition of it (pictures would be great!). Can anyone provide me with an explanation of 1) censoring and 2) how it affects things like Kaplan-Meier curves and Cox regression?
Survival Analysis – Layman’s Explanation of Censoring in Survival Analysis
censoring, cox-model, survival
Related Solutions
This is a very good example of non-proportional hazards OR the effect of 'depletion' in survival analysis. I will try to explain.
First, take a good look at your Kaplan-Meier (KM) curve: in the first part (until around 3000 days), the proportion of males still alive in the population at risk at time t is larger than the proportion of females (i.e. the blue line is 'higher' than the red one). This means that male gender is indeed 'protective' for the event studied (death). Accordingly, the hazard ratio should be between 0 and 1 (and the coefficient should be negative).
However, after day 3000, the red line is higher! This would suggest the opposite. Based on this KM graph alone, this would further suggest a non-proportional hazard. In this case 'non-proportional' means that the effect of your independent variable (gender) is not constant over time. In other words, the hazard ratio is liable to change as time progresses. As explained above, this seems to be the case. The regular proportional hazards Cox model does not accommodate such effects. In fact, one of its main assumptions is that the hazards are proportional! You can model non-proportional hazards as well, but that is beyond the scope of this answer.
There is one additional comment to make: this difference could be due to the true hazards being non-proportional, or to the large variance in the tail estimates of the KM curves. Note that by this point in time the total group of 348 patients will have declined to a very small population still at risk. As you can see, both gender groups have patients experiencing the event and patients being censored (the vertical lines). As the population at risk declines, the survival estimates become less certain. If you had plotted 95% confidence intervals around the KM lines, you would see the width of the confidence interval increasing. This is important for the estimation of hazards as well. Put simply, because the population at risk and the number of events in the final period of your study are low, this period contributes less to the estimates in your initial Cox model.
Finally, this would explain why the hazard ratio (assumed constant over time) is more in line with the first part of your KM curve than with the final endpoint.
EDIT: see @Scrotchi's spot-on comment to the original question. As stated, the effect of low numbers in the final period of the study is that the estimates of the hazards at those points in time are uncertain. Consequently, you are also less certain whether the apparent violation of the proportional hazards assumption is real or due to chance. As @Scrotchi states, the PH assumption may not be that bad.
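The growing uncertainty in the tail can be illustrated with a small simulation. The sketch below is not the original data: it assumes exponential event times and uniform censoring times (both made up for illustration), keeps only the group size of 348 from the question, and computes the Kaplan-Meier estimate together with Greenwood's variance formula. It reports the relative standard error (standard error divided by the survival estimate), which under Greenwood's formula can only grow as events accumulate and the risk set shrinks.

```python
import math
import random

def km_with_greenwood(times, events):
    """Kaplan-Meier estimate with Greenwood's variance formula.

    times:  observed durations (event or censoring time, whichever came first)
    events: True if the event was observed, False if right-censored
    Returns (time, n_at_risk, survival, relative_std_error) per distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, var_sum = 1.0, 0.0
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone who leaves the risk set at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths == n_at_risk:
            break  # Greenwood's formula is undefined once everyone at risk fails
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            var_sum += deaths / (n_at_risk * (n_at_risk - deaths))
            rows.append((t, n_at_risk, surv, math.sqrt(var_sum)))
        n_at_risk -= removed
    return rows

random.seed(42)
N = 348  # group size quoted above; everything else here is simulated
event_t = [random.expovariate(1 / 2000) for _ in range(N)]   # hypothetical event times (days)
censor_t = [random.uniform(0, 5000) for _ in range(N)]       # hypothetical censoring times
times = [min(e, c) for e, c in zip(event_t, censor_t)]
events = [e <= c for e, c in zip(event_t, censor_t)]

rows = km_with_greenwood(times, events)
for t, n, s, rel_se in rows[::max(1, len(rows) // 8)]:
    print(f"day {t:7.0f}  at risk {n:3d}  S(t) {s:.3f}  SE {s * rel_se:.3f}  relative SE {rel_se:.3f}")
```

The number at risk falls steadily while the relative standard error climbs, which is why a late crossing of two KM curves deserves less weight than a clear separation early on.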
Best Answer
Censoring is often described in comparison with truncation. A nice description of the two processes is provided by Gelman et al. (2005, p. 235):
Censoring or truncation can occur for values above some level (right-censoring), below some level (left-censoring), or both.
Below you can find an example of a standard normal distribution that is censored at $2.0$ (middle) or truncated at $2.0$ (right). If the sample is truncated, we have no data beyond the truncation point; in a censored sample, values above the cutoff are "rounded" to the boundary value, so the boundary value is over-represented in your sample.
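The difference is easy to reproduce. Here is a minimal sketch using only Python's standard library; the sample size and seed are arbitrary choices, not taken from the plot.

```python
import random

random.seed(0)
CUTOFF = 2.0

# Full sample from a standard normal distribution.
full = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Truncation: observations above the cutoff are dropped entirely,
# so we do not even know how many there were.
truncated = [x for x in full if x <= CUTOFF]

# Censoring: observations above the cutoff stay in the sample,
# but their value is recorded as the cutoff itself.
censored = [min(x, CUTOFF) for x in full]

n_above = sum(x > CUTOFF for x in full)
n_at_boundary = sum(x == CUTOFF for x in censored)

print("sample sizes (full, truncated, censored):", len(full), len(truncated), len(censored))
print("values above the cutoff in the full sample:", n_above)
print("values piled up at the cutoff after censoring:", n_at_boundary)
```

The censored sample keeps its original size and shows a spike at the boundary, while the truncated sample is simply smaller.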
An intuitive example of censoring: you ask your respondents about their age but record it only up to some value, so that all ages above this value, say 60 years, are recorded as "60+". This leaves you with precise information for the non-censored values and only a lower bound for the censored ones.
A less typical real-life example of censoring was observed in Polish matura exam scores, which attracted quite a lot of attention on the internet. The exam is taken at the end of high school, and students must pass it in order to be able to apply for higher education. Can you guess from the plot below the minimum number of points that students need to pass the exam? Not surprisingly, the "gap" in the otherwise normal distribution can easily be "filled in" if you take an appropriate fraction of the over-represented scores just above the censoring boundary.
In the case of survival analysis
(Kleinbaum and Klein, 2005, p. 5). For example, you treat patients with some drug and observe them until the end of your study, but you have no knowledge of what happens to them after the study finishes (were there any relapses or side effects?); the only thing you know is that they "survived" at least until the end of the study.
Below you can find an example of data generated from a Weibull distribution and modeled using the Kaplan–Meier estimator. The blue curve marks the model estimated on the full dataset; the middle plot shows the censored sample and the model estimated on the censored data (red curve); the right plot shows the truncated sample and the model estimated on that sample (red curve). As you can see, missing data (truncation) has a significant impact on the estimates, but censoring can be easily managed using standard survival analysis models.
This does not mean that you cannot analyze truncated samples, but in such cases you have to use models for missing data that try to "guess" the unknown information.
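The Kaplan–Meier comparison of full, censored, and truncated samples can be sketched as follows. This is a hand-rolled estimator for illustration, not the code behind the plots; the Weibull parameters, the sample size, and the fixed end-of-study time are all assumptions, as are the helper names `kaplan_meier` and `survival_at`.

```python
import random

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  observed durations
    events: True if the event was observed, False if right-censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Censored subjects leave the risk set without changing S(t).
        n_at_risk -= removed
    return curve

def survival_at(curve, t):
    """Step-function lookup of the estimated survival probability at time t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s

random.seed(1)
END_OF_STUDY = 1.5  # assumed follow-up limit
true_times = [random.weibullvariate(1.0, 1.5) for _ in range(5000)]  # scale 1.0, shape 1.5

# Full data: every event time is observed.
full_curve = kaplan_meier(true_times, [True] * len(true_times))

# Censored data: subjects still event-free at the end of the study are kept,
# flagged as censored, with their time recorded as END_OF_STUDY.
cens_times = [min(t, END_OF_STUDY) for t in true_times]
cens_events = [t <= END_OF_STUDY for t in true_times]
cens_curve = kaplan_meier(cens_times, cens_events)

# Truncated data: those subjects are missing from the dataset altogether.
trunc_times = [t for t in true_times if t <= END_OF_STUDY]
trunc_curve = kaplan_meier(trunc_times, [True] * len(trunc_times))

for t in (0.5, 1.0, 1.4):
    print(t, survival_at(full_curve, t), survival_at(cens_curve, t), survival_at(trunc_curve, t))
```

In this setup the censored curve coincides with the full-data curve up to the end of the study, because censored subjects still count in the risk set for as long as they were observed. The truncated curve drops too quickly, since the long survivors have silently vanished from the denominator.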
Kleinbaum, D.G. and Klein, M. (2005). Survival Analysis: A Self-Learning Text. Springer.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (2005). Bayesian Data Analysis. Chapman & Hall/CRC.