I'd rather answer on the grounds of the methodology itself than on how to "fix" the situation. In another context, I helped build a ratings and classification system and found that inter-rater agreement was disappointingly low. Two paths were considered:
- Change how rating agreements were defined and identify those who seemed to "understand" the task, or
- Refine the definitions used, along with the guidance and examples provided to raters, so that they could more easily understand how to rate things.
In the first scenario, the whole methodology and its results could be rendered a waste simply because the inter-rater reliability was low: it indicated that either the original definitions were bad or the raters were given poor instructions. If I had proceeded along that path, I was sure to have problems.
In the second case, the agreement between raters was very good. Since they rated quite a lot of items, they could also give feedback when they thought the original definitions and guidance were inadequate. In the end, the methodology was very reproducible.
Based on that, I would not modify your set of raters yet, but return to the original definitions and guidance. Any tinkering after the rating is done is a problem, though it can be useful as a quality check. There will sometimes be raters who do what they want, no matter the guidance given. With good statistical methods, it is easy to identify them and weight their contributions appropriately.
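If it helps to see that concretely, here is a minimal sketch in Python on simulated data (a made-up items × raters matrix; the layout and numbers are placeholders, not your data). It uses one simple option, correlating each rater with the mean of the other raters, to flag a divergent rater and down-weight them:

```python
import numpy as np

# Simulated placeholder data: rows are rated items, columns are raters.
rng = np.random.default_rng(0)
true_score = rng.normal(size=(50, 1))
ratings = true_score + rng.normal(scale=0.5, size=(50, 8))
ratings[:, 7] = rng.normal(size=50)   # one rater who ignores the guidance

n_items, n_raters = ratings.shape
agreement = np.empty(n_raters)
for j in range(n_raters):
    # Correlate each rater with the mean of all *other* raters
    # (a simple item-total style index of agreement with the consensus).
    others = np.delete(ratings, j, axis=1).mean(axis=1)
    agreement[j] = np.corrcoef(ratings[:, j], others)[0, 1]

# Down-weight raters whose agreement with the consensus is poor.
weights = np.clip(agreement, 0, None)
weights /= weights.sum()
consensus = ratings @ weights         # weighted consensus score per item

print(np.round(agreement, 2))         # the deviant rater should stand out
print(np.round(weights, 2))
```

Whether you zero such raters out or merely shrink their weight is a judgment call that depends on how many raters you have and how costly their data were to collect.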
Now, if I'm mistaken and you don't plan to do further collection, i.e. your data are already collected and done, you could run PCA or something like it and see whether you can get a sense of how the different doctors (or patients) cluster.
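As a rough illustration of that idea, here is a minimal sketch assuming your ratings can be arranged as a doctors × items matrix; the shape, variable names, and simulated values are placeholders rather than anything from your study:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical layout: one row per doctor, one column per rating item
# (or per patient's score) -- adapt to however your data are organised.
rng = np.random.default_rng(1)
scores = rng.normal(size=(12, 20))

# PCA centres the columns internally; two components are enough for a
# quick check of how the doctors group together.
pca = PCA(n_components=2)
coords = pca.fit_transform(scores)

for doctor, (pc1, pc2) in enumerate(coords):
    print(f"doctor {doctor}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")
print("variance explained:", np.round(pca.explained_variance_ratio_, 2))
```

Plotting the two components and colouring the points by doctor (or patient) characteristics is usually the quickest way to spot clusters or outliers.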
Were the patients exposed to all of the doctors at the same time (e.g. through a video recording), or were they exposed sequentially, with a chance to modify their presentation at each interaction? If the latter, the issue could lie with the patients rather than the doctors.
From my initial readings, there are rather extensive debates (e.g., here, and here) around the measurement of reliable change. However, at the risk of not being fully cognizant of the nuances of such debates, your second approach (MDC) seems reasonable, and your first approach (non-overlapping confidence intervals) does not.
You are presumably trying to rule out the null hypothesis that the change between two measurements is zero, given that there has been some error in measuring the variable of interest at each time point. In this sense the problem is analogous to an independent-groups t-test where the denominator is $\sqrt{SE^2_a + SE^2_b}$, which simplifies to $\sqrt{2}\,SE$ when $SE^2_a$ and $SE^2_b$ are equal. This provides a standard error of the measurement of change, which is presumably what you want.
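To make the arithmetic concrete, here is a minimal sketch assuming the usual formulas $SEM = SD\sqrt{1 - r}$, $SE_{diff} = \sqrt{2}\,SEM$, and $MDC_{95} = 1.96\,SE_{diff}$; the input numbers are placeholders, not values from your data:

```python
import math

# Hypothetical inputs -- replace with your own values.
sd_baseline = 10.0   # SD of the measure at baseline
reliability = 0.85   # e.g. a test-retest ICC
score_t1, score_t2 = 42.0, 51.0

# Standard error of measurement and of a difference score
# (SE_diff = sqrt(SE_a^2 + SE_b^2) = sqrt(2) * SEM when the two SEs are equal).
sem = sd_baseline * math.sqrt(1 - reliability)
se_diff = math.sqrt(2) * sem

# Minimal detectable change at the 95% level, and the observed change
# expressed as a z-like ratio (the Jacobson-Truax reliable change index).
mdc95 = 1.96 * se_diff
rci = (score_t2 - score_t1) / se_diff

print(f"SEM = {sem:.2f}, SE_diff = {se_diff:.2f}, "
      f"MDC95 = {mdc95:.2f}, RCI = {rci:.2f}")
```

An observed change larger than MDC95 (equivalently, |RCI| > 1.96) is the usual criterion for change beyond measurement error.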
Best Answer
You should use the point estimate of the reliability, not the lower bound or anything like that. I guess by lb/up you mean the 95% CI for the ICC (I don't have SPSS, so I cannot check myself)? It's unfortunate that we also speak of Cronbach's alpha as a "lower bound for reliability", since this might have confused you.
It should be noted that this formula is not restricted to an estimate of the ICC; in fact, you can plug in any "valid" measure of reliability (most of the time it is Cronbach's alpha that is used). Apart from the NCME tutorial that I linked to in my comment, you might be interested in this recent article:
Although it might seem at first sight to barely address your question, it has some additional material showing how to compute the SEM (there with Cronbach's $\alpha$, but it is straightforward to adapt to the ICC); and, anyway, it's always interesting to look around to see how people use the SEM.
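For completeness, here is a minimal sketch of that computation, assuming the usual formula $SEM = SD\sqrt{1 - \text{reliability}}$ with the ICC point estimate plugged in; the numbers are placeholders rather than output from your SPSS run:

```python
import math

# Hypothetical values read off the reliability output -- replace with your own.
icc_point = 0.91      # point estimate of the ICC (not the CI bounds)
sd_observed = 7.4     # SD of the observed scores

# SEM = SD * sqrt(1 - reliability); any defensible reliability estimate
# (ICC, Cronbach's alpha, ...) can be substituted for icc_point.
sem = sd_observed * math.sqrt(1 - icc_point)
print(f"SEM = {sem:.2f}")
```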