These are distinct ways of accounting for rater or item variance within the overall variance, following Shrout and Fleiss (1979) (cases 1 to 3 in their Table 1); the corresponding single-rater formulas are sketched after this list:
- One-way random effects model: raters are considered as sampled from a larger pool of potential raters, hence they are treated as random effects; the ICC is then interpreted as the % of total variance accounted for by subject/item variance. This corresponds to Shrout and Fleiss's case 1, ICC(1).
- Two-way random effects model: both factors -- raters and items/subjects -- are viewed as random effects, and we have two variance components (or mean squares) in addition to the residual variance; we further assume that raters assess all items/subjects; in this case the ICC gives the % of total variance (rater variance included) that is attributable to items/subjects.
- Two-way mixed model: contrary to the preceding approaches, here raters are considered as fixed effects (no generalization beyond the sample at hand) while items/subjects are treated as random effects; the unit of analysis may be the single rating or the average of the ratings.
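For reference, a sketch of the corresponding single-rater (single-measure) formulas in terms of the usual ANOVA mean squares, with $n$ subjects and $k$ raters, where $BMS$, $WMS$, $JMS$, and $EMS$ denote the between-subject, within-subject, between-rater (judge), and residual mean squares:

$$\mathrm{ICC}(1,1) = \frac{BMS - WMS}{BMS + (k-1)\,WMS}$$

$$\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k-1)\,EMS + k\,(JMS - EMS)/n}$$

$$\mathrm{ICC}(3,1) = \frac{BMS - EMS}{BMS + (k-1)\,EMS}$$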
I would say raters have to be entered as columns, although I'm not an SPSS specialist.
Dave Garson's dedicated website is worth looking at for those working with SPSS. There is also a complete on-line tutorial on reliability analysis (Robert A. Yaffee) [archived version].
For theoretical consideration about the mixed-effect approach, please consider reading my answer to this related question: Reliability in Elicitation Exercise.
I'd rather answer on the grounds of the methodology itself than on how to "fix" the situation. In another context, I assisted in working on a ratings and classification system and found that inter-rater agreement was disappointingly low. Two paths were considered:
- Change how rating agreement was defined and identify the raters who seemed to "understand" the task, or
- Refine the definitions used, along with the guidance and examples provided to raters, so that they could more easily understand how to rate things.
In the first scenario, the whole methodology and results could be rendered a waste simply because the inter-rater reliability was low. It indicated that either the original definitions were bad or that the raters were given poor instructions. If I proceeded along that path, I was sure to have problems.
In the second case, the agreement between raters was very good. Since they rated quite a lot of items, they could also give feedback when they thought the original definitions and guidance were inadequate. In the end, the methodology was very reproducible.
Based on that, I would not yet modify your set of raters, but return to the original definitions and guidance. Any tinkering after the rating is a problem, though it can be useful as a quality check. There are sometimes raters that are going to do what they want, no matter the guidance given. With good statistical methods, it is easy to identify them and weight their contributions appropriately.
Now, if I'm mistaken and you don't plan to do further collection, i.e. your data are already collected, then what you may do is run a PCA or something like it and see if you can get a sense of how the different doctors (or patients) cluster; a rough sketch follows.
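As an illustration, here is a minimal Python sketch with a hypothetical, complete ratings matrix (in practice you would first deal with missing ratings, e.g. by restricting the analysis to complete cases):

```python
import numpy as np

# Hypothetical ratings matrix: rows = patients, columns = doctors (raters).
ratings = np.array([
    [3.0, 4.0, 3.0, 5.0],
    [2.0, 2.0, 3.0, 2.0],
    [4.0, 5.0, 4.0, 5.0],
    [1.0, 2.0, 1.0, 3.0],
    [3.0, 3.0, 4.0, 4.0],
])

# PCA via SVD on the column-centered matrix: inspect how the raters
# load on the leading components to see whether they cluster.
centered = ratings - ratings.mean(axis=0, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

explained = s**2 / np.sum(s**2)   # proportion of variance per component
loadings = vt.T * s               # raters x components

print("explained variance ratios:", np.round(explained, 3))
print("rater loadings (first two components):\n", np.round(loadings[:, :2], 2))
```

Raters whose loadings sit close together on the first component or two are behaving similarly; an isolated rater is a candidate for the "does what they want" group mentioned above.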
Were the patients exposed to all of the doctors at the same time (e.g. through a video recording), or were they exposed sequentially, with a chance to modify their presentation at each interaction? If the latter, then there could be issues with the patients, and not the doctors.
Best Answer
First, make sure your data is set up properly. Second, make sure that each video was rated by two or more raters. Third, make sure you are using the proper ICC formulation. Lastly, make sure you are using the ICC function properly.
You should set up your data as a matrix with each rater as a separate column and each video as a separate row, so you should end up with a $69\times130$ matrix. Make sure that the missing cells (i.e., rater-video combinations that were never rated) are marked as "missing" in whatever program you are using.
Most functions also require each video to be rated by two or more raters; videos that don't meet this requirement should be dropped (or a specialized function that accounts for them needs to be used). A sketch of both steps is below.
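As a rough illustration of both points, here is a minimal Python sketch with hypothetical data (the toy scores are made up; the real matrix would be videos by raters, with most cells missing):

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (video, rater, score).
long = pd.DataFrame({
    "video": [1, 1, 2, 2, 2, 3, 4, 4],
    "rater": ["r1", "r2", "r1", "r3", "r4", "r2", "r3", "r4"],
    "score": [3, 4, 2, 2, 3, 5, 4, 4],
})

# Reshape into a videos-by-raters matrix; combinations that were never
# rated automatically become NaN, i.e. "missing".
wide = long.pivot(index="video", columns="rater", values="score")

# Keep only videos rated by at least two raters (video 3 is dropped here).
wide = wide[wide.notna().sum(axis=1) >= 2]
print(wide)
```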
I would not recommend ICC(1) for your described purpose. This formulation assumes that raters are not a meaningful source of variance, which they almost certainly are. Instead, you should use ICC(A,1) or ICC(C,1); these are sometimes called two-way mixed models for single measures, with either absolute agreement or consistency. Use ICC(A,1) if you want the raters to use the exact same scores and be fully interchangeable; use ICC(C,1) if you're okay with each rater having their own mean.
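To make the distinction concrete, here is a minimal Python sketch of the single-measure formulas (McGraw and Wong's ICC(A,1) and ICC(C,1)) computed from the two-way ANOVA mean squares. It assumes a complete subjects-by-raters matrix with no missing cells, so for a sparse design like yours you would still need a function that handles missingness:

```python
import numpy as np

def icc_single(ratings):
    """ICC(A,1) and ICC(C,1) for a complete subjects-by-raters matrix (no NaNs)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # one mean per subject/video
    col_means = ratings.mean(axis=0)   # one mean per rater

    # Two-way ANOVA mean squares.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)    # subjects (rows)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)    # raters (columns)
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))          # residual

    icc_c1 = (msr - mse) / (msr + (k - 1) * mse)
    icc_a1 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return icc_a1, icc_c1

# Toy example: 5 videos rated by 3 raters.
scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 4, 3],
], dtype=float)
print(icc_single(scores))  # (ICC(A,1), ICC(C,1))
```

In this toy example the raters rank the videos almost identically but the second rater scores a bit higher overall, so ICC(C,1) comes out higher than ICC(A,1): only the absolute-agreement version penalizes that mean offset.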
This will depend on the exact function you are using, but make sure you send the data to the function in the exact way that it expects.
If you use MATLAB, I have made the ICC_A_1 and ICC_C_1 functions available.
If you use SPSS, you can use the RELIABILITY procedure.
Give it a shot; if you run into problems, feel free to send me the data and I can do it for you (or at least seeing the data will give me more insight into the problem).