Solved – Intraclass correlation coefficient interpretation

agreement-statistics, small-sample, spss

I'm having a look at the intraclass correlation coefficient in SPSS.

Data: 17 participants rated two lists of 9 & 7 items from 0 to 5 (0 being unimportant and 5 being very important).

All participants rated all the items and the participants are a sample of a large population.

The following output has been produced in SPSS.

[SPSS ICC output table: 7-item case]

[SPSS ICC output table: 9-item case]

I am struggling to find anything online which deals with interpreting this, and no book I have found covers it at the level of detail I need.

There is surprisingly little information about, or examples of, the interpretation online; the literature is about choosing an intraclass correlation coefficient, not interpreting it.

One problem I foresee here is the F test: I have only 17 responses, and the data do not follow the normality assumption.

Best Answer

I am struggling to find anything online which deals with interpreting this

The output you present is from the SPSS Reliability Analysis procedure. Here you had some variables (the items), which act as raters or judges for you, and 17 subjects or objects which were rated. Your focus was to assess inter-rater agreement by means of the intraclass correlation coefficient.

In the 1st example you tested p=7 raters, and in the 2nd you tested p=9.

More importantly, your two outputs differ in how the raters are considered. In the 1st example, the raters are a fixed factor, which means they are the entire population of raters for you: you infer only about these specific raters. In the 2nd example, the raters are a random factor, which means they are a random sample of raters for you, and you want to infer about the population of all possible raters which those 9 are meant to represent.

The 17 subjects that were rated constitute a random sample of the population of subjects. And, since each rater rated all 17 subjects, both models are complete two-way (two-factor) models: one is fixed + random = mixed model, the other is random + random = random model.
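In standard textbook notation (this equation is not part of the SPSS output, it is just the usual way of writing such a complete two-way layout), each rating can be written as

$$x_{ij} = \mu + s_i + r_j + e_{ij},$$

where $s_i$ is the (random) effect of subject $i$, $r_j$ is the effect of rater $j$ (fixed constants in the mixed model, random draws from a rater population in the random model), and $e_{ij}$ is the residual.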

Also, in both instances you requested to assess the consistency between raters, that is, how well their ratings correlate, rather than the absolute agreement between them, that is, how nearly identical their scores are. When consistency is measured, the Average Measures ICC (see the tables) is identical to Cronbach's alpha. The Average Measures ICC tells you how reliably the group of p raters agrees as a whole; the Single Measures ICC tells you how reliable it is for you to rely on just one rater. For, if you know the agreement is high, you might choose to inquire from just one rater for that sort of task.
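If you want to check that identity outside SPSS, here is a minimal sketch in Python, assuming the standard two-way consistency formulas computed from the ANOVA mean squares (ICC(C,1) and ICC(C,k) in McGraw & Wong's notation, which is, as far as I know, what these tables report). The ratings matrix below is made up purely for illustration, not your data.

import numpy as np

# Hypothetical ratings: n = 17 subjects (rows) rated by p = 7 raters/items (columns),
# on a 0-5 scale, mirroring the layout of the question's data.
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(17, 7)).astype(float)
n, p = X.shape

# Two-way ANOVA decomposition (complete design, one rating per subject-rater cell)
grand_mean = X.mean()
ss_subjects = p * ((X.mean(axis=1) - grand_mean) ** 2).sum()
ss_raters = n * ((X.mean(axis=0) - grand_mean) ** 2).sum()
ss_total = ((X - grand_mean) ** 2).sum()
ss_resid = ss_total - ss_subjects - ss_raters

ms_subjects = ss_subjects / (n - 1)
ms_resid = ss_resid / ((n - 1) * (p - 1))

# Consistency ICCs; the algebra is identical for the two-way mixed and two-way random models
icc_single = (ms_subjects - ms_resid) / (ms_subjects + (p - 1) * ms_resid)  # Single Measures
icc_average = (ms_subjects - ms_resid) / ms_subjects                        # Average Measures

# Cronbach's alpha computed directly from the item variances
alpha = p / (p - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

print(round(icc_single, 4), round(icc_average, 4), round(alpha, 4))

The last two printed numbers coincide, which is the ICC-alpha identity mentioned above; between the mixed and the random model only the interpretation changes, not these numbers.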

If you tested the same number of the same raters (and the same subjects) under both models, you'd see that the estimates in the table are the same under both models. However, as I've said, the interpretation differs: you can generalize the conclusion about the agreement onto the whole population of raters only with the two-way random model. You can also see a footnote saying that the mixed model assumes there is no rater-subject interaction; to put it more clearly, it means that the raters have no individual partialities towards subjects' characteristics that are irrelevant to the rated task (e.g. towards the hair colour of an examinee).
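In the notation above, an interaction would add a term $(sr)_{ij}$ to the model, $x_{ij} = \mu + s_i + r_j + (sr)_{ij} + e_{ij}$. With only one rating per subject-rater cell that term cannot be separated from the residual $e_{ij}$, which is why the no-interaction requirement appears only as an assumption in the footnote rather than as something the procedure can test.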

The SPSS Reliability Analysis procedure assumes additivity of scores (which logically implies interval or dichotomous, but not ordinal, level of data) and bivariate normality between items/raters. However, the F test is quite robust.
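Regarding your worry about the F test: as I read the output (this is my understanding, not something stated in the tables), the F value reported next to the ICC with test value 0 is simply

$$F = \frac{MS_{\text{subjects}}}{MS_{\text{residual}}}, \qquad df_1 = n - 1 = 16, \quad df_2 = (n - 1)(p - 1),$$

i.e. the between-subjects mean square over the residual mean square from the same two-way ANOVA, testing the null hypothesis that the population ICC is zero. With 17 subjects and 0-5 ratings the test is approximate, but, as said, it is reasonably robust to such departures from normality.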