Solved – Assessing and testing inter-rater agreement with the kappa statistic on a set of binary and Likert items

agreement-statistics, binary-data, cohens-kappa, likert

I am trying to calculate inter-rater reliability scores for 10 survey questions that were scored by 2 raters: 7 of the questions are binary (yes/no) and 3 are Likert-scale questions.

  1. Should inter-rater reliability be tested on EACH of the 10 questions, or is there an overall inter-rater reliability test that assesses the reliability of all questions at once? If so, what is it?

  2. For the binary questions, the agreement level between the two raters is 70-90% on nearly every question, yet the kappa score is often very poor (0.2 to 0.4). Can this be right? (And if so, is there a more appropriate test?)

  3. Finally, can you use a kappa-based test on Likert-scale questions? If not, what is the correct test for inter-rater reliability?

Best Answer

  1. With regard to whether you should compute agreement for each item, this depends somewhat on how you plan to analyse the data.
    • If you plan to compute scale scores (e.g., sum the binary responses or sum the Likert responses) to form a scale, then you could perform a reliability analysis on the scale scores. In this situation, you may have enough scale points to use procedures for inter-rater reliability assessment that assume numeric data, such as the intraclass correlation coefficient (ICC); an ICC sketch follows this list. Your overall evaluation of reliability would then focus on the scale score, and reliability analysis of individual items might just be used as a means of deciding which items to include in the composite scale (e.g., you could drop items with particularly low agreement).
    • If you plan to report individual items, then you would want to report kappa for each item. You may still find it useful to summarise these individual kappas in order to quickly communicate the general reliability of the items (e.g., report the range, mean, and SD of kappa across items).
  2. If you don't like the kappa values that you are getting, that in itself is not a reason not to use kappa (apologies for the triple negative).
    • It may be that your rules of thumb for interpreting Kappa are inappropriate.
    • Alternatively, it may be that the items are just not that reliable. High percentages of agreement can be obtained on skewed variables even when the two raters disagree about which cases fall in the minority category, because chance agreement is already high (a worked kappa sketch follows this list). In general, individual items will be less reliable than composite scales. Also, some binary judgements are quite clear-cut (e.g., gender), but where a judge is being asked whether an object passes some threshold, ratings might be more reliable if judges were instead asked to rate on a continuum.
  3. You can use an ordinal (weighted) kappa on Likert items; a weighted-kappa sketch follows this list. @chl has an excellent discussion of the issues and alternatives here.
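
To make the scale-score option in point 1 concrete, here is a minimal sketch of an ICC on composite scale scores. It assumes the Python packages pandas and pingouin are available; the subjects, raters, and scores are hypothetical placeholders.

```python
import pandas as pd
import pingouin as pg  # assumed available; provides intraclass_corr()

# Hypothetical composite scale scores: each subject is scored by both raters,
# in long format (one row per subject-rater pair).
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater":   ["A", "B"] * 6,
    "score":   [12, 11, 8, 9, 15, 14, 6, 7, 10, 10, 13, 12],
})

# Returns a table of the usual ICC variants (single-rater and average-rater);
# pick the one that matches your design.
icc = pg.intraclass_corr(data=df, targets="subject", raters="rater",
                         ratings="score")
print(icc)
```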
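To see why high percent agreement can coexist with a low kappa (point 2), consider this sketch for a single skewed binary item. The ratings are hypothetical, and kappa is computed with scikit-learn's cohen_kappa_score, which is assumed to be installed.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings for one skewed binary item (most cases are "no" = 0).
rater1 = [0] * 15 + [1, 1, 1, 0, 0]   # 17 "no", 3 "yes"
rater2 = [0] * 15 + [1, 0, 0, 1, 0]   # 18 "no", 2 "yes"

# Raw percent agreement is high ...
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"percent agreement: {agreement:.2f}")   # 0.85

# ... but chance agreement is also high (0.78 here, because both raters say
# "no" most of the time), so kappa comes out modest (about 0.32).
print(f"Cohen's kappa:     {cohen_kappa_score(rater1, rater2):.2f}")
```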
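For the Likert items in point 3, one option is a weighted (ordinal) kappa, which gives partial credit for near-misses such as a 4 versus a 5 rather than treating them as full disagreements. A minimal sketch with hypothetical 5-point ratings, again using scikit-learn's cohen_kappa_score (its weights argument accepts "linear" or "quadratic"):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point Likert ratings from the two raters.
likert1 = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
likert2 = [4, 4, 5, 3, 2, 5, 2, 2, 4, 3]

print(cohen_kappa_score(likert1, likert2))                       # unweighted
print(cohen_kappa_score(likert1, likert2, weights="linear"))     # linear weights
print(cohen_kappa_score(likert1, likert2, weights="quadratic"))  # quadratic weights
```

The quadratic weighting penalises large disagreements more heavily than small ones, which is often what you want for ordered response categories.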