Solved – Fleiss’ Kappa v. Cohen’s Kappa v. Cronbach’s Alpha

agreement-statistics, cohens-kappa, cronbachs-alpha

I am working on a project that compares two methods of assessing whether certain topics come up in a therapy session. I am comparing a checklist of dichotomous (y/n) items completed by an observer who listens to the therapy session with the same checklist completed by the therapist after the session. Five observers are randomly assigned therapy sessions to observe. My goal is to assess the agreement between the two ways of completing the checklist.

I've thought of a few ways to do this:

Cohen's Kappa

  • Meets all assumptions except: the same two raters are not used for all observations.

Fleiss' Kappa

  • Meets all assumptions except: the targets being rated are not technically a random sample from a population. The therapists chose to be in the study and were not randomly selected; the raters, however, were randomly assigned to observe different sessions.

Cronbach's Alpha (Specifically Kuder-Richardson Formula 20 (KR-20))

  • I'm not sure if this technically answers the question I'm hoping to answer.

Can anyone advise on what to do? Or, if you know of any papers that do a similar thing, please send article titles. Thanks so much!

P.S. (and somewhat unrelated): Why would Fleiss' kappa be so different from Cohen's kappa?

Best Answer

You may want to consider using a quasi-independence model here.

I didn't see the length of the questionnaire mentioned, so I'll assume it is 30 y/n questions for this description; the exact number is inconsequential. If you model this as a comparison of two methods (questionnaire completed at the end of the session, M1, vs. questionnaire completed during recorded playback, M2), you can represent the results as counts in a 2x2x30 (M1 x M2 x Q) contingency table.
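
To make that layout concrete, here is a minimal sketch in Python/pandas. The data frame `df`, its column names (`session`, `q`, `m1`, `m2`), and the toy values are all hypothetical placeholders for however your data are actually stored:

```python
import pandas as pd

# Toy long-format data (placeholder values): one row per (session, question),
# with the therapist's answer (m1) and the observer's answer (m2).
df = pd.DataFrame({
    "session": [1, 1, 1, 2, 2, 2],
    "q":       [1, 2, 3, 1, 2, 3],
    "m1":      ["y", "n", "y", "y", "y", "n"],
    "m2":      ["y", "n", "n", "y", "n", "n"],
})

# Treat the answers as categoricals so empty cells are kept as zero counts.
for col in ("m1", "m2"):
    df[col] = pd.Categorical(df[col], categories=["n", "y"])

# Collapse to the 2 x 2 x Q contingency table: one count per (q, m1, m2) cell.
counts = (
    df.groupby(["q", "m1", "m2"], observed=False)
      .size()
      .reset_index(name="count")
)
print(counts)
```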

Assuming independence of the two methods, you would fit the counts with a log-linear model. However, this would likely fit poorly, because we expect the raters to agree much of the time (they are listening to the same patient, after all, and are presumably qualified to fill out the checklist). A quasi-independence model adds extra terms that effectively zero out the residuals on the main diagonal of the contingency table, i.e., it fits the cells where the raters agree perfectly. This means that only the discordant results (e.g., M1 = y, M2 = n) influence the model fit. From there you can determine whether the therapist or the observer is more prone to answering "y" or "n", and if so on which question(s).
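
Here is a sketch of that fit with statsmodels, continuing from the hypothetical `counts` frame above (your real table would have 30 questions rather than 3, so with the toy rows this is purely illustrative). The `agree` indicator flags the main-diagonal cells; note this uses a single shared agreement term, whereas Agresti's quasi-independence model gives each diagonal cell its own parameter:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Flag the main-diagonal (agreement) cells: m1 == m2.
counts["agree"] = (counts["m1"] == counts["m2"]).astype(int)

# Baseline: independence of the two methods, stratified by question.
indep = smf.glm("count ~ C(q) + C(m1) + C(m2)",
                data=counts, family=sm.families.Poisson()).fit()

# Quasi-independence: the extra agreement term absorbs the diagonal cells,
# so only the discordant cells determine how well the model fits.
quasi = smf.glm("count ~ C(q) + C(m1) + C(m2) + agree",
                data=counts, family=sm.families.Poisson()).fit()

print("independence deviance:      ", indep.deviance)
print("quasi-independence deviance:", quasi.deviance)
print(quasi.summary())

# Interactions such as C(q):C(m1) could then probe question-specific
# tendencies of the therapist vs. the observer.
```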

I believe the default error distribution for log-linear models in most software is Poisson, but you can also try a quasi-Poisson or negative binomial model to add a dispersion parameter.
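
For completeness, sketches of those two alternatives in the same statsmodels setup (same formula and hypothetical `counts` frame as above); note that the GLM negative binomial family keeps its shape parameter fixed rather than estimating it:

```python
# Quasi-Poisson: keep the Poisson fit, but estimate a dispersion (scale)
# parameter from the Pearson chi-square instead of fixing it at 1.
quasi_pois = smf.glm("count ~ C(q) + C(m1) + C(m2) + agree", data=counts,
                     family=sm.families.Poisson()).fit(scale="X2")
print("estimated dispersion:", quasi_pois.scale)

# Negative binomial: over-dispersion enters through the variance function
# (alpha defaults to 1.0 here and is not estimated by GLM).
negbin = smf.glm("count ~ C(q) + C(m1) + C(m2) + agree", data=counts,
                 family=sm.families.NegativeBinomial()).fit()
print(negbin.summary())
```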

Note: you won't be able to assess the results of a particular therapist or observer with this structure, but if you have a broad enough sample of therapists and observers and they are randomly paired this should not be a major issue (relative to the limitations of your study).

Agresti's "An Introduction to Categorical Data Analysis" has a section on this topic (Section 8.5.2 in the second edition).

PSU has a section on quasi-independence (QI) models as well.
