I assume that A through D are different symptoms, say, and that 1 and 2 are the two raters. As you tagged this with Stata, I will build a Stata example. Let us first simulate some data: we have a number of subjects with two uncorrelated traits, and a battery of questions tapping into these traits. The two raters have different sensitivities to each of the traits: the first rater is a tad more likely than the second to give a positive answer on question A, but slightly less likely to give a positive answer on question B, etc.
clear
set seed 10101
set obs 200
* generate orthogonal individual traits
generate trait1 = rnormal()
generate trait2 = rnormal()
* raters' intercepts (thresholds) for the individual questions
local q1list 0.3 0.7 -0.2 -0.4
local q2list 0.5 0.5 0 -0.5
* prefixes
local letters a b c d
forvalues k = 1/4 {
    local thisletter : word `k' of `letters'
    local rater1 : word `k' of `q1list'
    local rater2 : word `k' of `q2list'
    generate byte `thisletter'1 = ( `k'/3*trait1 + (3-`k')/5*trait2 + 0.3*rnormal() > `rater1' )
    generate byte `thisletter'2 = ( `k'/3*trait1 + (3-`k')/5*trait2 + 0.3*rnormal() > `rater2' )
}
This should produce something like
. list a1-d2 in 1/5, noobs
+---------------------------------------+
| a1 a2 b1 b2 c1 c2 d1 d2 |
|---------------------------------------|
| 1 1 0 0 1 0 1 1 |
| 0 0 0 0 0 1 1 1 |
| 0 0 0 0 0 0 0 0 |
| 1 0 0 1 1 1 1 1 |
| 0 0 0 1 1 1 1 1 |
+---------------------------------------+
which I hope resembles your data, at least in terms of the existing variables.
A fully non-parametric summary of the inter-rater agreement can be constructed by converting each rater's binary answer pattern into a decimal number. The outcome a1=0, b1=0, c1=0, d1=0 is 0000b = 0; the outcome in the first observation above is 1011b = 11, etc. Let us produce this encoding:
generate int pattern1 = 0
generate int pattern2 = 0
forvalues k = 1/4 {
    local thisletter : word `k' of `letters'
    replace pattern1 = pattern1 + `thisletter'1 * 2^(4-`k')
    replace pattern2 = pattern2 + `thisletter'2 * 2^(4-`k')
}
tab pattern*
This should produce something like
. list a1-d2 pat* in 1/5, noobs
+-------------------------------------------------------------+
| a1 a2 b1 b2 c1 c2 d1 d2 pattern1 pattern2 |
|-------------------------------------------------------------|
| 1 1 0 0 1 0 1 1 11 9 |
| 0 0 0 0 0 1 1 1 1 3 |
| 0 0 0 0 0 0 0 0 0 0 |
| 1 0 0 1 1 1 1 1 11 7 |
| 0 0 0 1 1 1 1 1 3 7 |
+-------------------------------------------------------------+
Now, these patterns are perfectly comparable using kap:
. kap pattern1 pattern2
Agreement Exp.Agrmt Kappa Std. Err. Z Prob>Z
-----------------------------------------------------------------
54.00% 17.91% 0.4396 0.0308 14.25 0.0000
You can play with the sample size or with the differences between raters to produce a non-significant answer :). This kappa suffers from a serious drawback: it gives no credit for partial agreement. The patterns 0001 and 0000 agree on three of the four items, yet they count as a complete mismatch in this approach. It is therefore an extremely conservative measure of inter-rater agreement.
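If you want a quick look at that partial agreement, one simple (if ad hoc) complement is to run kap item by item and to count matching items per subject. A minimal sketch; the variable nmatch is a new name I am introducing just for this check:
* per-item kappas, one question at a time
foreach l in a b c d {
    kap `l'1 `l'2
}
* share of the four items on which the two raters agree, per subject
generate byte nmatch = (a1==a2) + (b1==b2) + (c1==c2) + (d1==d2)
summarize nmatch
display "average proportion of matching items: " r(mean)/4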
To get fair estimates of all the ICCs, you would need to run a cross-classified mixed model. Let us first reshape the data to make it possible:
generate long id = _n
* reshape the raters
reshape long a b c d , i(id) j(rater 1 2)
* reshape the items
forvalues k = 1/4 {
    local thisletter : word `k' of `letters'
    rename `thisletter' q`k'
}
reshape long q , i(id rater) j(item 1 2 3 4)
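Optional, but cheap: a quick sanity check that the reshape produced the intended long layout (one row per subject-rater-item combination):
* each subject-rater-item combination should occur exactly once
isid id rater item
* expect 200 subjects x 2 raters x 4 items = 1600 rows
tabulate item rater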
Now, we can run xtmelogit (or gllamm, if you like it better) on these data:
. xtmelogit q || _all : R.rater || _all: R.item || _all: R.id, nolog
Note: factor variables specified; option laplace assumed
Mixed-effects logistic regression Number of obs = 1600
Group variable: _all Number of groups = 1
Obs per group:
min = 1600
avg = 1600.0
max = 1600
Integration points = 1 Wald chi2(0) = .
Log likelihood = -697.55526 Prob > chi2 = .
------------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -.7795316 .9384147 -0.83 0.406 -2.618791 1.059727
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
_all: Identity |
sd(R.rater) | .1407056 .1627763 .0145745 1.358408
-----------------------------+------------------------------------------------
_all: Identity |
sd(R.item) | 1.797133 .6461083 .8882897 3.635847
-----------------------------+------------------------------------------------
_all: Identity |
sd(R.id) | 3.18933 .2673165 2.706171 3.758751
------------------------------------------------------------------------------
LR test vs. logistic regression: chi2(3) = 793.71 Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.
Note: log-likelihood calculations are based on the Laplacian approximation.
This is a cross-classified model with three random effects: subjects, raters, and items, assumed to be uncorrelated (which is wrong for these data; see below). Let us now estimate the ICCs.
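On the latent logistic scale, each ICC is one variance component divided by the total variance, where the logistic residual contributes pi^2/3 ≈ 3.29; for the subjects, for instance, ICC(id) = Var(id) / ( Var(rater) + Var(item) + Var(id) + pi^2/3 ). The exp(2*_b[...]) expressions below simply convert the estimated log standard deviations back into variances: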
. local Vrater ( exp(2*_b[lns1_1_1:_cons]) )
. local Vitem ( exp(2*_b[lns1_2_1:_cons]) )
. local Vid ( exp(2*_b[lns1_3_1:_cons]) )
. nlcom `Vrater' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )
-----------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+----------------------------------------------------------------
_nl_1 | .0011847 .0027384 0.43 0.665 -.0041824 .0065519
-----------------------------------------------------------------------
. nlcom `Vid' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )
------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+----------------------------------------------------------------
_nl_1 | .6086839 .0903816 6.73 0.000 .4315393 .7858285
------------------------------------------------------------------------
. nlcom `Vitem' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )
-----------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+----------------------------------------------------------------
_nl_1 | .193265 .1121376 1.72 0.085 -.0265206 .4130506
-----------------------------------------------------------------------
(Hint: I figured out the names of the parameters with matrix list e(b).)
These are the ICCs corresponding to raters, subjects, and items, respectively. The essentially zero ICC of the raters actually makes sense given how the data were generated: there is no systematic effect in the sense of one rater consistently rating the condition higher or lower than the other across all items. There is, however, an interaction between rater and item, which this model does not reflect. A specification truer to the generating process would be something like
xtmelogit q ibn.item##ibn.rater, nocons || id:
With this specification, you would have to assemble the ICCs from an even more complicated mix of the variance components and the point estimates from the fixed-effects part of the model.
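The subject-level piece is still simple, though. After fitting the model above, something along these lines should work (a sketch only: the parameter name lns1_1_1:_cons is my assumption and should be verified against matrix list e(b); the rater and item pieces would additionally involve the fixed-effect estimates):
* subject variance over total latent-scale variance
* (parameter name assumed; verify with: matrix list e(b))
local Vid ( exp(2*_b[lns1_1_1:_cons]) )
nlcom `Vid' / ( `Vid' + _pi*_pi/3 )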
If you have the patience (or a powerful computer), you can specify intp(7) or something like that to get an approximation more accurate than the Laplace approximation (which uses a single point at the mode of the distribution of the random effects).
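In full, that would be something along these lines (the same model as above, just slower; intpoints() is the unabbreviated option name):
xtmelogit q || _all: R.rater || _all: R.item || _all: R.id, intpoints(7) nolog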
Best Answer
I don't have anything specific to say about Kendall's W, but I don't get this concern about the ICC, the F-test, and the sample size.
Your sample is not so small that testing would necessarily be impossible, but why would you want to do such a test? To see if agreement is different from 0? That is quite a low bar, and it should be evident from the data. If you have doubts about that, these ratings certainly don't form a good measure of anything the raters agree on, so worrying about which specific measure of inter-rater agreement you are using, and the niceties of the relevant tests, would not really be your main concern.
On the other hand, anything you compute on a sample this small will obviously be subject to a lot of sampling variability and uncertainty. That is a rather basic fact that has nothing to do with the ICC or the F-test specifically, and there is no miracle inter-rater agreement index that would let you get around it.
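If you want to see that variability directly rather than reason about it in the abstract, bootstrapping whatever index you settle on is one blunt way to do it. A minimal sketch, using kappa for concreteness; rating1 and rating2 are hypothetical placeholders for your two raters' scores:
* bootstrap the kappa statistic to display its sampling variability
* (variable names are placeholders for your own data)
bootstrap kappa = r(kappa), reps(1000) seed(12345): kap rating1 rating2
estat bootstrap, percentile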
At the end of the day, I think the underlying issue is that you seem to be asking many rather abstract questions in search of the “true” inter-rater agreement and some sort of pass/fail test that would tell you whether it is “good enough”. Such a thing simply does not exist, in my opinion, and published thresholds are really quite arbitrary. Instead of trying to interpret every bit of advice recommending one index or another, I think it could be more fruitful to read broadly about inter-rater agreement measures (see the references provided in other questions on this topic) and think about what each of them reveals about your data, rather than focusing solely on whether agreement is “good” or not.