Kendall’s coefficient of concordance (W) for ratings with a lot of ties

agreement-statistics, f-test, likert, nonparametric, small-sample

I have read that Kendall's W should be avoided for data that are not true rankings, especially for rating scales, which tend to produce a lot of ties.
Yet posts here seem to suggest it for ratings, as stated in this post.
I have a small study of 21 respondents, who rated some items from 0 to 5, with 0 being unimportant and 5 being very important, and I'm looking for measures of agreement for specific respondents. I am not looking for absolute agreement.

Whilst ICC was suggested as a possible solution, there is an issue with the use of the F test in this case, given the small number of respondents.

What are your views on Kendall's W in this case?
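
For concreteness, here is a rough Python/NumPy sketch of how I would compute W with the standard tie correction for data shaped like mine (the ratings below are simulated, not my actual data):

    import numpy as np
    from scipy.stats import rankdata

    def kendalls_w(ratings):
        """Tie-corrected Kendall's W for an (m raters, n items) score matrix."""
        m, n = ratings.shape
        ranks = np.apply_along_axis(rankdata, 1, ratings)   # rank items within each rater
        col_sums = ranks.sum(axis=0)                         # total rank received by each item
        s = ((col_sums - col_sums.mean()) ** 2).sum()        # spread of the rank totals

        # Tie correction: for each rater, sum (t^3 - t) over groups of tied ranks
        t = 0.0
        for row in ranks:
            counts = np.unique(row, return_counts=True)[1]
            t += ((counts ** 3) - counts).sum()

        return 12 * s / (m ** 2 * (n ** 3 - n) - m * t)

    # Simulated stand-in for my data: 21 respondents, 8 items, 0-5 scale (lots of ties)
    rng = np.random.default_rng(0)
    ratings = rng.integers(0, 6, size=(21, 8))
    print(f"Kendall's W = {kendalls_w(ratings):.3f}")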

Best Answer

I don't have anything specific to say about Kendall's W, but I don't get this concern about the ICC, the F test and the sample size.

Your sample is not so small that testing would necessarily be impossible, but why would you want to do such a test? To see if agreement is different from 0? That is quite a low bar and should be evident from the data. If you have doubts about that, these ratings certainly don't form a good measure of anything the raters agree on, so worrying about which specific measure of inter-rater agreement you are using, and the niceties of the relevant tests, would not really be your main concern.
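
(If you do decide you want that test anyway, most ICC routines report it as a by-product. Purely as an illustration, here is a sketch using Python's pingouin package on simulated ratings; pingouin is my choice for the example, not something you mentioned, so adapt it to whatever software you actually use.)

    import numpy as np
    import pandas as pd
    import pingouin as pg

    # Simulated stand-in: 21 raters x 8 items, scores 0-5
    rng = np.random.default_rng(0)
    wide = pd.DataFrame(rng.integers(0, 6, size=(21, 8)),
                        columns=[f"item{j}" for j in range(8)])

    # Long format (one row per rater-item-score), which is what pingouin expects
    df_long = (wide.rename_axis("rater").reset_index()
                   .melt(id_vars="rater", var_name="item", value_name="score"))

    # The ICC table includes F statistics and p-values alongside the estimates
    print(pg.intraclass_corr(data=df_long, targets="item",
                             raters="rater", ratings="score"))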

On the other hand, anything you compute on a sample this small will obviously be subject to a lot of sampling variability and uncertainty. That is a rather basic fact that has nothing to do with the ICC or the F test specifically, and there is no miracle inter-rater agreement index that would let you get around it.
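
To make that concrete, you can resample your 21 raters and watch how much any index moves around. Here is a rough bootstrap sketch for Kendall's W, again on simulated data and using the same tie-corrected helper as in your sketch above; don't read the interval too literally, it is only meant to show the spread:

    import numpy as np
    from scipy.stats import rankdata

    def kendalls_w(ratings):
        # Same tie-corrected W as in the sketch under the question
        m, n = ratings.shape
        ranks = np.apply_along_axis(rankdata, 1, ratings)
        s = ((ranks.sum(axis=0) - m * (n + 1) / 2) ** 2).sum()
        t = 0.0
        for row in ranks:
            counts = np.unique(row, return_counts=True)[1]
            t += ((counts ** 3) - counts).sum()
        return 12 * s / (m ** 2 * (n ** 3 - n) - m * t)

    rng = np.random.default_rng(1)
    ratings = rng.integers(0, 6, size=(21, 8))   # simulated: 21 raters, 8 items, 0-5 scale

    # Resample the 21 raters with replacement and recompute W each time,
    # purely to visualise how much the estimate moves around at this sample size
    boots = np.array([kendalls_w(ratings[rng.integers(0, 21, size=21)])
                      for _ in range(2000)])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    print(f"W = {kendalls_w(ratings):.3f}, "
          f"rough 95% bootstrap interval [{lo:.3f}, {hi:.3f}]")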

At the end of the day, I think the underlying issue is that you seem to be asking many rather abstract questions in search of the “true” inter-rater agreement and some sort of pass/fail test that would tell you whether it is “good enough”. Such a thing simply does not exist in my opinion, and published thresholds are really quite arbitrary. Instead of trying to interpret every bit of advice recommending one index or another, I think it would be more fruitful to read broadly about inter-rater agreement measures (see the references provided in other questions on this topic) and think about what each of them reveals about your data, rather than focusing solely on whether agreement is “good” or not.
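
As one example of looking at the same data through two lenses, the average pairwise Spearman correlation between raters is closely related to W (exactly so when there are no ties) but is often easier to reason about. A small sketch, again on simulated ratings:

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    ratings = rng.integers(0, 6, size=(21, 8))   # simulated: 21 raters, 8 items, 0-5 scale
    m = ratings.shape[0]

    # Pairwise Spearman correlations between raters (items are the observations)
    rho, _ = spearmanr(ratings.T)                # (21 x 21) correlation matrix
    mean_rho = rho[np.triu_indices(m, k=1)].mean()

    # Without ties, mean pairwise rho and W are linked by W = ((m - 1) * rho_bar + 1) / m;
    # with a 0-5 scale and many ties, treat the implied W only as an approximation
    print(f"mean pairwise Spearman rho = {mean_rho:.3f}")
    print(f"implied W (no-ties identity) = {((m - 1) * mean_rho + 1) / m:.3f}")

Looking at the full matrix of pairwise correlations, rather than just the average, also tells you whether disagreement is spread evenly across respondents or driven by one or two of them.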
