My question is about how to calculate the inter-(intra-)class correlation coefficient (ICC) or the intra-(inter-)concordance coefficient (CCC), ideally in Python. My dataset consists of several dozen subjects. For each subject, a feature was calculated using three different algorithms, and each algorithm was run three times. So I have 3×3=9 measurements per subject. There is a lot of information on this forum about ICCs, but it is not clear to me which ICC I should use. Are there any Python libraries for this?
Solved – Inter-(-intra) class correlation coefficient or intra-(inter-) concordance coefficient in python
concordance, intraclass-correlation, python
Related Solutions
What you describe about Tukey's nonadditivity test sounds good to me. In effect, it lets you test for an item-by-rater interaction. Some words of caution, though:
- Tukey's nonadditivity test only tests for a linear-by-linear interaction, that is, the product of the two factors' main effects.
- The possibility of deriving a total score is irrelevant here, as this particular Tukey test can be applied in any randomized block design, as described in the Stata FAQ, for example.
- It applies in situations where you have a single observation per cell, that is, each rater assesses each item only once (no replicates).
You might recall that the interaction term is confounded with the error term when there are no replicates in an ANOVA design; in inter-rater studies, this means we have only one rating for each rater x item cell. In this case, Tukey's test provides a 1-DF test for assessing any deviation from additivity, which is a common assumption when interpreting a main effect in two-factor models. Here is a tutorial describing how it works.
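To make the 1-DF test concrete, here is a minimal NumPy sketch of Tukey's nonadditivity test for an items x raters table with one rating per cell. The function name `tukey_nonadditivity` and the example ratings are illustrative assumptions, not something taken from the tutorial mentioned above.

```python
import numpy as np

def tukey_nonadditivity(y):
    """Tukey's 1-DF test for nonadditivity on a two-way table
    (rows = items, columns = raters, one observation per cell)."""
    y = np.asarray(y, dtype=float)
    t, b = y.shape
    grand = y.mean()
    row_eff = y.mean(axis=1) - grand          # item main effects
    col_eff = y.mean(axis=0) - grand          # rater main effects
    # Residuals from the additive model (interaction + error, confounded)
    resid = y - grand - row_eff[:, None] - col_eff[None, :]
    ss_resid = (resid ** 2).sum()
    # Sum of squares for the linear-by-linear (product) interaction term
    ss_nonadd = (row_eff @ y @ col_eff) ** 2 / (
        (row_eff ** 2).sum() * (col_eff ** 2).sum()
    )
    df_rem = (t - 1) * (b - 1) - 1            # remaining error DF
    f_stat = ss_nonadd / ((ss_resid - ss_nonadd) / df_rem)
    return {"ss_nonadd": ss_nonadd, "ss_resid": ss_resid,
            "F": f_stat, "df": (1, df_rem)}

# Illustrative data: 6 items rated once by each of 4 raters
ratings = [[9, 2, 5, 8],
           [6, 1, 3, 2],
           [8, 4, 6, 8],
           [7, 1, 2, 6],
           [10, 5, 6, 9],
           [6, 2, 4, 7]]
result = tukey_nonadditivity(ratings)
print(result)
```

The F statistic is referred to an F distribution with the two degrees of freedom in `df`; if SciPy is available, `scipy.stats.f.sf(result["F"], *result["df"])` should give the corresponding p-value.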
I must admit I never used it when computing ICCs, and I spent some time trying to reproduce Dave Garson's results with R. This led me to the following papers, which show that Tukey's nonadditivity test might not be the "best" test to use, as it will fail to recover a true interaction effect (e.g., where some raters exhibit an opposite rating behavior compared to the rest of the raters) when there's no main effect of the target of the ratings (e.g., marks given to items):
1. Lahey, M.A., Downey, R.G., and Saal, F.E. (1983). Intraclass Correlations: There's More There Than Meets the Eye. Psychological Bulletin, 93(3), 586-595.
2. Johnson, D.E. and Graybill, F.A. (1972). An analysis of a two-way model with interaction and no replication. Journal of the American Statistical Association, 67, 862-868.
3. Hegemann, V. and Johnson, D.E. (1976). The power of two tests for nonadditivity. Journal of the American Statistical Association, 71(356), 945-948.
(I'm very sorry but I couldn't find ungated PDF versions of those papers. The first one is really a must-read.)
About your particular design, you considered raters as fixed effects (hence the use of Shrout and Fleiss's type 3 ICC, i.e. the mixed model approach). In this case, Lahey et al. (1) state that you face a situation of nonorthogonal interaction components (i.e., the interaction is not independent of the other effects) and a biased estimate of the rating effect. This, however, holds for the case where you have a single observation per cell (ICC(3,1)). With multiple ratings per item, estimating ICC(3,k) requires the "assumption of nonsignificance of the interaction. In this case, the ANOVA effects are neither theoretically nor mathematically independent, and without adequate justification, the assumption of no interaction is very tenuous."
In other words, such an interaction test aims at offering you diagnostic information. My opinion is that you can go on with your ICC, but be sure to check that (a) there's a significant effect for the target of ratings (otherwise, it would mean the reliability of measurements is low), and (b) no rater systematically deviates from the others' ratings (this can be done graphically, or based on the residuals of your ANOVA model).
More technical details are given below.
The alternative test that is proposed is called the characteristic root test of the interaction (2,3). Consider a multiplicative interaction model of the following form (written as an effects model, that is, with parameters that summarize deviations from the grand mean):
$$y_{ij}=\mu + \tau_i + \beta_j + \lambda\alpha_i\gamma_j + \varepsilon_{ij}$$
with $\tau_i$ ($i=1,\dots,t$) the effect due to targets/items, $\beta_j$ ($j=1,\dots,b$) the effect of raters, $\lambda\alpha_i\gamma_j$ the targets x raters interaction, and the usual assumptions for the distribution of errors and the parameter constraints. We can compute the largest characteristic root of $Z'Z$ or $ZZ'$, where $Z=(z_{ij})$ with $z_{ij}=y_{ij}-\bar{y}_{i\cdot}-\bar{y}_{\cdot j}+\bar{y}_{\cdot\cdot}$ is the $t \times b$ matrix of residuals from an additive model.
The test then relies on the idea of using $\lambda_1/\text{RSS}$ as a test statistic ($H_0:\, \lambda=0$), where $\lambda_1$ is the largest nonzero characteristic root of $ZZ'$ (or $Z'Z$), not to be confused with the interaction parameter $\lambda$, and RSS is the residual sum of squares from an additive model (2).
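The statistic itself is easy to compute. Below is a minimal NumPy sketch under the definitions above; the function name `characteristic_root_statistic` and the toy data are assumptions made for illustration. The critical values against which the statistic must be compared come from the cited papers (2,3) and are not computed here.

```python
import numpy as np

def characteristic_root_statistic(y):
    """Return lambda_1 / RSS for a t x b table with one observation per cell."""
    y = np.asarray(y, dtype=float)
    # Residuals from the additive model: y_ij - rowmean_i - colmean_j + grandmean
    z = (y
         - y.mean(axis=1, keepdims=True)
         - y.mean(axis=0, keepdims=True)
         + y.mean())
    rss = (z ** 2).sum()  # equals the trace of Z'Z; zero for a perfectly additive table
    # The squared largest singular value of Z is the largest root of Z'Z (and ZZ')
    lambda_1 = np.linalg.svd(z, compute_uv=False)[0] ** 2
    return lambda_1 / rss

# Toy table with an exactly multiplicative (rank-one) interaction, no noise
a = np.array([-1.0, 0.0, 1.0])
c = np.array([-1.0, 0.0, 1.0])
y_mult = 5.0 + a[:, None] + c[None, :] + 2.0 * np.outer(a, c)
stat = characteristic_root_statistic(y_mult)
print(stat)
```

Because the residual matrix has rank at most $\min(t-1, b-1)$, the statistic always lies between $1/\min(t-1, b-1)$ and 1; values near 1 indicate that a single multiplicative component accounts for almost all of the residual variation.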
1. Is the model appropriate for the design of the study?
I think not. The issue is that you have only 2 raters. You are asking the software to estimate the variance of a normally distributed variable using only 2 observations, so any estimate of a variance for this variable, and any statistic that uses it, should be highly suspect.
2. Is the formula for the ICC appropriate, or should $\sigma_{\mathrm{ID:Day}}^{2}$ be omitted from the numerator and hence be treated as error-variance?
Yes, I think your formula is appropriate.
3. How would the model and the formula change if I would consider the 2 raters as fixed in the sense that they are the only two raters I would ever consider (i.e. they weren't selected from an infinite population of possible raters)?
In light of my answer to 1. above, I think you should take this approach anyway. Whether they can be considered samples from a large population is only one of the considerations in choosing whether to model a factor as fixed or random.
The formula then becomes:
$$ \mathrm{ICC}_{\mathrm{inter-rater}} = \frac{\sigma_{\mathrm{ID}}^{2} + \sigma_{\mathrm{ID:Day}}^{2}}{\sigma_{\mathrm{ID}}^{2} + \sigma_{\mathrm{ID:Day}}^{2} + \sigma_{\mathrm{Residual}}^{2}} $$
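As a small worked example of this formula, the sketch below plugs in variance components. The function name and the numeric values are hypothetical; in practice the components would come from the fitted mixed model.

```python
def icc_inter_rater(var_id, var_id_day, var_residual):
    """ICC from variance components, with raters treated as fixed."""
    between = var_id + var_id_day          # numerator: subject + subject-by-day variance
    return between / (between + var_residual)

# Hypothetical variance components, for illustration only
icc = icc_inter_rater(var_id=4.0, var_id_day=1.0, var_residual=5.0)
print(icc)  # 0.5
```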
Best Answer
I see that no one has answered in 3 years, so the question may no longer be relevant to the asker.
But I would like to provide an answer for future users.
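Firstly, in Python you can use the pingouin package (its intraclass_corr function).
It returns a DataFrame with several ICC variants (ICC1, ICC2, ICC3, plus their average-measure counterparts ICC1k, ICC2k, ICC3k). Just read the docs and choose the one that matches your design.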
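With pingouin installed and the data in long format, the call would look something like `pingouin.intraclass_corr(data=df, targets="subject", raters="algorithm", ratings="value")`, where the column names are hypothetical. To show what two of the returned rows correspond to, here is a from-scratch NumPy sketch of the single-measure ICC2 and ICC3 (Shrout and Fleiss's ICC(2,1) and ICC(3,1)) computed from the two-way ANOVA mean squares. It assumes a complete subjects x raters matrix with one value per cell; for the design in the question, treating each algorithm as a "rater" and averaging its three repeats first is an assumption made here purely for illustration.

```python
import numpy as np

def icc_two_way(y):
    """Single-measure ICC(2,1) and ICC(3,1) from an n x k matrix
    (rows = targets/subjects, columns = raters), one value per cell."""
    y = np.asarray(y, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((y - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc2 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # raters random
    icc3 = (msr - mse) / (msr + (k - 1) * mse)                        # raters fixed
    return {"ICC2": icc2, "ICC3": icc3}

# Small example ratings matrix: 6 targets, 4 raters
ratings = [[9, 2, 5, 8],
           [6, 1, 3, 2],
           [8, 4, 6, 8],
           [7, 1, 2, 6],
           [10, 5, 6, 9],
           [6, 2, 4, 7]]
icc = icc_two_way(ratings)
print(icc)
```

On this example ICC3 is much larger than ICC2 because the raters differ strongly in their mean level: ICC2 penalizes those systematic differences (absolute agreement), while ICC3 ignores them (consistency).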
Secondly, you can do the same thing in R.
Don't forget to read the package docs.