Check out Krippendorff's alpha. It has several advantages over other measures such as Cohen's kappa, Fleiss' kappa, and Cronbach's alpha: it is robust to missing data (which I gather is your main concern); it can handle more than two raters; it works with different types of scales (nominal, ordinal, etc.); and it accounts for chance agreement better than some other measures, such as Cohen's kappa.
Krippendorff's alpha can be calculated in several statistical software packages, including R (via the irr package) and SPSS.
Below are some relevant papers that discuss Krippendorff's alpha, including its properties and implementation, and compare it with other measures:
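For the R route, a minimal sketch with `kripp.alpha()` from the irr package might look like the following. The ratings matrix is made up for illustration (three hypothetical raters, eight items, `NA` where a rater skipped an item), and it assumes irr is installed.

```r
library(irr)  # assumed to be installed: install.packages("irr")

# Hypothetical ratings: 3 raters (rows) x 8 items (columns); NA = item not rated.
ratings <- rbind(
  rater_a = c(1, 2, 3, 3, 2, 1, NA, 2),
  rater_b = c(1, 2, 3, 3, 2, 2, 3, NA),
  rater_c = c(NA, 2, 3, 3, 2, 1, 3, 2)
)

# kripp.alpha() expects raters in rows and rated items in columns.
# The method argument selects the scale type.
alpha_nominal <- kripp.alpha(ratings, method = "nominal")
alpha_ordinal <- kripp.alpha(ratings, method = "ordinal")

alpha_nominal$value  # the alpha estimate when categories are unordered
alpha_ordinal$value  # the alpha estimate when categories are ordered
```

Missing ratings stay in the matrix as `NA`; there is no need to drop incompletely rated items first.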
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research, 30(3), 411-433. doi: 10.1111/j.1468-2958.2004.tb00738.x
Chapter 3 in Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology (3rd ed.). Sage.
There are some additional technical papers on Krippendorff's website.
I don't know which of the two ways of calculating the variance is preferable, but I can give you a third, practical and useful way to calculate confidence/credible intervals: Bayesian estimation of Cohen's Kappa.
The R and JAGS code below generates MCMC samples from the posterior distribution of the credible values of Kappa given the data.
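For orientation, the quantity the model tracks is the usual form of Cohen's Kappa. The notation here is mine, not from the code: $p_a$ stands for `p_agreement`, $p_c$ for `chance_agreement`, $p_{1k}$ and $p_{2k}$ for the entries of `p1` and `p2`, and $K$ for `n_categories`.

```latex
\kappa = \frac{p_a - p_c}{1 - p_c},
\qquad
p_c = \sum_{k=1}^{K} p_{1k}\, p_{2k}
```

Every MCMC draw of $p_a$, $p_1$ and $p_2$ therefore gives one draw of $\kappa$.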
library(rjags)
library(coda)
library(psych)
# Creating some mock data
rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3)
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1)
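# 1 if the raters agree on an item, 0 otherwise (numeric, so it is passed to JAGS as data)
agreement <- as.numeric(rater1 == rater2)
n_categories <- 3
n_ratings <- length(rater1)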
# The JAGS model definition, should work in WinBUGS with minimal modification
kohen_model_string <- "model {
  kappa <- (p_agreement - chance_agreement) / (1 - chance_agreement)
  chance_agreement <- sum(p1 * p2)

  for(i in 1:n_ratings) {
    rater1[i] ~ dcat(p1)
    rater2[i] ~ dcat(p2)
    agreement[i] ~ dbern(p_agreement)
  }

  # Uniform priors on all parameters
  p1 ~ ddirch(alpha)
  p2 ~ ddirch(alpha)
  p_agreement ~ dbeta(1, 1)
  for(cat_i in 1:n_categories) {
    alpha[cat_i] <- 1
  }
}"
# Running the model
kohen_model <- jags.model(file = textConnection(kohen_model_string),
                          data = list(rater1 = rater1, rater2 = rater2,
                                      agreement = agreement,
                                      n_categories = n_categories,
                                      n_ratings = n_ratings),
                          n.chains = 1, n.adapt = 1000)
# Burn-in: discard the first 10000 iterations
update(kohen_model, 10000)
mcmc_samples <- coda.samples(kohen_model, variable.names = "kappa", n.iter = 20000)
The plot below shows a density plot of the MCMC samples from the posterior distribution of Kappa.
Using the MCMC samples, we can now take the median as an estimate of Kappa and the 2.5% and 97.5% quantiles as a 95% confidence/credible interval.
summary(mcmc_samples)$quantiles
## 2.5% 25% 50% 75% 97.5%
## 0.01688361 0.26103573 0.38753814 0.50757431 0.70288890
Compare this with the "classical" estimates calculated according to Fleiss, Cohen and Everitt:
cohen.kappa(cbind(rater1, rater2), alpha=0.05)
## lower estimate upper
## unweighted kappa 0.041 0.40 0.76
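As a sanity check on the point estimate, Kappa can also be computed by hand in base R from the observed agreement and the chance agreement implied by the raters' marginal proportions. A minimal sketch using the same mock data (the variable names are mine):

```r
rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3)
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1)
categories <- 1:3

# Observed agreement: share of items on which both raters gave the same rating.
p_observed <- mean(rater1 == rater2)

# Marginal proportion of each category, per rater.
p1 <- as.numeric(table(factor(rater1, levels = categories))) / length(rater1)
p2 <- as.numeric(table(factor(rater2, levels = categories))) / length(rater2)

# Chance agreement: probability of a match if the raters rated independently.
p_chance <- sum(p1 * p2)

kappa_hat <- (p_observed - p_chance) / (1 - p_chance)
kappa_hat  # 0.4
```

Here the observed agreement is 9/15 and the chance agreement is 1/3, which gives the 0.40 reported by `cohen.kappa`.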
Personally, I would prefer the Bayesian credible interval over the classical confidence interval, especially since I believe the Bayesian interval has better small-sample properties. A common concern with Bayesian analyses is that you have to specify prior beliefs about the distributions of the parameters. Fortunately, in this case it is easy to construct "objective" priors by simply putting uniform distributions over all the parameters. This should make the outcome of the Bayesian model very similar to a "classical" calculation of the Kappa coefficient.
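A side note on these uniform priors: they are conjugate, and in the model above `rater1`, `rater2` and `agreement` each depend on their own parameter only, so the posterior factorises into two Dirichlet distributions and one Beta distribution. Under that reading, the same posterior can be sampled directly in base R without JAGS. This is a sketch, not part of the original code; the helper `rdirichlet`, the seed and the number of draws are my own choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3)
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1)
n_categories <- 3
n_draws <- 20000

# Illustrative helper: n draws from Dirichlet(alpha), one draw per row,
# obtained by normalising independent Gamma variates.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Sufficient statistics of the data.
counts1 <- tabulate(rater1, nbins = n_categories)
counts2 <- tabulate(rater2, nbins = n_categories)
n_agree <- sum(rater1 == rater2)
n_disagree <- length(rater1) - n_agree

# Conjugate updates: Dirichlet(1, ..., 1) and Beta(1, 1) priors plus counts.
p1 <- rdirichlet(n_draws, 1 + counts1)
p2 <- rdirichlet(n_draws, 1 + counts2)
p_agreement <- rbeta(n_draws, 1 + n_agree, 1 + n_disagree)

# Same deterministic relations as in the JAGS model.
chance_agreement <- rowSums(p1 * p2)
kappa_draws <- (p_agreement - chance_agreement) / (1 - chance_agreement)

quantile(kappa_draws, c(0.025, 0.5, 0.975))
```

The quantiles should land close to the MCMC ones above, up to Monte Carlo error.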
References
Basu, S., Banerjee, M., & Sen, A. (2000). Bayesian Inference for Kappa from Single and Multiple Studies. Biometrics, 56(2), 577-582.
Best Answer
The problem is that there is almost no variation among the raters and the tiny bit of variation that does exist is not in agreement. There are only two ratings that are not 16 and they are for different cases, so you get a negative kappa. That's a correct result. You may, however, want a different measure.
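To see how this happens, here is a small base-R sketch with made-up data that mimics the situation: both raters give 16 on nearly every case, and each deviates once, on a different case. The number of cases and the deviating values (15 and 17) are assumptions for illustration, not your actual data.

```r
# Hypothetical data: 20 cases, nearly all rated 16 by both raters.
rater_a <- rep(16, 20)
rater_b <- rep(16, 20)
rater_a[3]  <- 15  # the only deviation by rater A
rater_b[11] <- 17  # the only deviation by rater B, on a different case

categories <- sort(unique(c(rater_a, rater_b)))

# Observed agreement: 18 of the 20 cases match.
p_observed <- mean(rater_a == rater_b)

# Chance agreement from the marginals: dominated by the 16s (0.95 * 0.95).
p_a <- as.numeric(table(factor(rater_a, levels = categories))) / length(rater_a)
p_b <- as.numeric(table(factor(rater_b, levels = categories))) / length(rater_b)
p_chance <- sum(p_a * p_b)

kappa_hat <- (p_observed - p_chance) / (1 - p_chance)
kappa_hat  # slightly below zero
```

The observed agreement (0.90) is high, but the chance agreement implied by the marginals (0.9025) is slightly higher, so Kappa comes out just below zero.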