Solved – Cronbach’s Alpha with missing data

cronbachs-alphamissing datarreliability

In R, there are several packages that calculate Cronbach's alpha. If the data contain no missing values, all packages I visited converge to the same value. With missing data, listwise deletion is a possible way to go (the only option in SPSS or packages MBESS and psy btw). However, listwise deletion might lead to dropping a lot of data and therefore something like pairwise deletion might seem more appealing in some situations (let's say data are MCAR).
Some of these packages have an option like na.rm = T, but then the results differ, and they differ more the more missings a data set contains. The reason for this is the calculation of the variance-covariance matrix $S$. For example, the package coefficientalpha employs a robust $S$ matrix naturally leading to different results. However, other packages use pairwise $S$ matrices and still the results differ.
Let's make a specific example. The formula for alpha is
$$\alpha = \frac{p}{p-1}(1-\frac{\sum_{i=1}^p \sigma_{y_i}^2}{\sigma_x^2})$$
with the number of items $p$, the variance of the $i$th item $\sigma_{y_i}^2$ and the variance of the total score $\sigma_x^2$. For the total variance in the denominator, the package ltm uses something like var(rowSums(dat, na.rm = T)), whereas the package psych uses something like sum(cov(dat, use = "pairwise")). These two estimates are the same for complete data, but differ with missing data. (Package epicalc seems to use a third way of calculating the variance(s), but I have not yet mastered their code.) What is the reason for different calculations of $\sigma_x^2$ and when should I use which one?

> require("MASS", "psych", "ltm")
> n <- 200
> it <- 10
> V <- matrix(.4, ncol = it, nrow = it)
> diag(V) <- 1
> dat <- mvrnorm(n, rep(0, it), V)   # complete data
> m1 <- matrix(rbinom(n * it, 1, .1), nrow = n, ncol = it)   # 10% missings
> dat[m1 == 1] <- NA   # missing data
> alpha(as.data.frame(dat), na.rm = T)$total[[1]]   # psych package  
[1] 0.8595489
> cronbach.alpha(dat, na.rm = T)$alpha   # ltm package
[1] 0.8105867

Best Answer

If data are MCAR, one would like to find an unbiased estimated of alpha. This could possibly be done via multiple imputation or listwise deletion. However, the latter might lead to severe loss of data. A third way is something like pairwise deletion which is implemented via an na.rm option in alpha() of the ltm package and in cronbach.alpha() of the psych package.

At least IMHO, the former estimate of unstandardized alpha with missing data is biased (see below). This is due to the calculation of the total variance $\sigma^2_x$ via var(rowSums(dat, na.rm = TRUE)). If the data are centered around 0, positive and negative values cancel each other out in the calculation of rowSums. With missing data, this leads to a bias of rowSums towards 0 and therefore to an underestimation of $\sigma^2_x$ (and alpha, in turn). Contrarily, if the data are mostly positive (or negative), missings will lead to a bias of rowSums towards zero this time resulting in an overestimation of $\sigma^2_x$ (and alpha, in turn).

require("MASS"); require("ltm"); require("psych")
n <- 10000
it <- 20
V <- matrix(.4, ncol = it, nrow = it)
diag(V) <- 1
dat <- mvrnorm(n, rep(0, it), V)  # mean of 0!!!
p <- c(0, .1, .2, .3)
names(p) <- paste("% miss=", p, sep="")
cols <- c("alpha.ltm", "var.tot.ltm", "alpha.psych", "var.tot.psych")
names(cols) <- cols
res <- matrix(nrow = length(p), ncol = length(cols), dimnames = list(names(p), names(cols)))
for(i in 1:length(p)){
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
      res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))
      res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise"))
}
round(res, 2)
##            alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0        0.93      168.35        0.93        168.35
## % miss=0.1      0.90      138.21        0.93        168.32
## % miss=0.2      0.86      110.34        0.93        167.88
## % miss=0.3      0.81       86.26        0.93        167.41
dat <- mvrnorm(n, rep(10, it), V)  # this time, mean of 10!!!
for(i in 1:length(p)){
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
      res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))
      res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise"))
}
round(res, 2)
##            alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0        0.93      168.31        0.93        168.31
## % miss=0.1      0.99      316.27        0.93        168.60
## % miss=0.2      1.00      430.78        0.93        167.61
## % miss=0.3      1.01      511.30        0.93        167.43