In R
, there are several packages that calculate Cronbach's alpha. If the data contain no missing values, all packages I visited converge to the same value. With missing data, listwise deletion is a possible way to go (the only option in SPSS or packages MBESS
and psy
btw). However, listwise deletion might lead to dropping a lot of data and therefore something like pairwise deletion might seem more appealing in some situations (let's say data are MCAR).
Some of these packages have an option like na.rm = T
, but then the results differ, and they differ more the more missings a data set contains. The reason for this is the calculation of the variance-covariance matrix $S$. For example, the package coefficientalpha
employs a robust $S$ matrix naturally leading to different results. However, other packages use pairwise $S$ matrices and still the results differ.
Let's make a specific example. The formula for alpha is
$$\alpha = \frac{p}{p-1}(1-\frac{\sum_{i=1}^p \sigma_{y_i}^2}{\sigma_x^2})$$
with the number of items $p$, the variance of the $i$th item $\sigma_{y_i}^2$ and the variance of the total score $\sigma_x^2$. For the total variance in the denominator, the package ltm
uses something like var(rowSums(dat, na.rm = T))
, whereas the package psych
uses something like sum(cov(dat, use = "pairwise"))
. These two estimates are the same for complete data, but differ with missing data. (Package epicalc
seems to use a third way of calculating the variance(s), but I have not yet mastered their code.) What is the reason for different calculations of $\sigma_x^2$ and when should I use which one?
> require("MASS", "psych", "ltm")
> n <- 200
> it <- 10
> V <- matrix(.4, ncol = it, nrow = it)
> diag(V) <- 1
> dat <- mvrnorm(n, rep(0, it), V) # complete data
> m1 <- matrix(rbinom(n * it, 1, .1), nrow = n, ncol = it) # 10% missings
> dat[m1 == 1] <- NA # missing data
> alpha(as.data.frame(dat), na.rm = T)$total[[1]] # psych package
[1] 0.8595489
> cronbach.alpha(dat, na.rm = T)$alpha # ltm package
[1] 0.8105867
Best Answer
If data are MCAR, one would like to find an unbiased estimated of alpha. This could possibly be done via multiple imputation or listwise deletion. However, the latter might lead to severe loss of data. A third way is something like pairwise deletion which is implemented via an
na.rm
option inalpha()
of theltm
package and incronbach.alpha()
of thepsych
package.At least IMHO, the former estimate of unstandardized alpha with missing data is biased (see below). This is due to the calculation of the total variance $\sigma^2_x$ via
var(rowSums(dat, na.rm = TRUE))
. If the data are centered around 0, positive and negative values cancel each other out in the calculation ofrowSums
. With missing data, this leads to a bias ofrowSums
towards 0 and therefore to an underestimation of $\sigma^2_x$ (and alpha, in turn). Contrarily, if the data are mostly positive (or negative), missings will lead to a bias ofrowSums
towards zero this time resulting in an overestimation of $\sigma^2_x$ (and alpha, in turn).