Solved – Cronbach’s Alpha with missing data

cronbachs-alphamissing datarreliability

In R, there are several packages that calculate Cronbach's alpha. If the data contain no missing values, all packages I visited converge to the same value. With missing data, listwise deletion is a possible way to go (the only option in SPSS or packages MBESS and psy btw). However, listwise deletion might lead to dropping a lot of data and therefore something like pairwise deletion might seem more appealing in some situations (let's say data are MCAR).
Some of these packages have an option like na.rm = T, but then the results differ, and they differ more the more missings a data set contains. The reason for this is the calculation of the variance-covariance matrix $S$. For example, the package coefficientalpha employs a robust $S$ matrix naturally leading to different results. However, other packages use pairwise $S$ matrices and still the results differ.
Let's make a specific example. The formula for alpha is
$$\alpha = \frac{p}{p-1}(1-\frac{\sum_{i=1}^p \sigma_{y_i}^2}{\sigma_x^2})$$
with the number of items $p$, the variance of the $i$th item $\sigma_{y_i}^2$ and the variance of the total score $\sigma_x^2$. For the total variance in the denominator, the package ltm uses something like var(rowSums(dat, na.rm = T)), whereas the package psych uses something like sum(cov(dat, use = "pairwise")). These two estimates are the same for complete data, but differ with missing data. (Package epicalc seems to use a third way of calculating the variance(s), but I have not yet mastered their code.) What is the reason for different calculations of $\sigma_x^2$ and when should I use which one?

> require("MASS", "psych", "ltm")
> n <- 200
> it <- 10
> V <- matrix(.4, ncol = it, nrow = it)
> diag(V) <- 1
> dat <- mvrnorm(n, rep(0, it), V)   # complete data
> m1 <- matrix(rbinom(n * it, 1, .1), nrow = n, ncol = it)   # 10% missings
> dat[m1 == 1] <- NA   # missing data
> alpha(as.data.frame(dat), na.rm = T)$total[[1]]   # psych package  
[1] 0.8595489
> cronbach.alpha(dat, na.rm = T)$alpha   # ltm package
[1] 0.8105867

Best Answer

If data are MCAR, one would like to find an unbiased estimated of alpha. This could possibly be done via multiple imputation or listwise deletion. However, the latter might lead to severe loss of data. A third way is something like pairwise deletion which is implemented via an na.rm option in alpha() of the ltm package and in cronbach.alpha() of the psych package.

At least IMHO, the former estimate of unstandardized alpha with missing data is biased (see below). This is due to the calculation of the total variance $\sigma^2_x$ via var(rowSums(dat, na.rm = TRUE)). If the data are centered around 0, positive and negative values cancel each other out in the calculation of rowSums. With missing data, this leads to a bias of rowSums towards 0 and therefore to an underestimation of $\sigma^2_x$ (and alpha, in turn). Contrarily, if the data are mostly positive (or negative), missings will lead to a bias of rowSums towards zero this time resulting in an overestimation of $\sigma^2_x$ (and alpha, in turn).

require("MASS"); require("ltm"); require("psych")
n <- 10000
it <- 20
V <- matrix(.4, ncol = it, nrow = it)
diag(V) <- 1
dat <- mvrnorm(n, rep(0, it), V)  # mean of 0!!!
p <- c(0, .1, .2, .3)
names(p) <- paste("% miss=", p, sep="")
cols <- c("alpha.ltm", "var.tot.ltm", "alpha.psych", "var.tot.psych")
names(cols) <- cols
res <- matrix(nrow = length(p), ncol = length(cols), dimnames = list(names(p), names(cols)))
for(i in 1:length(p)){
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
      res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))
      res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise"))
}
round(res, 2)
##            alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0        0.93      168.35        0.93        168.35
## % miss=0.1      0.90      138.21        0.93        168.32
## % miss=0.2      0.86      110.34        0.93        167.88
## % miss=0.3      0.81       86.26        0.93        167.41
dat <- mvrnorm(n, rep(10, it), V)  # this time, mean of 10!!!
for(i in 1:length(p)){
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
      res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))
      res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise"))
}
round(res, 2)
##            alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0        0.93      168.31        0.93        168.31
## % miss=0.1      0.99      316.27        0.93        168.60
## % miss=0.2      1.00      430.78        0.93        167.61
## % miss=0.3      1.01      511.30        0.93        167.43

Related Solutions

Solved – Pairwise deletion in multiple regression

When you have so much missing data, the first concern is why the data are missing. They can be missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR). Searching on missing data here, or on any of those terms in Google, should give you lots of information.

Neither listwise nor pairwise deletion are good options with so much missing. If the data are MCAR or MAR, then it is certainly worthwhile looking at multiple imputation. Even if they are NMAR, multiple imputation may be best.

I don't know about SPSS capacity with regard to multiple imputation (I am not an SPSS user) but both R and SAS have excellent abilities in this regard.

Solved – How to perform a bivariate regression using pairwise deletion of missing values in R

When you use pair wise deletion to estimate a covariance matrix it just means that for any pair of variables you use all available observations that are not missing on either covariate.

So if you had a data matrix

#| A  B  C
-----------
1| 1  1  NA
2| 2  NA 2
3| NA 3  3
4| 4  4  4

When calculating the covariance between columns A and B you would use rows (observations) 1 and 4, and when calculating the covariance between A and C you would only use rows 2 and 4.

So in the case of bivariate regression or simple linear regression, it is equivalent to list-wise deletion.

Best Answer

Related Solutions

Solved – Pairwise deletion in multiple regression

Solved – How to perform a bivariate regression using pairwise deletion of missing values in R

Related Question