Solved – Pairwise deletion in multiple regression

missing data, regression

Approximately 50% of cases are missing data on one of my predictor variables. With the default option selected (listwise treatment of missing data), the models produced are weak. This is probably because the listwise option reduces n substantially.

The alternative (pairwise exclusion), when selected, produces a strong model (the total variance explained is about 50%) with a number of significant predictors (the variable with 50% missing data is a significant predictor in this model).

However, this sounds a bit too optimistic. I've read that when pairwise exclusion is selected, SPSS will base degrees of freedom for significance testing on the number of cases with complete data (in this case, 32) rather than on the total number of cases. From what I understand, this means that the significant effects may be exaggerations.

Am I right to be concerned about the potential for exaggerated effects when pairwise exclusion is selected? Or are the parameter estimates (and the model as a whole) still trustworthy?

Best Answer

When you have so much missing data, the first concern is why the data are missing. They can be missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR). Searching on missing data here, or on any of those terms in Google, should give you lots of information.

Neither listwise nor pairwise deletion is a good option with so much missing data. If the data are MCAR or MAR, then it is certainly worthwhile looking at multiple imputation. Even if they are NMAR, multiple imputation may be best.

I don't know about SPSS capacity with regard to multiple imputation (I am not an SPSS user) but both R and SAS have excellent abilities in this regard.
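As a rough illustration of what multiple imputation looks like in R, here is a minimal sketch using the mice package (my choice for the example; the answer does not name a package). The simulated data, the variable names y, x1 and x2, and the choice of m = 20 imputations are all assumptions made for illustration, not the asker's data.

```r
library(mice)  # assumes the mice package is installed

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
x2[sample(n, n / 2)] <- NA           # 50% missing on one predictor (MCAR here)
d  <- data.frame(y, x1, x2)

imp    <- mice(d, m = 20, printFlag = FALSE, seed = 1)  # create 20 imputed datasets
fit    <- with(imp, lm(y ~ x1 + x2))                    # fit the model in each dataset
pooled <- pool(fit)                                     # combine with Rubin's rules
summary(pooled)
```

The pooled standard errors reflect the uncertainty due to the missing values, which a single filled-in dataset would not.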

Related Solutions

Solved – When, if ever, to use pairwise deletion in multiple regression

Pairwise is a dangerous method in this case, IMO. If you delete pairwise then you'll end up with different numbers of observations contributing to different parts of your model, which can make interpretation difficult.
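To make the "different numbers of observations" point concrete, here is a small base R sketch on simulated data (the variables y, x1 and x2 are hypothetical). It counts the cases available for each pair of variables and then contrasts a listwise fit with coefficients computed from a pairwise covariance matrix, which is roughly what a pairwise regression option does.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
x2[sample(n, n / 2)] <- NA            # 50% missing on one predictor
d  <- data.frame(y, x1, x2)

# Number of cases contributing to each variance (diagonal) and covariance (off-diagonal)
obs    <- (!is.na(as.matrix(d))) + 0  # 1 = observed, 0 = missing
pair_n <- crossprod(obs)
pair_n

# Listwise: lm() drops every row with any missing value
fit_listwise <- lm(y ~ x1 + x2, data = d)
nobs(fit_listwise)

# Pairwise: slopes from a covariance matrix whose entries rest on different n
S     <- cov(d, use = "pairwise.complete.obs")
preds <- c("x1", "x2")
beta_pairwise <- solve(S[preds, preds], S[preds, "y"])
beta_pairwise
```

Because the entries of S are based on different subsets of cases, there is no single n that is obviously correct for the degrees of freedom, and S is not even guaranteed to be positive definite.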

That being said, casewise deletion tends to discard lots and lots of information, so I suppose it depends on both the proportion of missing responses, and your sample size.

Personally, I would probably use the multiple imputation procedure in SPSS and run the analyses for each dataset, then combine if nothing looks odd.

This would be my strategy of choice with a high proportion of missing values, whereas if the number is small, casewise deletion would probably be my first choice.

Solved – Cronbach’s Alpha with missing data

If data are MCAR, one would like to find an unbiased estimate of alpha. This could possibly be done via multiple imputation or listwise deletion. However, the latter might lead to severe loss of data. A third way is something like pairwise deletion, which is implemented via an na.rm option in cronbach.alpha() of the ltm package and in alpha() of the psych package.

At least IMHO, the ltm estimate of unstandardized alpha (cronbach.alpha()) with missing data is biased (see below). This is due to the calculation of the total variance $\sigma^2_x$ via var(rowSums(dat, na.rm = TRUE)), which effectively treats every missing value as 0. If the data are centered around 0, positive and negative values cancel each other out in the calculation of rowSums. With missing data, this pulls the row sums towards 0 and therefore leads to an underestimation of $\sigma^2_x$ (and alpha, in turn). Conversely, if the data are mostly positive (or negative), missing values again pull the row sums towards 0, but now the sums differ strongly depending on how many items are missing in a row, which results in an overestimation of $\sigma^2_x$ (and alpha, in turn).
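For reference, the standard formula for unstandardized alpha shows why a biased $\sigma^2_x$ carries over to alpha. The symbols $k$ (number of items) and $\sigma^2_i$ (variance of item $i$) are introduced here for illustration; $\sigma^2_x$ is the total variance as above.

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_i}{\sigma^2_x}\right)$$

With the population values used in the simulation below ($k = 20$, unit item variances, all covariances 0.4), $\sigma^2_x = 20 + 20 \cdot 19 \cdot 0.4 = 172$ and $\alpha = \frac{20}{19}\left(1 - \frac{20}{172}\right) \approx 0.93$. Underestimating $\sigma^2_x$ lowers alpha, while inflating it pushes alpha towards its upper limit of $k/(k-1) \approx 1.05$, which is how values above 1 can arise.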

library(MASS); library(ltm); library(psych)

n  <- 10000                             # respondents
it <- 20                                # items
V  <- matrix(.4, ncol = it, nrow = it)  # all inter-item covariances 0.4
diag(V) <- 1                            # unit item variances
dat <- mvrnorm(n, rep(0, it), V)  # mean of 0!!!
p <- c(0, .1, .2, .3)                   # proportion of values set to missing
names(p) <- paste("% miss=", p, sep="")
cols <- c("alpha.ltm", "var.tot.ltm", "alpha.psych", "var.tot.psych")
names(cols) <- cols
res <- matrix(nrow = length(p), ncol = length(cols), dimnames = list(names(p), names(cols)))
for (i in seq_along(p)) {
  # delete each value independently with probability p[i] (MCAR)
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
  res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))  # total variance from row sums (missing counted as 0)
  res[i, 3] <- psych::alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise.complete.obs"))  # total variance from pairwise covariances
}
round(res, 2)
##            alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0        0.93      168.35        0.93        168.35
## % miss=0.1      0.90      138.21        0.93        168.32
## % miss=0.2      0.86      110.34        0.93        167.88
## % miss=0.3      0.81       86.26        0.93        167.41
dat <- mvrnorm(n, rep(10, it), V)  # this time, mean of 10!!!
for (i in seq_along(p)) {
  # delete each value independently with probability p[i] (MCAR)
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
  res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))  # total variance from row sums (missing counted as 0)
  res[i, 3] <- psych::alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise.complete.obs"))  # total variance from pairwise covariances
}
round(res, 2)
##            alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0        0.93      168.31        0.93        168.31
## % miss=0.1      0.99      316.27        0.93        168.60
## % miss=0.2      1.00      430.78        0.93        167.61
## % miss=0.3      1.01      511.30        0.93        167.43
