Is there any way to perform bivariate regression using pairwise deletion of missing values in R? The na.action options in lm() do not offer such a possibility: the default na.action is na.omit, which is equivalent to listwise deletion. I already tried estimating the covariance matrix using pairwise deletion and then using the function mat.regress (package psych) with the pairwise covariance matrix. However, mat.regress computes a multiple regression (not a bivariate one). Thank you.
Solved – How to perform a bivariate regression using pairwise deletion of missing values in R
Tags: lm, missing-data, r
Related Solutions
Edit: I misunderstood your question. There are two aspects:
a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criteria. They differ only in that extractor functions like residuals() or fitted() pad their output with NAs for the omitted cases under na.exclude, so that the output has the same length as the input variables.
> N <- 20 # generate some data
> y1 <- rnorm(N, 175, 7) # criterion 1
> y2 <- rnorm(N, 30, 8) # criterion 2
> x <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3, 5)] <- NA # some NA values
> y2[c(7, 9, 11)] <- NA # some other NA values
> Y <- cbind(y1, y2) # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit) # fit with na.omit
> dim(residuals(fitO)) # use extractor function
[1] 14 2
> fitE <- lm(Y ~ x, na.action=na.exclude) # fit with na.exclude
> dim(residuals(fitE)) # use extractor function -> = N
[1] 20 2
> dim(fitE$residuals) # access residuals directly
[1] 14 2
b) The real issue is not this difference between na.omit and na.exclude. You don't seem to want casewise deletion that takes the criterion variables into account at all, which is what both do:
> X <- model.matrix(fitE) # design matrix
> dim(X) # casewise deletion -> only 14 complete cases
[1] 14 2
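Each criterion on its own offers more usable cases than the 14 complete ones, which is what the per-criterion fits below recover:
> colSums(!is.na(Y)) # cases available per criterion
y1 y2
17 17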
The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (the pseudoinverse of the design matrix $X$, giving the coefficients $\hat{\beta} = X^{+} Y$) and $H = X X^{+}$ (the hat matrix, giving the fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there is no way around fitting a separate regression for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:
> Xf <- model.matrix(~ x) # full design matrix (all cases)
# function: manually calculate coefficients and fitted values for single criterion y
> getFit <- function(y) {
+ idx <- !is.na(y) # drop cases where this criterion is NA
+ Xsvd <- svd(Xf[idx, ]) # SVD of X restricted to those cases
+ # X+ = V D^-2 V' X' = (X'X)^-1 X' (note: there might be better ways)
+ Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+ list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }
> res <- apply(Y, 2, getFit) # get fits for each column of Y
> res$y1$coefs
[,1]
(Intercept) 113.9398761
x 0.7601234
> res$y2$coefs
[,1]
(Intercept) 91.580505
x -0.805897
> coefficients(lm(y1 ~ x)) # compare with separate results from lm()
(Intercept) x
113.9398761 0.7601234
> coefficients(lm(y2 ~ x))
(Intercept) x
91.580505 -0.805897
Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could check a $QR$ decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().
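As a sketch of that QR route (assuming the same Xf and Y as above; getFitQR is just an illustrative name): lm.fit() is the QR-based workhorse that lm() itself calls, minus the formula-handling overhead.
> getFitQR <- function(y) {
+ idx <- !is.na(y) # pairwise deletion for this criterion
+ fit <- lm.fit(Xf[idx, ], y[idx]) # QR-based least squares
+ list(coefs=coef(fit), yhat=fitted(fit))
+ }
> resQR <- apply(Y, 2, getFitQR) # same structure as res above
> resQR$y1$coefs # should match res$y1$coefs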
If data are MCAR, one would like to find an unbiased estimate of alpha. This could be done via multiple imputation or via listwise deletion; however, the latter might lead to severe loss of data. A third way is something like pairwise deletion, which is implemented via an na.rm option in cronbach.alpha() of the ltm package and in alpha() of the psych package.
At least IMHO, the ltm estimate of unstandardized alpha with missing data is biased (see below). This is due to the calculation of the total variance $\sigma^2_x$ via var(rowSums(dat, na.rm = TRUE)). If the data are centered around 0, positive and negative values cancel each other out in rowSums; with missing data, each row sums fewer terms, which pulls every row sum towards 0 and therefore leads to an underestimation of $\sigma^2_x$ (and of alpha, in turn). Conversely, if the data are mostly positive (or negative), the rows lose different numbers of terms, so the row sums are shrunk towards zero by varying amounts, this time resulting in an overestimation of $\sigma^2_x$ (and of alpha, in turn).
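To see the mechanism in isolation before the full simulation, here is a minimal sketch assuming 20 independent standard-normal items (simpler than the correlated items used below), where the complete-data variance of rowSums is exactly 20:
set.seed(1)
z0 <- matrix(rnorm(10000 * 20), ncol = 20) # items with mean 0
z10 <- z0 + 10 # the same items shifted to mean 10
miss <- matrix(rbinom(10000 * 20, 1, .3) == 1, ncol = 20) # 30% MCAR
z0[miss] <- NA; z10[miss] <- NA
var(rowSums(z0, na.rm = TRUE)) # approx. 14: fewer terms per row shrink the sums
var(rowSums(z10, na.rm = TRUE)) # approx. 434: varying row counts inflate the spread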
require("MASS"); require("ltm"); require("psych")
n <- 10000
it <- 20
V <- matrix(.4, ncol = it, nrow = it)
diag(V) <- 1
dat <- mvrnorm(n, rep(0, it), V) # mean of 0!!!
p <- c(0, .1, .2, .3)
names(p) <- paste("% miss=", p, sep="")
cols <- c("alpha.ltm", "var.tot.ltm", "alpha.psych", "var.tot.psych")
names(cols) <- cols
res <- matrix(nrow = length(p), ncol = length(cols), dimnames = list(names(p), names(cols)))
for(i in 1:length(p)){
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it) # MCAR indicators
  dat1 <- dat
  dat1[m1 == 1] <- NA # impose missingness at rate p[i]
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
  res[i, 2] <- var(rowSums(dat1, na.rm = TRUE)) # total variance as in ltm
  res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise")) # pairwise total variance (psych)
}
round(res, 2)
## alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0 0.93 168.35 0.93 168.35
## % miss=0.1 0.90 138.21 0.93 168.32
## % miss=0.2 0.86 110.34 0.93 167.88
## % miss=0.3 0.81 86.26 0.93 167.41
dat <- mvrnorm(n, rep(10, it), V) # this time, mean of 10!!!
for(i in 1:length(p)){ # same procedure as above, now for the shifted data
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
  res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))
  res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise"))
}
round(res, 2)
## alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0 0.93 168.31 0.93 168.31
## % miss=0.1 0.99 316.27 0.93 168.60
## % miss=0.2 1.00 430.78 0.93 167.61
## % miss=0.3 1.01 511.30 0.93 167.43
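For intuition about why the pairwise version stays unbiased, alpha can also be computed directly from the pairwise covariance matrix via the standard formula (a sketch, not necessarily psych's internal code):
C <- cov(dat1, use = "pairwise") # pairwise covariance matrix of the items
it/(it - 1) * (1 - sum(diag(C))/sum(C)) # Cronbach's alpha from C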
Best Answer
When you use pairwise deletion to estimate a covariance matrix, it just means that for any pair of variables you use all observations that are not missing on either of the two variables.

So if you had a data matrix like, say,

  A  B  C
1 1  2  NA
2 3  NA 4
3 NA 5  6
4 7  8  9

then when calculating the covariance between columns A and B you would use rows (observations) 1 and 4, and when calculating the covariance between A and C you would only use rows 2 and 4. So in the case of bivariate regression, i.e. simple linear regression with one predictor and one outcome, there is only a single pair of variables, and pairwise deletion is therefore equivalent to listwise deletion.