Is there any way to perform bivariate regression using pairwise deletion of missing values in R? The na.action options in lm() do not offer such a possibility: the default na.action is na.omit, which is equivalent to listwise deletion. I already tried estimating the covariance matrix using pairwise deletion and then using the function mat.regress (package psych) with the pairwise covariance matrix. However, mat.regress computes a multiple regression (not a bivariate one). Thank you.
Solved – How to perform a bivariate regression using pairwise deletion of missing values in R
Tags: lm, missing-data, r
Related Solutions
Edit: I misunderstood your question. There are two aspects:
a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criteria. They differ only in that extractor functions like residuals() or fitted() pad their output with NAs for the omitted cases under na.exclude, so that the output has the same length as the input variables.
> N <- 20 # generate some data
> y1 <- rnorm(N, 175, 7) # criterion 1
> y2 <- rnorm(N, 30, 8) # criterion 2
> x <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3, 5)] <- NA # some NA values
> y2[c(7, 9, 11)] <- NA # some other NA values
> Y <- cbind(y1, y2) # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit) # fit with na.omit
> dim(residuals(fitO)) # use extractor function
[1] 14 2
> fitE <- lm(Y ~ x, na.action=na.exclude) # fit with na.exclude
> dim(residuals(fitE)) # use extractor function -> = N
[1] 20 2
> dim(fitE$residuals) # access residuals directly
[1] 14 2
b) The real issue is not this difference between na.omit and na.exclude. You don't seem to want casewise deletion that takes the criterion variables into account at all, which is what both do:
> X <- model.matrix(fitE) # design matrix
> dim(X) # casewise deletion -> only 14 complete cases
[1] 14 2
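Each criterion on its own offers more usable cases than the 14 complete ones, which is what the per-criterion fits below recover:
> colSums(!is.na(Y)) # cases available per criterion
y1 y2
17 17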
The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (the pseudoinverse of the design matrix $X$, giving the coefficients $\hat{\beta} = X^{+} Y$) and $H = X X^{+}$ (the hat matrix, giving the fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there is no way around fitting a separate regression for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:
> Xf <- model.matrix(~ x) # full design matrix (all cases)
# function: manually calculate coefficients and fitted values for single criterion y
> getFit <- function(y) {
+ idx <- !is.na(y) # drop cases where this criterion is NA
+ Xsvd <- svd(Xf[idx, ]) # SVD of X restricted to those cases
+ # X+ = V D^-2 V' X' = (X'X)^-1 X' (note: there might be better ways)
+ Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+ list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }
> res <- apply(Y, 2, getFit) # get fits for each column of Y
> res$y1$coefs
[,1]
(Intercept) 113.9398761
x 0.7601234
> res$y2$coefs
[,1]
(Intercept) 91.580505
x -0.805897
> coefficients(lm(y1 ~ x)) # compare with separate results from lm()
(Intercept) x
113.9398761 0.7601234
> coefficients(lm(y2 ~ x))
(Intercept) x
91.580505 -0.805897
Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could check a $QR$ decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().
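As a sketch of that QR route (assuming the same Xf and Y as above; getFitQR is just an illustrative name): lm.fit() is the QR-based workhorse that lm() itself calls, minus the formula-handling overhead.
> getFitQR <- function(y) {
+ idx <- !is.na(y) # pairwise deletion for this criterion
+ fit <- lm.fit(Xf[idx, ], y[idx]) # QR-based least squares
+ list(coefs=coef(fit), yhat=fitted(fit))
+ }
> resQR <- apply(Y, 2, getFitQR) # same structure as res above
> resQR$y1$coefs # should match res$y1$coefs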
If data are MCAR, one would like to find an unbiased estimate of alpha. This could be done via multiple imputation or via listwise deletion; however, the latter might lead to severe loss of data. A third way is something like pairwise deletion, which is implemented via an na.rm option in cronbach.alpha() of the ltm package and in alpha() of the psych package.
At least IMHO, the ltm estimate of unstandardized alpha with missing data is biased (see below). This is due to the calculation of the total variance $\sigma^2_x$ via var(rowSums(dat, na.rm = TRUE)). If the data are centered around 0, positive and negative values cancel each other out in rowSums; with missing data, each row sums fewer terms, which pulls every row sum towards 0 and therefore leads to an underestimation of $\sigma^2_x$ (and of alpha, in turn). Conversely, if the data are mostly positive (or negative), the rows lose different numbers of terms, so the row sums are shrunk towards zero by varying amounts, this time resulting in an overestimation of $\sigma^2_x$ (and of alpha, in turn).
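To see the mechanism in isolation before the full simulation, here is a minimal sketch assuming 20 independent standard-normal items (simpler than the correlated items used below), where the complete-data variance of rowSums is exactly 20:
set.seed(1)
z0 <- matrix(rnorm(10000 * 20), ncol = 20) # items with mean 0
z10 <- z0 + 10 # the same items shifted to mean 10
miss <- matrix(rbinom(10000 * 20, 1, .3) == 1, ncol = 20) # 30% MCAR
z0[miss] <- NA; z10[miss] <- NA
var(rowSums(z0, na.rm = TRUE)) # approx. 14: fewer terms per row shrink the sums
var(rowSums(z10, na.rm = TRUE)) # approx. 434: varying row counts inflate the spread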
require("MASS"); require("ltm"); require("psych")
n <- 10000
it <- 20
V <- matrix(.4, ncol = it, nrow = it)
diag(V) <- 1
dat <- mvrnorm(n, rep(0, it), V) # mean of 0!!!
p <- c(0, .1, .2, .3)
names(p) <- paste("% miss=", p, sep="")
cols <- c("alpha.ltm", "var.tot.ltm", "alpha.psych", "var.tot.psych")
names(cols) <- cols
res <- matrix(nrow = length(p), ncol = length(cols), dimnames = list(names(p), names(cols)))
for(i in 1:length(p)){
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it) # MCAR indicators
  dat1 <- dat
  dat1[m1 == 1] <- NA # impose missingness at rate p[i]
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
  res[i, 2] <- var(rowSums(dat1, na.rm = TRUE)) # total variance as in ltm
  res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise")) # pairwise total variance (psych)
}
round(res, 2)
## alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0 0.93 168.35 0.93 168.35
## % miss=0.1 0.90 138.21 0.93 168.32
## % miss=0.2 0.86 110.34 0.93 167.88
## % miss=0.3 0.81 86.26 0.93 167.41
dat <- mvrnorm(n, rep(10, it), V) # this time, mean of 10!!!
for(i in 1:length(p)){ # same procedure as above, now for the shifted data
  m1 <- matrix(rbinom(n * it, 1, p[i]), nrow = n, ncol = it)
  dat1 <- dat
  dat1[m1 == 1] <- NA
  res[i, 1] <- cronbach.alpha(dat1, standardized = FALSE, na.rm = TRUE)$alpha
  res[i, 2] <- var(rowSums(dat1, na.rm = TRUE))
  res[i, 3] <- alpha(as.data.frame(dat1), na.rm = TRUE)$total[[1]]
  res[i, 4] <- sum(cov(dat1, use = "pairwise"))
}
round(res, 2)
## alpha.ltm var.tot.ltm alpha.psych var.tot.psych
## % miss=0 0.93 168.31 0.93 168.31
## % miss=0.1 0.99 316.27 0.93 168.60
## % miss=0.2 1.00 430.78 0.93 167.61
## % miss=0.3 1.01 511.30 0.93 167.43
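For intuition about why the pairwise version stays unbiased, alpha can also be computed directly from the pairwise covariance matrix via the standard formula (a sketch, not necessarily psych's internal code):
C <- cov(dat1, use = "pairwise") # pairwise covariance matrix of the items
it/(it - 1) * (1 - sum(diag(C))/sum(C)) # Cronbach's alpha from C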
Best Answer
When you use pairwise deletion to estimate a covariance matrix, it just means that for any pair of variables you use all observations that are not missing on either of the two variables.

So if you had a data matrix like, say,

  A  B  C
1 1  2  NA
2 3  NA 4
3 NA 5  6
4 7  8  9

then when calculating the covariance between columns A and B you would use rows (observations) 1 and 4, and when calculating the covariance between A and C you would only use rows 2 and 4. So in the case of bivariate regression, i.e. simple linear regression with one predictor and one outcome, there is only a single pair of variables, and pairwise deletion is therefore equivalent to listwise deletion.