Solved – How does R handle missing values in lm

Tags: linear model, missing data, r

I'd like to regress each column of a matrix A against a vector B. This is trivial if there are no missing data, but if matrix A contains missing values, the regression is constrained to the rows where all columns are present (the default na.omit behavior). This gives incorrect results for columns that have no missing data of their own. I can regress each column of the matrix A against the column matrix B individually, but I have thousands of regressions to do, and that is prohibitively slow and inelegant. The na.exclude function seems to be designed for this case, but I can't make it work. What am I doing wrong here? Using R 2.13 on OSX, if it matters.

A = matrix(1:20, nrow=10, ncol=2)
B = matrix(1:10, nrow=10, ncol=1)
dim(lm(A~B)$residuals)
# [1] 10 2 (the expected 10 residual values)

# Missing value in first column; now we have 9 residuals
A[1,1] = NA  
dim(lm(A~B)$residuals)
#[1]  9 2 (the expected 9 residuals, given na.omit() is the default)

# Call lm with na.exclude; still have 9 residuals
dim(lm(A~B, na.action=na.exclude)$residuals)
#[1]  9 2 (was hoping to get a 10x2 matrix with a missing value here)

A.ex = na.exclude(A)
dim(lm(A.ex~B)$residuals)
# Throws an error because dim(A.ex)==9,2
#Error in model.frame.default(formula = A.ex ~ B, drop.unused.levels = TRUE) : 
#  variable lengths differ (found for 'B')
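For context, here is a sketch of the per-column workaround alluded to above (my addition, not part of the original question): fitting each column of A separately with na.exclude keeps full-length residuals per column, but it means one lm() call per column.

# per-column baseline: one lm() fit per column of A (slow for thousands of columns)
fits <- lapply(seq_len(ncol(A)), function(j) lm(A[, j] ~ B, na.action=na.exclude))
resid.mat <- sapply(fits, residuals)   # 10 x 2, NA only where that column had NA
dim(resid.mat)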

Best Answer

Edit: I misunderstood your question. There are two aspects:

a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criterion variables. They differ only in that, with na.exclude, extractor functions like residuals() or fitted() pad their output with NAs for the omitted cases, so the output has the same length as the input variables.

> N    <- 20                               # generate some data
> y1   <- rnorm(N, 175, 7)                 # criterion 1
> y2   <- rnorm(N,  30, 8)                 # criterion 2
> x    <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3,  5)] <- NA                    # some NA values
> y2[c(7, 9, 11)] <- NA                    # some other NA values
> Y    <- cbind(y1, y2)                    # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit)     # fit with na.omit
> dim(residuals(fitO))                     # use extractor function
[1] 14  2

> fitE <- lm(Y ~ x, na.action=na.exclude)  # fit with na.exclude
> dim(residuals(fitE))                     # use extractor function -> = N
[1] 20  2

> dim(fitE$residuals)                      # access residuals directly
[1] 14  2
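A quick check of where that padding ends up (this check is my addition; given the NA positions above, it should flag exactly the six incomplete cases):

> which(!complete.cases(residuals(fitE)))  # rows padded with NA by na.exclude
[1]  1  3  5  7  9 11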

b) The real issue is not this difference between na.omit and na.exclude. It seems you don't want casewise deletion that takes the criterion variables into account at all, which is what both of them do.

> X <- model.matrix(fitE)                  # design matrix
> dim(X)                                   # casewise deletion -> only 14 complete cases
[1] 14  2

The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (the pseudoinverse of the design matrix $X$, giving the coefficients $\hat{\beta} = X^{+} Y$) and the hat matrix $H = X X^{+}$ (giving the fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there's no way around fitting separate regressions for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:

> Xf <- model.matrix(~ x)                    # full design matrix (all cases)
> # manually calculate coefficients and fitted values for a single criterion y
> getFit <- function(y) {
+     idx   <- !is.na(y)                     # keep only cases where this criterion is observed
+     Xsvd  <- svd(Xf[idx, ])                # SVD decomposition of X
+     # pseudoinverse X+ = V D^-2 V' X'; note: there might be better ways
+     Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+     list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }

> res <- apply(Y, 2, getFit)    # get fits for each column of Y
> res$y1$coefs
                   [,1]
(Intercept) 113.9398761
x             0.7601234

> res$y2$coefs
                 [,1]
(Intercept) 91.580505
x           -0.805897

> coefficients(lm(y1 ~ x))      # compare with separate results from lm()
(Intercept)           x 
113.9398761   0.7601234 

> coefficients(lm(y2 ~ x))
(Intercept)           x 
  91.580505   -0.805897

Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could use a $QR$ decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().
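As a rough sketch of that QR route (my addition, not from the original answer): lm.fit() fits via a QR decomposition and skips the formula/model-frame overhead of lm(), so the same per-column idea can be written as below. The helper name getFitQR is made up, and the timing comparison against lm() would still need to be done.

# sketch: QR-based per-column fit via lm.fit(), assuming Xf and Y as defined above
getFitQR <- function(y) {
    idx <- !is.na(y)                              # complete cases for this criterion only
    fit <- lm.fit(Xf[idx, , drop=FALSE], y[idx])  # QR fit without lm()'s formula overhead
    list(coefs=fit$coefficients, yhat=fit$fitted.values)
}
resQR <- apply(Y, 2, getFitQR)                    # one fit per column of Y
system.time(apply(Y, 2, getFitQR))                # compare timing against per-column lm()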