I'd like to regress each column of a matrix A against a vector B. This is trivial if there are no missing data, but if A contains missing values, then the regression against A is constrained to include only rows where all values are present (the default na.omit behavior). This produces incorrect results for columns with no missing data. I can fit the columns of A against B one at a time, but I have thousands of regressions to do, and that is prohibitively slow and inelegant. na.exclude seems to be designed for this case, but I can't make it work. What am I doing wrong here? Using R 2.13 on OSX, if it matters.
A = matrix(1:20, nrow=10, ncol=2)
B = matrix(1:10, nrow=10, ncol=1)
dim(lm(A~B)$residuals)
# [1] 10 2 (the expected 10 residual values)

# Missing value in first column; now we have 9 residuals
A[1,1] = NA
dim(lm(A~B)$residuals)
# [1] 9 2 (the expected 9 residuals, given na.omit() is the default)

# Call lm with na.exclude; still have 9 residuals
dim(lm(A~B, na.action=na.exclude)$residuals)
# [1] 9 2 (was hoping to get a 10x2 matrix with a missing value here)

A.ex = na.exclude(A)
dim(lm(A.ex~B)$residuals)
# Throws an error because dim(A.ex) is 9 2:
# Error in model.frame.default(formula = A.ex ~ B, drop.unused.levels = TRUE) :
#   variable lengths differ (found for 'B')
Best Answer
Edit: I misunderstood your question. There are two aspects:

a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criteria. They differ only in that extractor functions like residuals() or fitted() pad their output with NAs for the omitted cases under na.exclude, so the output has the same length as the input variables.
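A quick illustration of that padding with a single response vector (a univariate version of the question's setup; the values are made up for the demo):

# na.omit and na.exclude admit exactly the same cases to the fit;
# only the length of the extractor output differs
y <- c(NA, 2:10)
x <- 1:10
length(residuals(lm(y ~ x, na.action=na.omit)))     # 9
length(residuals(lm(y ~ x, na.action=na.exclude)))  # 10, first element is NA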
b) The real issue is not this difference between na.omit and na.exclude: you don't seem to want casewise deletion that takes the criterion variables into account, which both do.

The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (the pseudoinverse of the design matrix $X$; coefficients $\hat{\beta} = X^{+} Y$) and the hat matrix $H = X X^{+}$ (fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there is no way around fitting separate regressions for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:
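A minimal sketch of that idea, reusing the question's A and B (the helper name fitColumnwise and its return structure are just illustrative):

# Fit Y[, j] ~ X separately for each criterion column j, using only
# the rows where that column is observed; fitted values and residuals
# are padded with NA for the dropped rows.
fitColumnwise <- function(Y, X) {
    X <- cbind(1, X)                                 # add intercept column
    lapply(seq_len(ncol(Y)), function(j) {
        ok    <- !is.na(Y[, j])                      # rows observed for this criterion
        Xok   <- X[ok, , drop=FALSE]
        Xplus <- solve(crossprod(Xok)) %*% t(Xok)    # X^+ = (X'X)^{-1} X'
        b     <- Xplus %*% Y[ok, j]                  # beta-hat = X^+ y
        fit   <- resid <- rep(NA_real_, nrow(Y))
        fit[ok]   <- Xok %*% b                       # y-hat = H y
        resid[ok] <- Y[ok, j] - fit[ok]
        list(coef=drop(b), fitted=fit, residuals=resid)
    })
}

fits <- fitColumnwise(A, B)
sapply(fits, function(f) sum(!is.na(f$residuals)))   # 9 10: per-column deletion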
Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could check a $QR$ decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().
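For instance, a per-column QR-based fit could look like this (a sketch for the first column of A, again with the question's data; lm.fit() does essentially this internally):

# QR decomposition avoids forming (X'X)^{-1} explicitly
ok  <- !is.na(A[, 1])
qrX <- qr(cbind(1, B[ok, ]))   # QR of the intercept-augmented design
qr.coef(qrX, A[ok, 1])         # coefficients
qr.resid(qrX, A[ok, 1])        # residuals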