I have a data frame with 61 columns, and some of the data is missing. I read about aregImpute in Hmisc in Steyerberg's book. I used it with the default parameters and with all columns of my data frame in the formula.
g <- aregImpute(formula = ~ ..., n.impute=5, data=d)
b <- data.frame(rsq=g$rsq, y=attributes(g$rsq))
b <- b[order(-b$rsq), ]
row.names(b) <- NULL
b
imputed <- impute.transcan(g, data=d, imputation=1, list.out=TRUE, pr=FALSE, check=FALSE)
i <- d
i[names(imputed)] <- imputed
head(i)
Before imputation I checked the "R-squares for Predicting Non-Missing Values for Each Variable" reported for g. Afterwards I imputed the missing data as shown in Steyerberg's example.
- How low can the R-square be for an imputation to still be good enough for further use?
- When should I force linear transformations of continuous variables using I(x) in the formula, and when not?
- How does multiple imputation work? Would I create n imputed data frames and perform the subsequent analysis n times?
UPDATE
> (fmla <- as.formula(paste(" ~ ", paste(var.names, collapse=" +"))))
~x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 +
x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + x22 +
x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + x32 +
x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + x42 +
x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 + x52 +
x53 + x54 + x55 + x56 + x57 + x58
> g <- aregImpute(formula = fmla, n.impute=5, data=d)
fewer than 3 unique knots. Frequency table of variable:
x
1 2 3 4
641 51 10 11
Error in rcspline.eval(z, knots = parms, nk = nk, inclx = TRUE) :
In addition: Warning message:
In rcspline.eval(z, knots = parms, nk = nk, inclx = TRUE) :
3 knots requested with 4 unique values of x. knots set to 2 interior values.
aregImpute gives me an error that I cannot solve. None of my variables has a frequency table with exactly these numbers. What could be the problem?
Best Answer
You can't really use $R^2$ in the manner you suggest. If $R^2$ is 0 you can still use multiple imputation; it's just not better than a random guess in that case.
Force linearity of all the variables if you have a very small sample size relative to the number of variables.
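As a minimal sketch of forcing linearity through the formula: the helper below (the name `linear_formula` and the threshold `min_unique` are hypothetical, not part of Hmisc) wraps numeric columns with few distinct values in I() and leaves everything else alone. Treating low-cardinality variables as the ones to linearize is an assumption, prompted by the knots warning in the question; with a very small sample you might wrap every continuous variable instead.

```r
# Hypothetical helper (not part of Hmisc): build a one-sided formula in which
# numeric columns with fewer than `min_unique` distinct values are wrapped in
# I(), the marker used in the formula to request a linear transformation.
linear_formula <- function(data, min_unique = 5) {
  terms <- vapply(names(data), function(v) {
    x <- data[[v]]
    n_distinct <- length(unique(x[!is.na(x)]))
    if (is.numeric(x) && n_distinct < min_unique) {
      sprintf("I(%s)", v)   # force a linear term
    } else {
      v                     # leave the default handling in place
    }
  }, character(1))
  as.formula(paste("~", paste(terms, collapse = " + ")))
}

# Toy data (made up): x2 takes only four distinct values,
# like the variable shown in the warning.
toy <- data.frame(
  x1 = c(0.3, 1.2, NA, 2.8, 3.1, 4.7, 5.0, 6.4),
  x2 = c(1, 1, 2, 1, 3, NA, 4, 1),
  x3 = factor(c("a", "b", "a", NA, "b", "a", "b", "a"))
)
fmla <- linear_formula(toy)
print(fmla)

# With real data the formula would then be passed on, e.g.
# g <- aregImpute(formula = fmla, n.impute = 5, data = d)
```

Counting distinct values per column this way is also a quick check for which variables are likely to trigger the knots message.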
aregImpute stores all the multiple imputations; the fit.mult.impute function then retrieves them, one imputation at a time, to fill in the original dataset.
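To make the "one imputation at a time" idea concrete, here is a rough base-R sketch of the workflow: fill in the data n times, run the same analysis on each completed copy, then pool the results with Rubin's rules. The toy data, the model `y ~ x`, and the naive random-draw fill-in are all assumptions for illustration; the fill-in is only a stand-in and is not what aregImpute does. In practice, fit.mult.impute takes the analysis formula, a fitting function, the aregImpute object, and the data, and does the fitting and pooling for you.

```r
# Toy data with missing values in x (made up for illustration).
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$x[sample(n, 20)] <- NA

# Step 1: create n_impute completed data sets.
# Stand-in imputation: draw from the observed values at random.
n_impute <- 5
miss <- is.na(d$x)
completed <- lapply(seq_len(n_impute), function(i) {
  di <- d
  di$x[miss] <- sample(d$x[!miss], sum(miss), replace = TRUE)
  di
})

# Step 2: run the same analysis once per completed data set.
fits <- lapply(completed, function(di) lm(y ~ x, data = di))

# Step 3: pool with Rubin's rules.
coefs   <- sapply(fits, coef)                           # p x m matrix of estimates
within  <- Reduce(`+`, lapply(fits, vcov)) / n_impute   # mean within-imputation covariance
between <- cov(t(coefs))                                # between-imputation covariance
pooled_coef <- rowMeans(coefs)
pooled_vcov <- within + (1 + 1 / n_impute) * between
pooled_se   <- sqrt(diag(pooled_vcov))

print(cbind(estimate = pooled_coef, se = pooled_se))
```

The pooled variance is never smaller than the average within-imputation variance, which is how the extra uncertainty from the missing data shows up in the final standard errors.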