Solved – Impute missing values using aregImpute

cox-model, missing-data, r, survival

I have a data frame with 61 columns, some of which contain missing values. I read about aregImpute from the Hmisc package in Steyerberg's book. I used it with default parameters and all columns of my data frame in the formula.

g <- aregImpute(formula = ~ ..., n.impute=5, data=d)

b <- data.frame(rsq=g$rsq, variable=names(g$rsq))
b <- b[order(-b$rsq), ]
row.names(b) <- NULL
b

imputed <- impute.transcan(g, data=d, imputation=1, list.out=TRUE, pr=FALSE, check=FALSE)
i <- d
i[names(imputed)] <- imputed
head(i)

Before imputation, I checked the R-squares for predicting non-missing values for each variable in g. Afterwards, I imputed the missing data as shown in Steyerberg's example.

  1. Above which value of R-square would you say an imputation is good enough for further use?

  2. When should I force linear transformations of continuous variables using I(x) in the formula, and when not?

  3. How does multiple imputation work? Would I create n imputed data frames and perform the subsequent analysis n times?

UPDATE

> (fmla <- as.formula(paste(" ~ ", paste(var.names, collapse=" +"))))
~x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + 
    x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + x22 + 
    x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + x32 + 
    x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + x42 + 
    x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 + x52 + 
    x53 + x54 + x55 + x56 + x57 + x58
> g <- aregImpute(formula = fmla, n.impute=5, data=d)
fewer than 3 unique knots.  Frequency table of variable:
x
  1   2   3   4 
641  51  10  11 
Error in rcspline.eval(z, knots = parms, nk = nk, inclx = TRUE) : 
In addition: Warning message:
In rcspline.eval(z, knots = parms, nk = nk, inclx = TRUE) :
  3 knots requested with 4 unique values of x.  knots set to 2 interior values.

aregImpute gives me an error that I cannot resolve. I cannot find the printed frequency table, with exactly these counts, for any of my variables. What could be the problem?
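One possible workaround, sketched under the assumption that the error comes from the default nk = 3 spline knots being applied to a variable with only 4 unique values: setting nk = 0 restricts all continuous variables to linear transformations, so rcspline.eval is never called.

```r
## Sketch (assumption, not a confirmed fix): nk = 0 forces linearity
## for all continuous variables, avoiding the spline-knot computation
g <- aregImpute(formula = fmla, n.impute = 5, nk = 0, data = d)
```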

Best Answer

You can't really use $R^2$ in the manner you suggest. If $R^2$ is 0 you can still use multiple imputation; it's just not better than a random guess in that case.

Force linearity of all the variables if you have a very small sample size relative to the number of variables.
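A minimal sketch of forcing linearity, using hypothetical variables x1 through x3 (per the Hmisc documentation, wrapping a variable in I() restricts it to a linear transformation):

```r
## Sketch: x1 and x2 enter linearly via I(); x3 may still be
## nonlinearly transformed by aregImpute's default spline expansion
g <- aregImpute(~ I(x1) + I(x2) + x3, n.impute = 5, data = d)
```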

aregImpute stores all the multiple imputations; the fit.mult.impute function then retrieves them, one imputation at a time, to fill in the original dataset.
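A hedged sketch of that workflow for a Cox model, assuming the aregImpute object g and data frame d from the question; the outcome variables time and status are hypothetical placeholders:

```r
library(rms)  # loads Hmisc; provides cph and fit.mult.impute

## fit.mult.impute refits the model on each of the 5 completed datasets
## stored in g and pools coefficients and variances across imputations
f <- fit.mult.impute(Surv(time, status) ~ x1 + x2 + x3,
                     fitter = cph, xtrans = g, data = d)
summary(f)
```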
