With a multinomial logit model you impose the constraint that all the predicted probabilities add up to 1. When you use separate binary logit models you can no longer impose that constraint; after all, they are estimated in separate models. That is the main difference between these two approaches.
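One way to see this: with outcome 3 as the base category (as in the example below), the multinomial logit predicts
$$\Pr(y=j\mid x)=\frac{\exp(x'\beta_j)}{1+\exp(x'\beta_1)+\exp(x'\beta_2)},\quad j=1,2,\qquad \Pr(y=3\mid x)=\frac{1}{1+\exp(x'\beta_1)+\exp(x'\beta_2)},$$
so the three probabilities sum to 1 by construction. Each separate binary logit instead models $\Pr(y=j\mid y\in\{j,3\},x)=\exp(x'\gamma_j)/\{1+\exp(x'\gamma_j)\}$ on its own subsample, and nothing ties the $\gamma_j$ together, so the implied probabilities for the three outcomes need not add up to 1.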
As you can see in the example below (in Stata, as that is the program I know best), the models tend to be similar but not the same. I would be especially careful about extrapolating predicted probabilities.
// some data preparation
. sysuse nlsw88, clear
(NLSW, 1988 extract)
.
. gen byte occat = cond(occupation < 3 , 1, ///
> cond(inlist(occupation, 5, 6, 8, 13), 2, 3)) ///
> if !missing(occupation)
(9 missing values generated)
. label variable occat "occupation in categories"
. label define occat 1 "high" ///
> 2 "middle" ///
> 3 "low"
. label value occat occat
.
. gen byte middle = (occat == 2) if occat !=1 & !missing(occat)
(590 missing values generated)
. gen byte high = (occat == 1) if occat !=2 & !missing(occat)
(781 missing values generated)
// a multinomial logit model
. mlogit occat i.race i.collgrad , base(3) nolog
Multinomial logistic regression Number of obs = 2237
LR chi2(6) = 218.82
Prob > chi2 = 0.0000
Log likelihood = -2315.9312 Pseudo R2 = 0.0451
-------------------------------------------------------------------------------
occat | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
high |
race |
black | -.4005801 .1421777 -2.82 0.005 -.6792433 -.121917
other | .4588831 .4962591 0.92 0.355 -.5137668 1.431533
|
collgrad |
college grad | 1.495019 .1341625 11.14 0.000 1.232065 1.757972
_cons | -.7010308 .0705042 -9.94 0.000 -.8392165 -.5628451
--------------+----------------------------------------------------------------
middle |
race |
black | .6728568 .1106792 6.08 0.000 .4559296 .889784
other | .2678372 .509735 0.53 0.599 -.7312251 1.266899
|
collgrad |
college grad | .976244 .1334458 7.32 0.000 .714695 1.237793
_cons | -.517313 .0662238 -7.81 0.000 -.6471092 -.3875168
--------------+----------------------------------------------------------------
low | (base outcome)
-------------------------------------------------------------------------------
// separate logits:
. logit high i.race i.collgrad , nolog
Logistic regression Number of obs = 1465
LR chi2(3) = 154.21
Prob > chi2 = 0.0000
Log likelihood = -906.79453 Pseudo R2 = 0.0784
-------------------------------------------------------------------------------
high | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
race |
black | -.5309439 .1463507 -3.63 0.000 -.817786 -.2441017
other | .2670161 .5116686 0.52 0.602 -.735836 1.269868
|
collgrad |
college grad | 1.525834 .1347081 11.33 0.000 1.261811 1.789857
_cons | -.6808361 .0694323 -9.81 0.000 -.816921 -.5447512
-------------------------------------------------------------------------------
. logit middle i.race i.collgrad , nolog
Logistic regression Number of obs = 1656
LR chi2(3) = 90.13
Prob > chi2 = 0.0000
Log likelihood = -1098.9988 Pseudo R2 = 0.0394
-------------------------------------------------------------------------------
middle | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
race |
black | .6942945 .1114418 6.23 0.000 .4758725 .9127164
other | .3492788 .5125802 0.68 0.496 -.6553598 1.353918
|
collgrad |
college grad | .9979952 .1341664 7.44 0.000 .7350339 1.260957
_cons | -.5287625 .0669093 -7.90 0.000 -.6599023 -.3976226
-------------------------------------------------------------------------------
I used mice (multiple imputation by chained equations). It is fairly fast and quite simple: I ran it on about 3,000 observations with roughly 10 variables, and it was done in 10 minutes on an old computer. Further, I believe it is one of the best multiple-imputation packages out there.
It can use regression to impute, among other methods.
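For instance (a minimal sketch reusing the Dat1 and salary names from the code further down; "pmm" and "norm" are built-in mice methods, with "pmm" the default for numeric variables), you can pick a regression-based method per variable:
library(mice)
ini  <- mice(Dat1, maxit=0, printFlag=FALSE)  # dry run, just to collect the defaults
meth <- ini$meth                              # named vector: one imputation method per variable
meth["salary"] <- "norm"                      # Bayesian linear regression instead of the default "pmm"
imp <- mice(Dat1, m=5, maxit=10, meth=meth, printFlag=FALSE)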
You need to create a data frame with the variable you want to impute and every variable that might predict its values (so every variable in your model, plus possibly other variables as well). The mice package will impute every missing value in that data frame.
The simplest way of imputing. This gives you a data frame Datimp that contains the five imputed datasets plus the original data.
library(mice)
#m=5 number of multiple imputations
#maxit=10 number of iterations. 10-20 is sufficient.
imp <- mice(Dat1, m=5, maxit=10, printFlag=TRUE)
Datimp <- complete(imp, "long", include=TRUE)
write.table(Datimp, "C:/.../impute1.txt",
sep="\t", dec=",", row.names=FALSE)
A better way to do this is:
library(mice)
Dat1 <- subset(Dat, select=c(id, faculty, gender, age, job, salary)) #create subset
#of variables you would like to either impute or use as predictors for imputation.
ini <- mice(Dat1, maxit=0, pri=F)
pred <- ini$pred
pred[,c("id", "faculty")] <- 0 #variables you do not want to use as predictors (but
#want to have in the dataset, can't add them later.
meth <- ini$meth
meth[c("id", "faculty", "gender", "age", "job")] <- "" #choose a prediction method
#for imputing your variables. Here I don't want these variables to be imputed, so I
#choose "" (empty, no mehod).
imp <- mice(Dat1, m=5, maxit=10, printFlag=TRUE, pred=pred, meth=meth, seed=2345)
Datimp <- complete(imp, "long", include=TRUE)
write.table(Datimp, "C:/.../impute1.txt",
sep="\t", dec=",", row.names=FALSE)
See if your imputations were any good:
library(lattice)
com <- complete(imp, "long", inc=T)
col <- rep(c("blue","red")[1+as.numeric(is.na(imp$salary))],6)
stripplot(salary~.imp, data=com, jit=TRUE, fac=0.8, col=col, pch=20,
xlab="Imputation number",cex=0.25)
densityplot(salary~.imp, data=com, jit=TRUE, fac=0.8, col=col, pch=20,
xlab="Imputation number",cex=0.25)
long <- complete(imp,"long")
levels(long$.imp) <- paste("Imputation",1:5) # one label per imputation (m=5)
long <- cbind(long, salary.na=is.na(imp$data$salary))
densityplot(~salary|.imp, data=long, group=salary.na, plot.points=FALSE, ref=TRUE,
xlab="Salary",scales=list(y=list(draw=F)),
par.settings=simpleTheme(col.line=rep(c("blue","red"))), auto.key =
list(columns=2,text=c("Observed","Imputed")))
Finally, and importantly: you can't just save your new dataset and use the imputed values as if they were normal observed values. You fit your model on each imputed dataset and pool the results (pooled regression, pooled lmer, and so on), so that the uncertainty of the imputed values is taken into account.
fit1 <- with(imp, lm(salary ~ gender, na.action=na.omit))
summary(est <- pool(fit1))
pool.r.squared(fit1,adjusted=FALSE)
You could also try the following package: DMwR.
It failed in the case of 3 nearest neighbours (k = 3), giving 'Error in knnImputation(x, k = 3) : Not sufficient complete cases for computing neighbors.'
However, trying k = 2 works; see the sketch below.
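Since I don't have your data, here is a minimal made-up example. The raw values are invented, and the NAs are injected with a simple thresholding rule (x1[x1 > 1] <- NA) purely so that the fixes further down can be shown on the same data; only the first two rows stay complete, which reproduces the failure for k = 3 and the success for k = 2:
library(DMwR)
raw <- data.frame(a = c(0.5, 0.9, 1.5, 0.4, 0.3, 1.2, 2.5, 0.7),
                  b = c(0.6, 0.4, 0.5, 1.8, 0.4, 0.7, 0.6, 2.8),
                  c = c(0.7, 0.3, 0.6, 0.5, 1.6, 0.8, 1.4, 0.9),
                  d = c(0.8, 0.2, 0.7, 0.6, 0.9, 1.9, 0.5, 1.7))
x1 <- raw
x1[x1 > 1] <- NA         # values above the threshold become NA; only rows 1 and 2 stay complete
knnImputation(x1, k = 3) # Error: Not sufficient complete cases for computing neighbors.
knnImputation(x1, k = 2) # works: each NA is filled in from the 2 nearest complete rows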
You can test whether you have enough fully observed rows with complete.cases(x): the number of complete cases must be at least k.
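With the toy x1 above:
sum(complete.cases(x1))        # 2 fully observed rows
sum(complete.cases(x1)) >= 3   # FALSE, which is exactly why k = 3 fails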
One way to overcome this problem is to relax your requirements (i.e. end up with fewer incomplete rows), either by 1) increasing the NA threshold or, alternatively, 2) increasing your number of observations.
Here is the first approach:
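The original example isn't shown here, so this is only a hedged reconstruction using the toy raw data from above: raising the threshold in the NA-generating rule from 1 to 2 turns fewer values into NAs, so more rows stay complete and k = 3 becomes feasible.
x2 <- raw
x2[x2 > 2] <- NA         # higher threshold: far fewer values become NA
sum(complete.cases(x2))  # 6 complete rows now
knnImputation(x2, k = 3) # enough complete cases for k = 3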
And here is an example of the second:
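Again only a hedged reconstruction on the toy data: keep the strict threshold of 1, but add extra (fully observed) rows, so the count of complete rows grows past k.
extra <- data.frame(a = c(0.2, 0.5, 0.8, 0.3),
                    b = c(0.6, 0.1, 0.4, 0.9),
                    c = c(0.3, 0.7, 0.2, 0.5),
                    d = c(0.8, 0.4, 0.6, 0.1))
x3 <- rbind(raw, extra)
x3[x3 > 1] <- NA         # same rule as before, now on 12 rows instead of 8
sum(complete.cases(x3))  # 6 complete rows: the original 2 plus the 4 new ones
knnImputation(x3, k = 3)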
At least k = 3 complete rows are now available, so it is able to impute with k = 3.