Solved – Nested logistic regression in R

logistic, missing data, nested data, prediction, regression

I am trying to find a way to do nested logistic regression in R that fits my needs. I have a very large data set with almost 200 variables available. I have found my "best" model, and it contains 12 of these variables. I am aware of the problems of using 12 variables in logistic regression and don't want to cover those here. I am using the model to predict the probability that a student will drop out of high school. The dataset containing the students I want predictions for has some missing data. For example, say my model is:

Dropout ~ AttendanceRate + Enrollments + AgeDiff + Race + Absences

Now say the number of Absences is not known for a particular student. I am using the predict function to get the dropout probabilities from my model. If any of the variables is missing (NA), the predicted probability is also NA. To fix this problem I want to use nested logistic regression. So for a student with a missing value in "Absences" the model would be:

Dropout ~ AttendanceRate + Enrollments + AgeDiff + Race
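A minimal sketch of this situation in base R, using simulated data. The data frame `students`, its values, and the cut-down two-predictor formula are invented for illustration and are not the real model. It shows predict returning NA from the full model when Absences is missing, while a reduced model fitted without Absences still returns a probability.

```r
# Simulated stand-in for the real data (all values are hypothetical).
set.seed(42)
n <- 500
students <- data.frame(
  AttendanceRate = runif(n, 0.5, 1),
  Absences       = rpois(n, 5)
)
students$Dropout <- rbinom(
  n, 1,
  plogis(1 - 4 * students$AttendanceRate + 0.2 * students$Absences)
)

# Full model and the reduced model that omits Absences.
full    <- glm(Dropout ~ AttendanceRate + Absences,
               family = binomial, data = students)
reduced <- glm(Dropout ~ AttendanceRate,
               family = binomial, data = students)

# A student whose Absences value is unknown.
new_student <- data.frame(AttendanceRate = 0.8, Absences = NA_real_)

# The full model cannot produce a prediction for this student.
p_full <- predict(full, newdata = new_student, type = "response")

# The reduced model ignores Absences, so it can.
p_reduced <- predict(reduced, newdata = new_student, type = "response")
```

With 12 predictors there are many possible missingness patterns, so this approach means refitting one reduced model per pattern that actually occurs in the prediction data.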

I am currently using the glm function to do my logistic regression. I understand it cannot do nested logistic regression. So how do I do nested logistic regression from my full model so that I can predict a student's dropout probability no matter which variable or variables may be missing? I realize that using nested models will weaken my model's predictive power, but I am willing to accept that for my purposes.

Any suggestions?

Best Answer

I suggest looking into multiple imputation for the missing data. Like other methods, this relies on the values being missing at random, but it may perform well even if they are missing not at random.

In R, one package that does this is mice.
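As a rough sketch of how this could look, the code below imputes a missing Absences value with mice and averages the predicted probabilities over the imputed datasets. It makes several assumptions that are not from the question: the data are simulated, the model is cut down to two predictors, the training data are complete, and only the predictors are passed to the imputation (training and new students stacked together). Treat it as one possible setup, not a recommended workflow.

```r
library(mice)

# Simulated, complete training data (all values are hypothetical).
set.seed(1)
n <- 200
train <- data.frame(
  AttendanceRate = runif(n, 0.5, 1),
  Absences       = rpois(n, 5)
)
train$Dropout <- rbinom(
  n, 1,
  plogis(2 - 4 * train$AttendanceRate + 0.2 * train$Absences)
)

# Fit the full model once on the complete training data.
fit <- glm(Dropout ~ AttendanceRate + Absences,
           family = binomial, data = train)

# New students to score; the first one has an unknown Absences value.
new_students <- data.frame(AttendanceRate = c(0.9, 0.6),
                           Absences       = c(NA, 3))

# Stack the predictors so the imputation can borrow from the training rows.
predictors <- c("AttendanceRate", "Absences")
stacked <- rbind(train[, predictors], new_students[, predictors])
is_new  <- c(rep(FALSE, n), rep(TRUE, nrow(new_students)))

# Create m = 5 imputed versions of the stacked predictors.
imp <- mice(stacked, m = 5, printFlag = FALSE, seed = 1)

# Predict for the new students in each completed dataset
# (one column of probabilities per imputation).
probs <- sapply(seq_len(imp$m), function(i) {
  completed <- complete(imp, i)
  predict(fit, newdata = completed[is_new, ], type = "response")
})

# Average over imputations to get one probability per new student.
p_hat <- rowMeans(probs)
```

If the training data also have missing values, the usual mice workflow is to fit the model within each imputed dataset via with() and combine the fits via pool(), in which case the outcome would normally be included in the imputation model.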
