Solved – Missing values in GLM

data mining, generalized linear model, missing data, r

I am fitting a glm to data in which most of the values are NA or blank. For example, in the example data below (four predictors and one response variable), the default glm command removes the 10 rows that contain an NA, leaving just one row for analysis. This is a serious problem: some of my data sets that initially had 100,000 rows (with ~50 features) are cut down to 200 rows (with ~15 features) or even fewer, which reduces the power significantly.

My question is:
What options do I have in this scenario? I do not want to fill the NAs with average values or anything based on variance or a measure of central tendency, because it may be that the TSH test is ordered only for patients with a history of thyroid disease. In that case any extrapolation would be disastrous, as the mean would not be representative of normal values.

gender  TSH PH  HDLC_hole   response
m   2   NA  36  TRUE
f   1.8 4   32  TRUE
m   NA  NA  29  TRUE
f       NA  NA  33  TRUE
m   2.2 5   NA  TRUE
f   2.5 4   NA  TRUE
NA  1.8 4   34  FALSE
m   NA  4   35  FALSE
f   3   NA  36  FALSE
m   1.2 4   NA  FALSE
m   1   NA  28  TRUE
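
To make the row loss concrete, here is a minimal base R sketch that enters the example data and counts what survives the default NA handling. The fourth row is ambiguous in the table above (a blank cell plus two NAs), so it is entered with TSH and PH missing; that reading is an assumption. The object names `dat`, `mf`, `n_complete` and `miss_frac` are illustrative only.

```r
# Example data from the question. Row 4 is ambiguous in the original table;
# it is entered here with TSH and PH missing (an assumption).
dat <- data.frame(
  gender    = c("m", "f", "m", "f", "m", "f", NA, "m", "f", "m", "m"),
  TSH       = c(2, 1.8, NA, NA, 2.2, 2.5, 1.8, NA, 3, 1.2, 1),
  PH        = c(NA, 4, NA, NA, 5, 4, 4, 4, NA, 4, NA),
  HDLC_hole = c(36, 32, 29, 33, NA, NA, 34, 35, 36, NA, 28),
  response  = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
                FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Rows with no missing value in any column
n_complete <- sum(complete.cases(dat))

# The model frame glm() would build with na.omit (the usual default na.action)
mf <- model.frame(response ~ gender + TSH + PH + HDLC_hole,
                  data = dat, na.action = na.omit)

# Fraction of missing values per column, to see which variables drive the loss
miss_frac <- colMeans(is.na(dat))

print(n_complete)
print(nrow(mf))
print(round(miss_frac, 2))
```

Checking `miss_frac` on the real data before modelling shows whether a few sparsely measured variables are responsible for most of the dropped rows.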

Edit:

What I understood from a quick review of some of the papers on 'Multiple imputation' techniques is that they fill-in values based on mean/variance or similar statistics.

Should I change my method and explore others? For such cases, which prediction algorithms may be appropriate?

Best Answer

If you go down the multiple imputation route, the MICE package in R is useful, and there is a good tutorial for it here.
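
As a rough sketch of what that workflow can look like, the code below imputes a simulated data set with `mice()`, fits the glm on each completed data set with `with()`, and combines the fits with `pool()`. The data are simulated purely for illustration (the column names are borrowed from the question, the values are not real), and the values are removed completely at random, which is an assumption that would not hold if TSH is only ordered for certain patients. Note that the default method for numeric columns is predictive mean matching, which draws from observed values conditional on the other variables rather than filling in a single overall mean.

```r
library(mice)

set.seed(1)
n <- 200

# Simulated illustration data, not the asker's data
sim <- data.frame(
  gender    = factor(sample(c("m", "f"), n, replace = TRUE)),
  TSH       = rnorm(n, mean = 2, sd = 0.5),
  PH        = rnorm(n, mean = 4, sd = 0.5),
  HDLC_hole = rnorm(n, mean = 33, sd = 3)
)
sim$response <- rbinom(n, 1, plogis(-2 + sim$TSH))

# Remove values completely at random (an assumption made for this sketch)
sim$TSH[sample(n, 60)] <- NA
sim$PH[sample(n, 60)] <- NA
sim$HDLC_hole[sample(n, 40)] <- NA

# Create m = 5 imputed data sets using the default methods
imp <- mice(sim, m = 5, seed = 1, printFlag = FALSE)

# Fit the same logistic regression on each imputed data set
fits <- with(imp, glm(response ~ gender + TSH + PH + HDLC_hole,
                      family = binomial))

# Combine the estimates across imputations (Rubin's rules)
pooled <- pool(fits)
print(summary(pooled))

# One completed data set, for inspection
first_completed <- mice::complete(imp, 1)
```

All `n` rows contribute to the pooled fit, instead of only the complete cases. Whether the result is trustworthy depends on how plausible the missing-at-random assumption is for each variable.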
