Solved – should I use na.omit or na.exclude in a linear model (in R)

linear model, missing data, r, regression, residuals

I am trying to understand the difference between the na.action options na.omit and na.exclude for handling missing data in a linear model in R. I used the lm function (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html).

The summary of my linear model prints exactly the same output for na.omit and na.exclude. But when I look at the residuals, na.exclude gives me NAs for the cases where there was an NA, while na.omit gives me a vector of residuals without NAs.

residuals.lm(lm_1_exclude)

           1            2            3            4            5            6            7            8
-0.052657302 -0.093045084 -0.329509087           NA -0.152040821 -0.321757328 -0.322085368  0.072296134

residuals.lm(lm_1_omit)

           1            2            3            5            6            7            8
-0.052657302 -0.093045084 -0.329509087 -0.152040821 -0.321757328 -0.322085368  0.072296134

I would now like to understand which option should be preferred in a linear model and how the choice affects my statistics. Unfortunately, I could only find tutorials explaining that these options exist, but no advice on which to choose, e.g. https://stats.idre.ucla.edu/r/faq/how-does-r-handle-missing-values/

Best Answer

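The only benefit of na.exclude over na.omit is that the former keeps the original number of rows in the output: residuals, fitted values and predictions are padded with NA at the positions of the incomplete cases. Both options drop those cases before the model is fitted, so the coefficients, standard errors and the rest of the summary are identical, which is why your two summaries match. The padding is useful when you need results that line up with the original dataset, for example to compare predicted values to observed values row by row. With na.omit you end up with fewer rows, so that comparison takes extra bookkeeping.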
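To make the difference concrete, here is a minimal sketch on a made-up data frame (the data, the column names x and y, and the object names fit_omit and fit_exclude are all hypothetical, not taken from the question). It only uses lm, coef, residuals and fitted from base R's stats package.

```r
# Hypothetical toy data: 8 rows, with the response missing in row 4.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(1.1, 1.9, 3.2, NA, 5.1, 5.8, 7.2, 7.9)
)

fit_omit    <- lm(y ~ x, data = d, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

# Same fit: both drop row 4 before estimation.
same_coef <- isTRUE(all.equal(coef(fit_omit), coef(fit_exclude)))
print(same_coef)

# Different lengths from the extractor functions.
n_omit    <- length(residuals(fit_omit))     # 7: row 4 is simply absent
n_exclude <- length(residuals(fit_exclude))  # 8: NA at position 4
print(c(n_omit, n_exclude))

# With na.exclude the results can be attached straight back to the data.
d$fitted <- fitted(fit_exclude)
d$resid  <- residuals(fit_exclude)
print(d)
```

With the na.omit fit, the same two assignments would fail because the vectors have 7 elements while d has 8 rows. Note that the padding is applied by the extractor functions (residuals, fitted, and predict without new data), so it is safer to use those than to read components such as fit_exclude$residuals directly.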
