NA Handling: You can control how glm() handles missing data. glm() has an argument na.action which indicates which of the following generic functions should be used to handle NAs in the data:

- na.omit and na.exclude: observations are removed if they contain any missing values; if na.exclude is used, some functions (such as residuals() and predict()) will pad their output to the correct length by inserting NAs for the omitted cases.
- na.pass: keep all the data, including NAs.
- na.fail: return the object only if it contains no missing values.
If you don't set na.action, glm() will check R's global options to see if a default is set there. You can inspect your setting with getOption("na.action") or options("na.action"), and you can set it with, for example, options(na.action = "na.omit").
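A quick sketch of inspecting and changing the session-wide default (the values shown assume a fresh R session, where the default is na.omit):

```r
getOption("na.action")            # usually "na.omit" in a fresh R session
options(na.action = "na.exclude") # change the session-wide default
getOption("na.action")            # now "na.exclude"
options(na.action = "na.omit")    # restore the usual default
```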
However, from the R output you provide in example 1, it seems that you are setting na.action = na.omit. So, yes, in that instance at least, you are removing all cases/rows with NAs before fitting. Moreover, I am pretty sure na.action = na.pass would cause glm() to fail when the data contain NAs (try it).
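To see the difference between na.omit and na.exclude concretely, here is a small sketch with made-up data: both drop the incomplete row before fitting, but na.exclude pads the residuals back to the original length.

```r
# Made-up data with one missing predictor value
set.seed(1)
d <- data.frame(y = rnorm(10), x = rnorm(10))
d$x[3] <- NA

fit_omit    <- glm(y ~ x, data = d, na.action = na.omit)
fit_exclude <- glm(y ~ x, data = d, na.action = na.exclude)

# Identical coefficients: both fit on the 9 complete cases
coef(fit_omit)
coef(fit_exclude)

# But na.exclude pads residuals/predictions to nrow(d) with NAs
length(residuals(fit_omit))     # 9
length(residuals(fit_exclude))  # 10, with NA in position 3
```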
Errors:
glm() uses an iterative procedure (iteratively reweighted least squares; IWLS) to find the maximum-likelihood estimates. You sometimes get errors because it will only run through a predefined number of iterations and, if it has not converged by then, it gives up. That number is controlled by the argument maxit, which by default is maxit = 25. You could try setting it higher, though of course this will take longer. (If you set trace = TRUE, it will show you the outcome of each iteration.)
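For example, using the built-in mtcars data (extra arguments like maxit and trace are passed on to glm.control()):

```r
# Raise the iteration cap and print the deviance at each IWLS step
fit <- glm(am ~ mpg, data = mtcars, family = binomial,
           maxit = 50, trace = TRUE)  # passed on to glm.control()
fit$converged  # TRUE if IWLS converged within maxit iterations
```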
Other sources of information: the help file for glm is accessible with ?glm or help(glm) and explains much of this. Two other useful resources are:
Much depends on the reasons why data are missing. There are 3 commonly cited missingness mechanisms, missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
MCAR means that missing values occur randomly in that variable without any dependence on any other variable, observed or not.
MAR means that missing values occur randomly in that variable but the probability of being missing depends on values of one or more other observed variables (which might include your outcome variable).
If missingness depends on unobserved variables then the data are missing not at random (MNAR).
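The three mechanisms can be illustrated with a small simulation sketch (made-up data; the variable names are my own):

```r
set.seed(42)
n <- 1000
z <- rnorm(n)  # an always-observed covariate
x <- rnorm(n)  # the variable that will receive missing values

# MCAR: every value of x has the same 30% chance of being missing
x_mcar <- ifelse(runif(n) < 0.3, NA, x)
# MAR: the chance that x is missing depends on the observed z
x_mar  <- ifelse(runif(n) < plogis(z), NA, x)
# MNAR: the chance that x is missing depends on x itself
x_mnar <- ifelse(runif(n) < plogis(x), NA, x)

c(MCAR = mean(is.na(x_mcar)),
  MAR  = mean(is.na(x_mar)),
  MNAR = mean(is.na(x_mnar)))
```

Note that MAR and MNAR produce similar-looking data; the difference lies in whether the cause of missingness was observed, which is why the mechanism usually cannot be verified from the data alone.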
Deleting observations with missing data (known as complete-case analysis or listwise deletion) is a bad idea, at the very least because it discards information, which results in larger standard errors, wider confidence intervals and loss of power. Under MCAR the estimates will be unbiased, but under MAR they may be biased:
"Complete Case analysis confines attention to cases where all variables are present. Advantages of this approach are .... Disadvantages stem from the potential loss of information in discarding incomplete cases. This loss of information has two aspects: loss of precision, and bias when the missing-data mechanism is not MCAR, and the complete cases are not a random sample of all the cases."
From: Little RJA, Rubin DB. Statistical Analysis with Missing Data, Second Edition. John Wiley and Sons, 2002, p. 41:
http://dx.doi.org/10.1002/9781119013563
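A small simulation sketch of this (made-up data; the true slope is 2). Here missingness in the covariate depends on the outcome, so the complete cases are not a random sample and the complete-case estimate is typically attenuated:

```r
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Higher y -> higher chance that x is missing
x_obs <- ifelse(runif(n) < plogis(y), NA, x)

coef(lm(y ~ x))[["x"]]          # close to 2 with the full data
coef(lm(y ~ x_obs))[["x_obs"]]  # complete-case estimate (lm drops NA rows)
sum(is.na(x_obs))               # rows that listwise deletion discards
```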
Creating a factor/indicator/dummy variable for missingness is also a biased method; for example, see:
White IR, Carlin JB. Bias and efficiency of multiple imputation compared with
complete-case analysis for missing covariate values. Stat Med 2010;29:2920-31.
http://dx.doi.org/10.1002/sim.3944
Jones MP. Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc 1996;91:222-30.
http://dx.doi.org/10.1080/01621459.1996.10476680
If the data are plausibly MAR or MCAR then multiple imputation will yield unbiased estimates when applied correctly and standard errors will be smaller than with complete-case analysis. If missingness depends on unobserved variables then the data are missing not at random (MNAR) and this is much more difficult to handle.
Multiple imputation works by filling in missing values with plausible values drawn from a model. This is done multiple times, and each time the imputed values differ, to allow for uncertainty. The analysis model is then run on each imputed dataset and the results are pooled. In essence, the method works because, while it is possible to estimate the most likely values for the missing data, those single most likely values are unlikely to be exactly the correct ones: there is inherent uncertainty. The variability in the imputed values between the completed datasets provides the uncertainty needed to reflect the uncertainty created by the missing values.
mice is an excellent R package which implements multiple imputation:
https://www.jstatsoft.org/index.php/jss/article/view/v045i03/v45i03.pdf
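A minimal sketch of the impute/analyse/pool workflow with mice, using the small nhanes example dataset that ships with the package (this assumes mice is installed, e.g. via install.packages("mice")):

```r
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  # nhanes contains NAs; create 5 imputed datasets
  imp  <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
  # Fit the analysis model in each imputed dataset
  fits <- with(imp, lm(bmi ~ age + chl))
  # Pool the results across datasets using Rubin's rules
  pooled <- pool(fits)
  print(summary(pooled))
}
```

The pooled standard errors combine the within-imputation variance with the between-imputation variance, which is how the method reflects the uncertainty due to the missing data.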
Update:
Examples of how to handle missing values in R using multiple imputation and the mice package:
https://uvastatlab.github.io/2019/05/01/getting-started-with-multiple-imputation-in-r/
Best Answer
If you go down the multiple imputation route, the mice package in R is useful, and there is a good tutorial for it here.