NA Values in GLM – How Are ‘NA’ Values Treated in GLM in R

generalized linear model, missing data, r

I have a data table T1 that contains nearly a thousand variables (V1) and around 200 million data points. The data is sparse and most of the entries are NA. Each data point has a unique id and date pair that distinguishes it from the others.

I have another table T2, which contains a separate set of variables (V2). This table also has an id and date pair that uniquely identifies each entry in T2.

We suspect that the data in T1 can be used to predict values of variables in T2.

To test this, I thought of fitting a 'glm' model in R to check whether any variable in T2 really depends on variables in T1.

For each variable in T2, I pulled out all the data in T1 with the same id and date pair, which resulted in a much smaller set of ~50K data points for some of the test variables.

The problems I am now facing when applying glm are as follows.

  1. In some cases, it shows an error 'fit not found' and a warning 'glm.fit: algorithm did not converge'. I am not sure why this happens.

  2. How are NAs treated in glm? Does it remove all records involving 'NA' first and then perform the fitting?

  3. Is it a good strategy to remove all NAs first and then call 'glm'? I fear that this may reduce the number of data points significantly, as most of them are NAs.

  4. Which method is used to calculate the coefficients? I could not find any website, paper, or book that discusses how the output is calculated.

I tested glm with and without NAs and got different answers, which suggests that NAs are taken into account when fitting the data:

Example 1:

> tmpData
  x1 x2 x3        Y
1  1  1  1        3
2  1  0  4        5
3  1  2  3        6
4  0  3  1        4

Call:  glm(formula = as.formula(paste(dep, " ~ ", paste(xn, collapse = "+"))), 
    na.action = na.exclude)

Coefficients:
                      (Intercept)  as.numeric(unlist(tmpData["x1"]))  as.numeric(unlist(tmpData["x2"]))  
                        5.551e-16                          1.000e+00                          1.000e+00  
as.numeric(unlist(tmpData["x3"]))  
                        1.000e+00  

Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
Null Deviance:      5 
Residual Deviance: 9.861e-31    AIC: -260.6 

Example 2:

    x1  x2     x3  Y
100000   1     NA  2
     1   1      1  3
     1  NA  -1124  2
     1   0      4  5
     1   2      3  6
     0   3      1  4



Coefficients:
                      (Intercept)  as.numeric(unlist(tmpData["x1"]))  as.numeric(unlist(tmpData["x2"]))  as.numeric(unlist(tmpData["x3"]))  
                       -2.3749044                         -0.0000625                          0.6249899                          1.8749937  

Degrees of Freedom: 5 Total (i.e. Null);  2 Residual
Null Deviance:      13.33 
Residual Deviance: 1.875    AIC: 20.05 

Best Answer

NA Handling: You can control how glm handles missing data. glm() has an argument na.action which indicates which of the following generic functions should be used by glm to handle NA in the data:

  • na.omit and na.exclude: observations are removed if they contain any missing values; if na.exclude is used some functions will pad residuals and predictions to the correct length by inserting NAs for omitted cases.
  • na.pass: keep all data, including NAs
  • na.fail: returns the object only if it contains no missing values; otherwise it signals an error

If you don't set na.action, glm() will check R's global options to see if a default is set there. You can access your options with getOption("na.action") or options("na.action"), and you can set it with, for example, options(na.action = "na.omit"). However, from the R output you provide in example 1, it seems that you are setting na.action = na.exclude. So, yes, in that instance at least, you are removing all cases/rows with NAs before fitting. Moreover, I am pretty sure na.action = na.pass would cause glm() to fail when the data have NAs (try it).
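To make the na.action options concrete, here is a small sketch on made-up data (the data frame d and its values are invented for illustration, not taken from the question). It checks that na.omit and na.exclude fit the same complete rows, that only na.exclude pads the residuals, and that na.fail and na.pass both stop with an error when NAs are present.

```r
# Toy data (made up for illustration): 6 rows, 2 of which contain an NA
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, NA, 1, 5, 3, 4),
  y  = c(3.1, 4.0, NA, 9.2, 7.9, 10.1)
)

fit_omit    <- glm(y ~ x1 + x2, data = d, na.action = na.omit)
fit_exclude <- glm(y ~ x1 + x2, data = d, na.action = na.exclude)

# Both fits use only the 4 complete rows, so the coefficients agree
n_used    <- nobs(fit_omit)
same_coef <- isTRUE(all.equal(coef(fit_omit), coef(fit_exclude)))

# na.omit returns one residual per row used;
# na.exclude pads back to the original 6 rows, with NA for dropped rows
len_omit    <- length(residuals(fit_omit))
len_exclude <- length(residuals(fit_exclude))

# na.fail and na.pass: capture the error instead of stopping the script
res_fail <- tryCatch(glm(y ~ x1 + x2, data = d, na.action = na.fail),
                     error = function(e) e)
res_pass <- tryCatch(glm(y ~ x1 + x2, data = d, na.action = na.pass),
                     error = function(e) e)

print(c(n_used = n_used, len_omit = len_omit, len_exclude = len_exclude))
print(inherits(res_fail, "error"))
print(inherits(res_pass, "error"))
```

Note that a row is dropped if any variable in the formula is NA for that row, so with many sparse predictors in one model the number of complete rows can shrink very quickly.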

Errors: glm() uses an iterative procedure (iteratively reweighted least squares; IWLS) to compute maximum-likelihood estimates. You sometimes get the non-convergence warning because it will only go through a predefined number of iterations and, if the estimates have not converged by then, it gives up. This number is controlled by the argument maxit, which by default is maxit = 25. You could try setting it higher, though of course this will take longer. (If you set trace = TRUE it will show you the outcome of each iteration.)
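As a rough illustration of the iteration cap and of what IWLS does, the sketch below fits a logistic regression on the built-in mtcars data (chosen only for convenience, it has nothing to do with the question's tables). The hand-written loop is a simplified version for the logit link, written under the assumption that plain Newton-style updates from a zero start are adequate for this data. It is not the code glm.fit actually runs, which uses a QR decomposition, step-halving, and a deviance-based convergence test.

```r
# Logistic regression on built-in data (illustrative choice only)
f <- am ~ wt

# Too few iterations: glm() gives up early and warns that it did not converge
fit_short <- suppressWarnings(
  glm(f, family = binomial, data = mtcars, control = glm.control(maxit = 2))
)

# A higher cap than the default of 25; set trace = TRUE to print each iteration
fit_long <- glm(f, family = binomial, data = mtcars,
                control = glm.control(maxit = 50, trace = FALSE))

# Simplified hand-written IWLS for the logit link (a sketch, not glm.fit itself)
X <- model.matrix(f, data = mtcars)
y <- mtcars$am
beta <- rep(0, ncol(X))                 # assumed starting values
for (i in 1:50) {
  eta <- drop(X %*% beta)               # linear predictor
  mu  <- plogis(eta)                    # fitted probabilities
  w   <- mu * (1 - mu)                  # working weights
  z   <- eta + (y - mu) / w             # working response
  # Weighted least-squares step: solve (X' W X) beta = X' W z
  beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  done <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (done) break
}

print(c(short = fit_short$converged, long = fit_long$converged))
print(rbind(glm = unname(coef(fit_long)), manual = unname(beta)))
```

The converged component of the fitted object is a convenient way to check for non-convergence programmatically when fitting many models in a loop, rather than relying on spotting the warning.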

Other sources of information: The helpfile for glm is accessible with ?glm or help(glm) and explains much of this. Two other useful resources are: