NA Values in GLM – How Are ‘NA’ Values Treated in GLM in R

generalized linear model, missing data, r

I have a data table T1 that contains nearly a thousand variables (V1) and around 200 million data points. The data is sparse and most of the entries are NA. Each data point has a unique id and date pair that distinguishes it from the others.

I have another table T2, which contains a separate set of variables (V2). This table also has an id and date pair that uniquely identifies each entry in T2.

We suspect that the data in T1 can be used to predict values of variables in T2.

To test this, I planned to fit a 'glm' model in R and check whether any variable in T2 really depends on the variables in T1.

For each variable in T2, I pulled all rows in T1 with the same id and date pair, which gave a much smaller set of about 50K data points for some of the test variables.

The problems I am facing when applying glm are as follows.

In some cases it gives me the error 'fit not found' and the warning 'glm.fit: algorithm did not converge'. Why does this happen?
How are NAs treated in glm? Does it remove all records containing 'NA' first and then perform the fit?
Is it a good strategy to remove all NAs first and then call 'glm'? I fear this may reduce the number of data points significantly, as most of the entries are NA.
Which method is used to calculate the coefficients? I could not find any website, paper or book that discusses how the output is calculated.

I tested glm with and without NAs and got different answers, which suggests that NAs are taken into account when fitting the data:

Example 1:

> tmpData
  x1 x2 x3        Y
1  1  1  1        3
2  1  0  4        5
3  1  2  3        6
4  0  3  1        4

Call:  glm(formula = as.formula(paste(dep, " ~ ", paste(xn, collapse = "+"))), 
    na.action = na.exclude)

Coefficients:
                      (Intercept)  as.numeric(unlist(tmpData["x1"]))  as.numeric(unlist(tmpData["x2"]))  
                        5.551e-16                          1.000e+00                          1.000e+00  
as.numeric(unlist(tmpData["x3"]))  
                        1.000e+00  

Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
Null Deviance:      5 
Residual Deviance: 9.861e-31    AIC: -260.6

Example 2:

    x1  x2     x3  Y
100000   1     NA  2
     1   1      1  3
     1  NA  -1124  2
     1   0      4  5
     1   2      3  6
     0   3      1  4



Coefficients:
                      (Intercept)  as.numeric(unlist(tmpData["x1"]))  as.numeric(unlist(tmpData["x2"]))  as.numeric(unlist(tmpData["x3"]))  
                       -2.3749044                         -0.0000625                          0.6249899                          1.8749937  

Degrees of Freedom: 5 Total (i.e. Null);  2 Residual
Null Deviance:      13.33 
Residual Deviance: 1.875    AIC: 20.05

Best Answer

NA Handling: You can control how glm handles missing data. glm() has an argument na.action which indicates which of the following generic functions should be used by glm to handle NA in the data:

na.omit and na.exclude: observations are removed if they contain any missing values; if na.exclude is used some functions will pad residuals and predictions to the correct length by inserting NAs for omitted cases.
na.pass: keep all data, including NAs
na.fail: returns the object only if it contains no missing values

If you don't set na.action, glm() will check R's global options to see if a default is set there. You can access your options with getOption("na.action") or options("na.action") and you can set it with, for example, options(na.action = "na.omit"). However, from the R output you provide in example 1, it seems that you are setting na.action = na.exclude. So, yes, in that instance at least, you are removing all cases/rows with NAs before fitting. Moreover, I am pretty sure na.action = na.pass would cause glm() to fail when the data have NAs (try it).
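A minimal sketch of these options, using the six rows shown in Example 2 of the question. The object names (dat, fit_omit, fit_exclude, fit_pass) are illustrative only, and the default gaussian family is assumed. The na.pass call is wrapped in tryCatch() so that whatever happens is captured rather than stopping the script.

```r
# Six rows taken from Example 2; two of them contain an NA
dat <- data.frame(
  x1 = c(100000, 1, 1, 1, 1, 0),
  x2 = c(1, 1, NA, 0, 2, 3),
  x3 = c(NA, 1, -1124, 4, 3, 1),
  Y  = c(2, 3, 2, 5, 6, 4)
)

# Both settings drop the incomplete rows before fitting
fit_omit    <- glm(Y ~ x1 + x2 + x3, data = dat, na.action = na.omit)
fit_exclude <- glm(Y ~ x1 + x2 + x3, data = dat, na.action = na.exclude)

nobs(fit_omit)                  # 4: only the complete rows are used
length(residuals(fit_omit))     # 4: one residual per row used
length(residuals(fit_exclude))  # 6: padded with NA for the dropped rows
all.equal(coef(fit_omit), coef(fit_exclude))  # same fit either way

# na.pass hands the NAs to the fitting routine; capture the outcome
fit_pass <- tryCatch(
  glm(Y ~ x1 + x2 + x3, data = dat, na.action = na.pass),
  error = function(e) conditionMessage(e)
)
fit_pass
```

The only practical difference between na.omit and na.exclude here is the length of residuals() and fitted(), which matters when you want to attach them back to the original rows.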

Errors: glm() uses an iterative procedure (iteratively reweighted least squares; IWLS) to compute maximum-likelihood estimates. You sometimes get the convergence warning because it will only go through a predefined number of iterations and, if the estimates have not converged by then, it gives up. This number is controlled by the argument maxit, which by default is maxit = 25. You could try setting it higher, though of course this will take longer. (If you set trace = TRUE it will show you the outcome of each iteration.)
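To give a feel for what the iterations do, here is a hand-rolled IWLS loop for a logistic model on simulated data, compared with glm(). This is a sketch and not the source of glm.fit: the function name iwls_logit is hypothetical, the simulated data are made up, and the loop starts from zero coefficients, which is not how glm.fit initialises. The maxit and epsilon arguments mirror the defaults of glm.control().

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x))  # simulated binary response
X <- cbind(1, x)                    # design matrix with intercept

iwls_logit <- function(X, y, maxit = 25, epsilon = 1e-8) {
  beta <- rep(0, ncol(X))
  dev_old <- Inf
  for (iter in seq_len(maxit)) {
    eta <- drop(X %*% beta)       # linear predictor
    mu  <- plogis(eta)            # fitted probabilities
    w   <- mu * (1 - mu)          # working weights
    z   <- eta + (y - mu) / w     # working response
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    beta <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    mu_new <- plogis(drop(X %*% beta))
    dev <- -2 * sum(y * log(mu_new) + (1 - y) * log(1 - mu_new))
    # Stop when the relative change in deviance is small enough
    if (abs(dev - dev_old) / (abs(dev) + 0.1) < epsilon) {
      return(list(coefficients = beta, iter = iter, converged = TRUE))
    }
    dev_old <- dev
  }
  list(coefficients = beta, iter = maxit, converged = FALSE)
}

hand <- iwls_logit(X, y)

# Same model via glm(), with a higher iteration cap; set trace = TRUE
# to print the deviance at each iteration
fit <- glm(y ~ x, family = binomial,
           control = glm.control(maxit = 50, trace = FALSE))

fit$converged   # FALSE would correspond to the "did not converge" warning
fit$iter        # number of IWLS iterations actually used
all.equal(unname(hand$coefficients), unname(coef(fit)), tolerance = 1e-4)
```

If fit$converged is FALSE even with a large maxit, raising the cap further is unlikely to help and the data or model specification is worth a closer look.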

Other sources of information: The helpfile for glm is accessible with ?glm or help(glm) and explains much of this. Two other useful resources are:

Intro to GLMs lecture notes and exercises from Heather Turner
Modern Applied Statistics with S, Fourth Edition. W. N. Venables and B. D. Ripley. Springer, 2002 (if I recall correctly)

Related Solutions

Solved – Accounting for overdispersion in binomial glm using proportions, without quasibinomial

Overdispersion occurs for a number of reasons, but with presence/absence data it is often caused by clustering of observations and correlation between them.

Taken from Brostrom & Holmberg (2011) Generalised Linear Models with Clustered Data: Fixed and random effects models with glmmML

"Generally speaking, a random effects model is appropriate if the observed clusters may be regarded as a random sample from a (large, possibly infinite) pool of possible clusters. The observed clusters are of no practical interest per se, but the distribution in the pool is. Or this distribution is regarded as a nuisance that needs to be controlled for."

https://cran.r-project.org/web/packages/eha/vignettes/glmmML.pdf

library(lme4)          # glmer()
library(RVAideMemoire) # overdisp.glmer()
# Observation-level factor: one level per row, used as a random intercept
Data$obs <- factor(formatC(1:nrow(Data), flag = "0", width = 3))
model.glmm <- glmer(cbind(number_pres, number_abs) ~ Var1+Var2+Var3+Var4...+
                    (1|obs), family = binomial(link = logit), data = Data)
overdisp.glmer(model.glmm) # Overdispersion for GLMM

Solved – Improving Logistic Regression model’s summary output

As these data are based on employee records, you presumably have data on the time to quitting (length of employment), not just the fact of having quit. If so, this would be better modeled with survival analysis. Predicting the length of employment would seem to be of considerable value to the company.

Then the dependent variable is continuous, with those who haven't quit yet treated as "censored" observations. (We all do, eventually, end up leaving employment.)

Whether you model this as logistic or survival, you should carefully limit the number of variables under consideration or use a penalized method like LASSO or elastic net. The rule of thumb to avoid overfitting if you are not using a penalized method is to consider no more than one variable per 15 events. That would be the number who quit or otherwise left employment for survival analysis, or the smaller of those who quit/didn't quit for logistic (which, the more I think on it, seems less and less useful here). And in terms of the number of variables, each categorical variable counts as one less than the total number of categories (that's how many columns it contributes to the model matrix).

To make this concrete, say that 600 out of the 1252 cases represented people who left employment with the company. If you intend to do standard survival analysis, this rule of thumb means that you should enter no more than about 600/15=40 variables (columns of a model matrix) into your analysis, not the full model matrix with 224 columns. If only 300 people in your data set left employment, only 20 variables should be considered in standard survival analysis. The particular variables might best be selected based on your knowledge of the subject matter, or multiple correlated predictors might be combined into single predictors. If you need to evaluate more predictors than warranted by this rule of thumb you should use a penalized method.
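The arithmetic above can be written as a small helper. The function name max_columns and the 15-events-per-column default are just the rule of thumb from this answer expressed in code, not a function from any package.

```r
# Rule of thumb: at most one model-matrix column per 15 events
max_columns <- function(n_events, per_column = 15) {
  floor(n_events / per_column)
}

max_columns(600)  # 40 columns if 600 people left
max_columns(300)  # 20 columns if 300 people left

# For logistic regression the limiting count is the smaller outcome class
n_total <- 1252
n_quit  <- 600
max_columns(min(n_quit, n_total - n_quit))  # 40

# A categorical predictor with k levels contributes k - 1 columns
f <- factor(c("a", "b", "c", "d"))
ncol(model.matrix(~ f)) - 1  # 3 (intercept column excluded)
```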
