Does using count data as an independent variable violate any GLM assumptions?

count-data, generalized linear model

I would like to employ count data as covariates while fitting a logistic
regression model. My question is:

  • Do I violate any assumption of the logistic (and, more in general, of the
    generalized linear) models by employing count, non-negative integer
    variables as independent variables?

I found a lot of references in the literature on how to use count
data as an outcome, but not as covariates; see for example the very clear
paper: "N E Breslow (1996) Generalized Linear Models: Checking Assumptions
and Strengthening Conclusions, Congresso Nazionale Societa Italiana di
Biometria, Cortona June 1995", available at
http://biostat.georgiahealth.edu/~dryu/course/stat9110spring12/land16_ref.pdf.

Loosely speaking, it seems that GLM assumptions may be expressed as follows:

  • iid residuals;
  • the link function must correctly represent the relationship between the
    dependent and independent variables;
  • absence of outliers.

Does anybody know whether there is any other assumption or technical
problem that would suggest using some other type of model for dealing with
count covariates?

Finally, please note that my data contain relatively few samples (<100)
and that the count variables' ranges can span 3-4 orders of magnitude
(i.e. some variables have values in the range 0-10, while other variables may
have values within 0-10000).

A simple R example code follows:

###########################################################

# generating simulated data

var1 <- sample(0:10, 100, replace = TRUE)
var2 <- sample(0:1000, 100, replace = TRUE)
var3 <- sample(0:100000, 100, replace = TRUE)
outcome <- sample(0:1, 100, replace = TRUE)
dataset <- data.frame(outcome, var1, var2, var3)

# fitting the model

model <- glm(outcome ~ ., family = binomial, data = dataset)

# inspecting the model

print(model)

###########################################################

Best Answer

There are some nuances at play here, and they may be creating some confusion.

You state that you understand the assumptions of a logistic regression include "iid residuals... ". I would argue that this is not quite correct. We generally do say that about the General Linear Model (i.e., regression), but in that case it means that the residuals are independent of each other, with the same distribution (typically normal) having the same mean (0), and variance (i.e., constant variance: homogeneity of variance / homoscedasticity). Note however that for the Bernoulli distribution and the Binomial distribution, the variance is a function of the mean. Thus, the variance couldn't be constant, unless the covariate were perfectly unrelated to the response. That would be an assumption so restrictive as to render logistic regression worthless. I note that in the abstract of the pdf you cite, it lists the assumptions starting with "the statistical independence of the observations", which we might call i-but-not-id (without meaning to be too cute about it).
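To make the point about non-constant variance concrete, here is a minimal sketch in R. It relies on the standard fact that a Bernoulli variable with mean p has variance p(1 - p); the probabilities, the simulation size, and the seed are arbitrary choices for illustration, not values from the question.

```r
# Theoretical Bernoulli variance: p * (1 - p), so it changes with the mean p.
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)          # illustrative means (assumption)
bernoulli_var <- p * (1 - p)
print(data.frame(p, variance = bernoulli_var))

# Empirical check by simulation (seed and sample size chosen arbitrarily).
set.seed(1)
empirical_var <- sapply(p, function(pr) var(rbinom(1e5, size = 1, prob = pr)))
print(round(empirical_var, 3))
```

Because the fitted mean differs from observation to observation whenever a covariate matters, the response variance differs too, which is why "identically distributed" cannot hold in a useful logistic regression.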

Next, as @kjetilbhalvorsen notes in the comment above, covariate values (i.e., your independent variables) are assumed to be fixed in the Generalized Linear Model. That is, no particular distributional assumptions are made. Thus, it does not matter if they are counts or not, nor if they range from 0 to 10, from 1 to 10000, or from -3.1415927 to -2.718281828.
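As a rough illustration that the scale of a covariate does not matter to the model itself, the sketch below fits the same logistic regression with a count covariate in its raw units and divided by 1000. The simulated data, the variable names, the seed, and the divisor are all assumptions made up for this example.

```r
# Simulated data (hypothetical): one small-range and one large-range count.
set.seed(42)
n <- 100
var1 <- sample(0:10, n, replace = TRUE)
var2 <- sample(0:10000, n, replace = TRUE)
outcome <- rbinom(n, size = 1, prob = plogis(-1 + 0.2 * var1))
dataset <- data.frame(outcome, var1, var2)

# Same model, with var2 in raw units and rescaled by an arbitrary factor.
fit_raw    <- glm(outcome ~ var1 + var2, family = binomial, data = dataset)
fit_scaled <- glm(outcome ~ var1 + I(var2 / 1000), family = binomial,
                  data = dataset)

# The var2 coefficient changes by the scaling factor ...
print(coef(fit_raw))
print(coef(fit_scaled))

# ... but the fitted probabilities are the same up to numerical tolerance.
same_fit <- all.equal(fitted(fit_raw), fitted(fit_scaled), tolerance = 1e-4)
print(same_fit)
```

Rescaling changes only the units in which the coefficient is reported; the fitted model is the same.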

One thing to consider, however, as @whuber notes, is that if you have a small number of data points that are very extreme on one of the covariate dimensions, those points could have a great deal of influence over the results of your analysis. That is, you might get a certain result only because of those points. One way to think about this is to do a kind of sensitivity analysis by fitting your model both with and without those data included. You may believe it is safer or more appropriate to drop those observations, to use some form of robust statistical analysis, or to transform those covariates so as to minimize the extreme leverage those points would have. I would not characterize these considerations as "assumptions", but they are certainly important considerations in developing an appropriate model.
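A sensitivity analysis of this kind might look like the following sketch. The data are simulated, and the three extreme counts, their outcomes, and the cutoff of three times the mean hat value are illustrative assumptions (a rule of thumb, not a prescription); `log1p()` is used as one possible transformation because counts can be zero.

```r
# Simulated data (hypothetical): mostly small counts plus three extreme ones.
set.seed(7)
n <- 100
count_x <- rpois(n, lambda = 5)
count_x[1:3] <- c(4000, 7000, 10000)      # extreme covariate values (assumption)
outcome <- rbinom(n, size = 1, prob = 0.5)
outcome[1:3] <- c(1, 0, 1)                # fixed so the example is stable
dataset <- data.frame(outcome, count_x)

# Fit with all observations and compute leverage (hat values).
fit_all <- glm(outcome ~ count_x, family = binomial, data = dataset)
lev <- hatvalues(fit_all)
keep <- lev <= 3 * mean(lev)              # rule-of-thumb cutoff (assumption)

# Refit without the high-leverage points, and with a log-transformed covariate.
fit_trimmed <- glm(outcome ~ count_x, family = binomial, data = dataset[keep, ])
fit_log     <- glm(outcome ~ log1p(count_x), family = binomial, data = dataset)

# Compare coefficients (the log1p row is on a different scale for the slope).
comparison <- rbind(all = coef(fit_all),
                    trimmed = coef(fit_trimmed),
                    log1p = coef(fit_log))
print(which(!keep))
print(comparison)
```

If the conclusions change materially between the fits, the result depends on a handful of observations and deserves closer scrutiny.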