How much sparsity is a problem for logistic regression

assumptions logistic regression sparse

I keep reading that sparsity (the number of cells with 0 observations in a cross tabulation of all variables in a model) is a problem for logistic regression models because it biases odds ratios and Wald tests upwards, etc. I think most datasets will have varying degrees of sparsity, but are there rules of thumb or formal tests for determining when sparsity is likely to lead to biased results in logistic regression?

EDIT

My source for this claim regarding sparsity:

Cohen, J., Cohen, P., West, S.G., Aiken, L.S. (2003) Applied Multiple Regression/Correlation Analysis for the Behavioural Sciences. New Jersey: Lawrence Erlbaum and Associates https://books.google.com.au/books?id=fAnSOgbdFXIC&pg=PT657&dq=cohen+regression+sparse&hl=en&sa=X&ved=0ahUKEwjVo–iicPZAhWDupQKHbwaAq8Q6AEIKTAA#v=onepage&q=cohen%20regression%20sparse&f=false

Greenland, S., Mansournia, M.A., & Altman, D.G. (2016). Sparse data bias: a problem hiding in plain sight. BMJ, 352, i1981.

http://www.bmj.com/content/352/bmj.i1981.full.print

Best Answer

There seem to be two approaches. The first is to look at a cross tabulation of the data and consider how much a small change in the distribution, e.g. moving one observation from one cell to another, changes the size of an effect (e.g. an odds ratio).
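As a rough illustration of this first check, the sketch below moves a single observation between two cells of a 2x2 table and reports how much the odds ratio changes. The tables and the names `odds_ratio` and `shift_factor` are hypothetical, made up for this example rather than taken from the paper.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: exposed cases / exposed non-cases
    c, d: unexposed cases / unexposed non-cases
    """
    return (a * d) / (b * c)


def shift_factor(table, shifted):
    """Factor by which the odds ratio changes between two tables (always >= 1)."""
    before = odds_ratio(*table)
    after = odds_ratio(*shifted)
    return max(before, after) / min(before, after)


# Hypothetical sparse table: only 4 exposed subjects in total.
sparse = (3, 1, 10, 40)
sparse_shifted = (2, 2, 10, 40)      # one exposed case moved to the non-case cell

# Hypothetical well-populated table with the same odds ratio.
dense = (30, 10, 100, 400)
dense_shifted = (29, 11, 100, 400)   # the same one-observation move

print("sparse:", odds_ratio(*sparse), "->", odds_ratio(*sparse_shifted))
print("dense: ", odds_ratio(*dense), "->", round(odds_ratio(*dense_shifted), 2))
print("change factor, sparse:", shift_factor(sparse, sparse_shifted))
print("change factor, dense: ", round(shift_factor(dense, dense_shifted), 2))
```

If a single reassigned observation changes the odds ratio by a large factor, as in the sparse table here, the estimate is resting on very few observations and should be treated with caution.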

The second approach is to compare the estimated effects with those from a penalised regression model. If the effects do not differ substantially, sparsity is probably not an issue; if they do, that suggests sparsity has influenced your model's results to some extent.
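A minimal sketch of this comparison, assuming a single binary exposure summarised as a 2x2 table. It fits the same logistic model twice with a small hand-written Newton solver, once by ordinary maximum likelihood and once with a ridge (L2) penalty on the slope. The ridge penalty and its weight of 1.0 are assumptions chosen for simplicity, standing in for whatever penalty you prefer, and `fit_logistic_2x2` is a hypothetical helper, not a library function.

```python
import math


def fit_logistic_2x2(a, b, c, d, ridge=0.0, iterations=50):
    """Fit logit P(case) = b0 + b1 * exposed by Newton's method.

    a, b: exposed cases / non-cases; c, d: unexposed cases / non-cases.
    ridge: L2 penalty weight on the slope b1 only (0.0 = plain maximum likelihood).
    Returns (b0, b1).
    """
    groups = [(0.0, c, c + d), (1.0, a, a + b)]  # (x, cases, total)
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Gradient of the penalised log-likelihood and the negative Hessian.
        g0, g1 = 0.0, -ridge * b1
        h00, h01, h11 = 0.0, 0.0, ridge
        for x, cases, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = cases - total * p
            w = total * p * (1.0 - p)
            g0 += resid
            g1 += x * resid
            h00 += w
            h01 += x * w
            h11 += x * x * w
        # Newton step: solve the 2x2 system directly.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical sparse table: 3 exposed cases, 1 exposed non-case,
# 10 unexposed cases, 40 unexposed non-cases.
table = (3, 1, 10, 40)

_, ml_slope = fit_logistic_2x2(*table)
_, ridge_slope = fit_logistic_2x2(*table, ridge=1.0)

print("ML odds ratio:       ", round(math.exp(ml_slope), 2))
print("Penalised odds ratio:", round(math.exp(ridge_slope), 2))
```

A large gap between the two odds ratios suggests the unpenalised estimate is being driven by sparse cells. How large a gap counts as "substantial" is a judgement call, and the answer depends on the penalty strength you choose.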

Source: Greenland, S., Mansournia, M.A., & Altman, D.G. (2016). Sparse data bias: a problem hiding in plain sight. BMJ, 352, i1981.

(http://www.bmj.com/content/352/bmj.i1981.full.print)
