Solved – Missing data in a logistic regression analysis

dataset, missing data, regression

I am performing a logistic regression analysis with all my data dichotomized (0/1). My problem is that I have a lot of missing data and the analysis eliminates participants listwise (i.e., it eliminates a participant even if they have only one missing value on any variable).

I do not want to impute the missing data, but I am wondering whether the analysis could be run with a different number of participants per variable. For instance, gender and bullying have 500 complete cases, but nationality has 450 complete cases and 50 missing. If bullying is my dependent variable, gender comparisons would include 500 participants but nationality comparisons would include 450.

What do you think?

Best Answer

If you are referring to a multivariate analysis, the approach of "dropping mostly incomplete factors" may be called a complete factor analysis. Here, inclusion of a variable in a model is conditional upon the completeness of its observations. If, for instance, 20% or more of the values for a variable are missing, we might make a rule to omit that variable.
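As a rough illustration of such a rule, here is a minimal Python sketch. The toy data, the function names, and the use of `None` for a missing value are all assumptions made for the example. Only the 20% cutoff comes from the paragraph above.

```python
# Hypothetical toy data: None marks a missing value.
data = {
    "bullying":    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "gender":      [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "nationality": [1, None, 0, None, 1, 0, None, 1, 0, 1],
}


def missing_fraction(values):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def select_complete_factors(data, outcome, threshold=0.20):
    """Split predictors into kept/dropped by their missingness.

    A predictor is dropped when its missing fraction is at or above
    the threshold (20% here, as in the rule described above).
    """
    kept, dropped = [], []
    for name, values in data.items():
        if name == outcome:
            continue  # the outcome is never a candidate for dropping
        if missing_fraction(values) >= threshold:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


kept, dropped = select_complete_factors(data, outcome="bullying")
print("kept:", kept)
print("dropped:", dropped)
```

In this toy data, nationality is missing for 3 of 10 participants, so it is the variable that would be left out of the model.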

Getting more n into the analysis sample may or may not tighten the CIs, and that is the only reason not to consider complete case analysis (listwise deletion). Dropping factors, however, will change the interpretation and long-run behavior of the estimates. There are reasons to consider a complete factor analysis as a sensitivity analysis, but not as a primary analysis.

We generally don't do complete factor analyses because such analyses are biased. The factor you chose to omit was part of a prespecified analysis plan, and thus served one or both of two functions in the model: 1) it is prognostic of the outcome and/or 2) it stratifies or reduces confounding. Because of the non-collapsibility of the logit link, you cannot simply "throw measures out" because they do not have the data properties you desire: the inference and estimates will change, and you wind up answering a different question. If the omitted factor is a confounder, these problems are even more serious.
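To make the non-collapsibility point concrete, here is a small numerical sketch in Python. The risks below are made-up numbers, and Z is a hypothetical covariate that is prognostic of the outcome but independent of the exposure X, so it is not a confounder. Even so, the odds ratio for X changes once Z is left out.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Hypothetical risks P(Y=1 | X=x, Z=z), keyed by (x, z).
risk = {
    (0, 0): 0.2, (1, 0): 0.5,
    (0, 1): 0.5, (1, 1): 0.8,
}

# Assumption: Z is independent of X, with P(Z=1) = 0.5 in both X groups,
# so Z is prognostic of Y but does not confound the X-Y relationship.
p_z1 = 0.5

# Conditional (Z-adjusted) odds ratios for X within each stratum of Z.
or_z0 = odds(risk[(1, 0)]) / odds(risk[(0, 0)])
or_z1 = odds(risk[(1, 1)]) / odds(risk[(0, 1)])


def marginal_risk(x):
    """P(Y=1 | X=x) after averaging over Z (i.e. omitting Z)."""
    return (1.0 - p_z1) * risk[(x, 0)] + p_z1 * risk[(x, 1)]


# Marginal odds ratio for X when Z is dropped from the model.
or_marginal = odds(marginal_risk(1)) / odds(marginal_risk(0))

print("OR for X given Z=0:", round(or_z0, 3))
print("OR for X given Z=1:", round(or_z1, 3))
print("OR for X with Z omitted:", round(or_marginal, 3))
```

Both stratum-specific odds ratios are 4, while the odds ratio with Z omitted is about 3.45. Nothing is confounded here. The two numbers simply answer different questions, which is why dropping a covariate changes what the model estimates.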


In general, it is preferable to conduct--as a main analysis--inference which is inefficient as opposed to biased.


By contrast, if your analysis proposes several separate comparisons, I frequently see uneven n's between those comparisons. This is mainly due to complete case analysis, AKA listwise deletion. That approach is generally considered sane because the assumption that responders are the basis of a representative sample holds in each case. So one comparison having N=500 and another having N=450 simply means there were 500 responders in one analysis sample and 450 in the next. Describing missing data carefully helps the readers understand the meaning and impact of this approach.
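A minimal sketch of what separate comparisons with uneven n's might look like, in Python. The data and function name are hypothetical. It relies on the fact that, with one binary predictor and a binary outcome, the unadjusted logistic regression odds ratio equals the cross-product ratio of the 2x2 table, so no model-fitting library is needed for the illustration.

```python
def available_case_odds_ratio(outcome, predictor):
    """Odds ratio for one comparison, using only the participants
    observed on both the outcome and this predictor."""
    pairs = [(y, x) for y, x in zip(outcome, predictor)
             if y is not None and x is not None]
    a = sum(1 for y, x in pairs if x == 1 and y == 1)
    b = sum(1 for y, x in pairs if x == 1 and y == 0)
    c = sum(1 for y, x in pairs if x == 0 and y == 1)
    d = sum(1 for y, x in pairs if x == 0 and y == 0)
    if b == 0 or c == 0:
        return len(pairs), None  # odds ratio undefined with an empty cell
    return len(pairs), (a * d) / (b * c)


# Hypothetical toy data: None marks a missing value.
bullying    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
gender      = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0]
nationality = [1, None, 1, 0, None, 0, 1, 1, 0, 0, 1, 0]

results = {
    name: available_case_odds_ratio(bullying, values)
    for name, values in {"gender": gender, "nationality": nationality}.items()
}

for name, (n, odds_ratio) in results.items():
    print(f"{name}: n = {n}, odds ratio = {odds_ratio}")
```

Each comparison reports its own n (12 for gender and 10 for nationality in this toy data), which is the number that should accompany the estimate when the results are written up.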
