How to handle missing data when determining differences between groups using chi-squared or Fisher’s exact test

chi-squared-test, epidemiology, fishers-exact-test, missing data, r

I have 168 rows of patient data: 104 controls and 64 cases. I want to know if albumin status (low or high) is related to case/control status. I made a table using R:

> table(Albumin, Status, useNA = "ifany")
Albumin    Control  Case
    Low    51       16
    High   39       32
    <NA>   14       16

As you can see, I have missing data. I did a chi-squared test on the entire table:

> chisq.test(table(Albumin, Status, useNA = "ifany"))$p.value
[1] 0.006222513

Question: Should I perform the test on the 3×2 table above that includes the missing data? Or should I perform it on a 2×2 table that excludes the missing data, as shown below?

> chisq.test(table(Albumin, Status))$p.value
[1] 0.01496166

Problem: In this example, both approaches yield significant p-values. However, I have other variables for which the difference is not significant when missing values are excluded, but significant when they are included. I also have some variables with only one missing value.

Question: How should I apply the chi-squared test in those situations? Is my choice of test correct, or should I be using Fisher's exact test or some other test? And are there any diagnostics that I need to do before even applying these tests?

Best Answer

Unless there is a specific reason why some patients have missing values, and you are interested in that reason, I would exclude the patients with missing albumin from the test.
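One way to probe whether missingness itself is related to case/control status is to cross-tabulate "observed vs. missing" against status. The sketch below rebuilds the counts from the table in the question; the object names (`counts`, `missing_tab`, `prop_missing`, `miss_test`) are made up for illustration. A test like this cannot tell you why values are missing, only whether the missing fraction differs between groups.

```r
# Counts copied from the 3x2 table in the question (NA row relabelled "Missing")
counts <- matrix(c(51, 39, 14,    # Control: Low, High, Missing
                   16, 32, 16),   # Case:    Low, High, Missing
                 nrow = 3,
                 dimnames = list(Albumin = c("Low", "High", "Missing"),
                                 Status  = c("Control", "Case")))

# Collapse Low/High into "Observed" to get a 2x2 missingness-by-status table
missing_tab <- rbind(Observed = colSums(counts[c("Low", "High"), ]),
                     Missing  = counts["Missing", ])

# Fraction of patients with missing albumin in each group
prop_missing <- missing_tab["Missing", ] / colSums(missing_tab)
print(prop_missing)

# Chi-squared test of whether the missing fraction differs by status
miss_test <- chisq.test(missing_tab)
print(miss_test$p.value)
```

If the missing fraction differs clearly between cases and controls, that is a hint that the reason for missingness deserves attention before the rows are simply dropped.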

You don't need an exact test here; all the cell counts are large enough for the chi-squared approximation to be reasonable.
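A common diagnostic for "reasonable cell sizes" is to look at the expected counts under independence; a widely used rule of thumb (an assumption here, not something stated in the question) is that they should all be at least 5. The sketch below rebuilds the 2×2 table from the counts in the question, with `tab`, `res`, and `min_expected` as illustrative names, and also runs Fisher's exact test for comparison.

```r
# 2x2 table from the question, missing row excluded
tab <- matrix(c(51, 39,    # Control: Low, High
                16, 32),   # Case:    Low, High
              nrow = 2,
              dimnames = list(Albumin = c("Low", "High"),
                              Status  = c("Control", "Case")))

res <- chisq.test(tab)           # Yates' continuity correction is the default for 2x2
print(res$expected)              # expected counts under independence
min_expected <- min(res$expected)
print(min_expected >= 5)         # rule-of-thumb check

print(res$p.value)               # same test as chisq.test(table(Albumin, Status))
print(fisher.test(tab)$p.value)  # exact test, for comparison
```

When the smallest expected count is well above 5, as it is here, the chi-squared and exact tests generally lead to the same conclusion.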

However: 1) Don't you want some form of regression instead? 2) Why is albumin dichotomized into low and high? Dichotomizing continuous variables is usually a bad idea (see Royston, Altman & Sauerbrei).

If you have the actual albumin values, I suggest a linear regression of albumin on case status (albumin ~ case), possibly with other covariates added if you have them. Adjusting for covariates is especially important if this is an observational study, but it is still worthwhile in an experimental one, because covariates can vary between groups even if assignment is random, and because covariates can affect other regressors.
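A minimal sketch of what such a regression could look like follows. The albumin values are simulated placeholders, not the poster's measurements, and the column names `status` and `albumin`, the mean and spread, and the positions of the missing values are all assumptions made for illustration.

```r
set.seed(1)  # simulated placeholder data, NOT the real patient data

n_control <- 104
n_case    <- 64
dat <- data.frame(
  status  = factor(rep(c("Control", "Case"), times = c(n_control, n_case)),
                   levels = c("Control", "Case")),
  albumin = rnorm(n_control + n_case, mean = 40, sd = 5)  # hypothetical values
)
dat$albumin[sample(nrow(dat), 30)] <- NA  # mimic 30 missing measurements

# Continuous albumin regressed on case/control status.
# With R's default na.action (na.omit), rows with missing albumin are dropped.
fit <- lm(albumin ~ status, data = dat)
print(summary(fit)$coefficients)
print(nobs(fit))  # number of rows actually used in the fit
```

Further covariates, if available, would be added on the right-hand side of the formula, for example `albumin ~ status + age` with a hypothetical `age` column.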
