Categorical Data – How to Adjust for Missing Values in Categorical Variables Using Data Imputation

I have a data set containing about 40 categorical variables, and I am trying to factor analyze them. But each categorical variable contains a good number of missing values. Some of them are simply due to non-response: the respondent did not fill in any answer option for that question. Others are due to questions of the following type:

5) Do you create formal work teams in your institution?
 1="NO" 2="YES"
(Please skip questions 6 and 7 if your answer to this question
is 1="NO")

6) How many members form the work team? (for example)

7) What is the criterion of selecting team members? (for example)

Now those who answered "NO" to question 5 will not answer questions 6 and 7; they resume at question 8. This is another source of missing information in the data set. Because of this type of missing value in particular, a lot of information is lost if I delete cases listwise.

So I am looking for a way to adjust for these missing values. I don't know how to do this (for either the non-response type or the second type I mentioned) with categorical variables. Taking the mean or median, or even using the EM algorithm, may not be appropriate for categorical variables, I guess. So what should be done, and how?

My actual number of observations is 212, but it reduces to only 42 when I use na.omit(data).

Best Answer

As for your non-response case, you might use multiple imputation or hot-deck imputation, which is easier but nevertheless a good method. Multiple imputation is the more universal of the two; hot-deck requires that the background variables (the ones used to match recipient observations, which have the missing value, with donor observations, which do not) be categorical. Given that your data are mostly categorical, the hot-deck method will suit. Both methods are applicable under a MAR (missing-at-random) pattern, for which listwise deletion and mean/median substitution are not appropriate.
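To make the donor/recipient matching concrete, here is a minimal sketch of a random hot-deck in plain Python (the same logic carries over to R). It is only an illustration under stated assumptions, not a full procedure: the function name `hot_deck_impute`, the variable names `sector`, `size` and `q9`, and the toy records are all made up, missing values are assumed to be coded as `None`, and a recipient with no matching donor is simply left missing.

```python
import random
from collections import defaultdict


def hot_deck_impute(rows, target, deck_vars, seed=0):
    """Fill missing values (None) of `target` by copying the value of a
    randomly chosen donor that matches the recipient on all `deck_vars`.

    Returns new row dicts; the input rows are not modified.
    """
    rng = random.Random(seed)

    # Pool the observed target values by the donor's deck-variable profile.
    donors = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            key = tuple(row[v] for v in deck_vars)
            donors[key].append(row[target])

    imputed = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None:
            key = tuple(new_row[v] for v in deck_vars)
            pool = donors.get(key)
            if pool:  # no matching donor: leave the value missing
                new_row[target] = rng.choice(pool)
        imputed.append(new_row)
    return imputed


# Hypothetical toy records: two categorical deck variables, one target item.
survey = [
    {"sector": "public", "size": "small", "q9": "yes"},
    {"sector": "public", "size": "small", "q9": None},
    {"sector": "private", "size": "large", "q9": "no"},
    {"sector": "private", "size": "large", "q9": None},
    {"sector": "private", "size": "small", "q9": None},
]

completed = hot_deck_impute(survey, target="q9", deck_vars=["sector", "size"])
for row in completed:
    print(row)
```

In this toy data the last respondent has no donor with the same profile, so the value stays missing; in practice you would then coarsen the deck (match on fewer variables) for such cases.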

As for your non-questioned case, I believe no special imputation procedure is either helpful or needed. Logically, if a question was not asked because there is a single, obvious response (How many members form the work team? - One, me), then you can add this response option as if it had been in the questionnaire. But if the possible response is ambiguous (What is the criterion of selecting team members? - a) I work alone because I'm confident in myself; b) I work alone because I'm too shy to show my incompetence; etc.), there's no way out except to drop such questions altogether or to drop everybody who was not asked them.
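For the case where the skipped question has one obvious answer, the recoding could look roughly like the sketch below (plain Python again). The helper `recode_skips`, the label `"one (works alone)"` and the toy records are hypothetical; `None` is assumed to mark a missing value and `1` to mean "NO" on question 5, as in the example questionnaire. Only the structurally skipped item is filled; genuine non-response and the ambiguous question 7 are left untouched.

```python
def recode_skips(rows, filter_var, skip_value, dependent_var, fill):
    """Where the filter question routed the respondent past `dependent_var`,
    replace the resulting gap with an explicit category `fill`.

    Gaps of respondents who were asked the question (true non-response)
    are left as None. Returns new row dicts; the input is not modified.
    """
    recoded = []
    for row in rows:
        new_row = dict(row)
        skipped = new_row[filter_var] == skip_value
        if skipped and new_row[dependent_var] is None:
            new_row[dependent_var] = fill
        recoded.append(new_row)
    return recoded


# Hypothetical toy records: q5 is the filter (1 = "NO", 2 = "YES").
answers = [
    {"q5": 1, "q6": None, "q7": None},            # routed past q6 and q7
    {"q5": 2, "q6": None, "q7": "by seniority"},  # asked q6, did not answer
    {"q5": 2, "q6": "3-5", "q7": "by skill"},     # fully answered
]

recoded = recode_skips(
    answers,
    filter_var="q5",
    skip_value=1,
    dependent_var="q6",
    fill="one (works alone)",  # hypothetical label for the obvious answer
)
for row in recoded:
    print(row)
```

After this step, only the second respondent's q6 is still missing, and that gap is ordinary non-response that can go to the imputation step.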
