How to best code the N/A response of the Likert-type rating scale

Tags: categorical-encoding, likert, missing data, ordinal-data, scales

Say I have a dataset of people's opinions ("ratings") on something, where each person has to choose one of five possible answers for each question: Very happy, Happy, Neither happy nor unhappy, Unhappy, and Very unhappy. In addition, some people don't answer at all, so let's say we mark these as N/A. We can map Very happy to 5, Happy to 4, and so on down to Very unhappy as 1. But how should N/A be mapped?

  1. Should I impute it as the mean of the possible options (i.e., 3)? What are the disadvantages of this method?
  2. Should I make it 0, in which case it might affect some prediction results later?
  3. Should I convert the features into one-hot encoding, in which case if there are n columns with these same possible answers, we'll have increased the number of features to 6n?
  4. Should I make an extra column for only the N/A as one-hot encoding, so that there are 2 columns for rating – one for "non-N/A" (or valid scores), and one for N/A?
  5. Should I randomly assign a number from 1 to 5 to the N/As?
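
To make options 3 and 4 concrete, here is a minimal Python sketch using only the standard library. The names `SCALE`, `encode_with_indicator`, and `encode_one_hot` are hypothetical, and the three example responses are made up; this illustrates the encodings being asked about and is not a recommendation of either.

```python
# Labels and codes follow the 5-point scale described in the question.
SCALE = {
    "Very unhappy": 1,
    "Unhappy": 2,
    "Neither happy nor unhappy": 3,
    "Happy": 4,
    "Very happy": 5,
}
NA = "N/A"  # assumed marker for a missing answer


def encode_with_indicator(response):
    """Option 4: return (score, is_na); score is None when there is no answer."""
    if response == NA:
        return None, 1
    return SCALE[response], 0


def encode_one_hot(response):
    """Option 3: six 0/1 columns per question (five scale points plus N/A)."""
    categories = list(SCALE) + [NA]
    return [1 if response == c else 0 for c in categories]


responses = ["Very happy", "N/A", "Unhappy"]  # made-up example data
for r in responses:
    print(r, encode_with_indicator(r), encode_one_hot(r))
```

With n questions, `encode_one_hot` yields the 6n columns mentioned in option 3, while `encode_with_indicator` yields 2n columns and leaves open what to do with the `None` scores.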

Best Answer

Some of the answers here seem more complicated or hi-falutin' than may be needed or indeed justified. For example, in many projects short of, say, Ph.D. level, getting into imputation may be beyond the time available or the skill level expected.

Also, the title says "N/A", which often means "not applicable". In the question itself, the OP treats N/A as the researcher's coding for blank answers, in effect "no answer". Some of my answer covers N/A as a deliberate and allowed reply to a question. There is a lot of turbulent water between these interpretations.

It seems to me that the simplest possibilities have not yet been mentioned at all.

  1. N/A is just another category. A side-effect of that: the scale is no longer ordinal, because N/A has no natural position relative to the five ordered answers.

  2. Leave out the N/As from the data analysis. That is ethically sensitive as well as statistically sensible. Seriously, if I fill in a questionnaire and am given N/A as an option and then use it, that's my right within the survey I undertake. It should be taken that I meant what I said. Suppose that I really don't play golf or use Twitter or whatever it is, and that is why a question on preference for golf clubs or my Twitter attitudes really doesn't apply. I don't want some fool of a researcher analysing the data to presume or assume that I "really" meant something else and that the rest of my answers somehow are informative on what that might be. This also applies if I leave a question unanswered, given scope to do that.

  3. There is one alternative that occasionally may make sense, which is to guess that N/A is in some sense equivalent to the neutral category. This has to be argued carefully, but the spirit is, perhaps, that for some questions they may both be flavours of "don't care".

  4. Whatever the research strategy, if N/As are at all common, then a really careful study will include some kind of sensitivity analysis and explore different ways of analysing them, and explain how much difference it makes to the results of the study. Researchers often prefer not to do this, partly because it tends to underline how lousy the data are.
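
A minimal sketch of what such a sensitivity analysis might look like in Python, assuming ratings coded 1 to 5 with `None` standing for N/A. The data and the function names are made up for illustration; the three functions correspond to possibilities 2, 3, and 1 above, in that order.

```python
from collections import Counter
from statistics import mean

# Made-up responses on the 1-5 coding; None stands for N/A.
ratings = [5, 4, None, 3, None, 2, 4, None, 1, 5]


def mean_dropping_na(values):
    """Possibility 2: leave the N/As out of the analysis."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def mean_na_as_neutral(values, neutral=3):
    """Possibility 3: treat N/A as equivalent to the neutral category."""
    return mean(neutral if v is None else v for v in values)


def counts_with_na_category(values):
    """Possibility 1: N/A is just another category, so report counts only."""
    return Counter("N/A" if v is None else v for v in values)


print("mean, N/A dropped:   ", mean_dropping_na(ratings))
print("mean, N/A as neutral:", mean_na_as_neutral(ratings))
print("counts with N/A:     ", counts_with_na_category(ratings))
```

If the summaries differ noticeably across treatments, that difference is itself a result worth reporting alongside the share of N/As.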