Missing Data – How to Replace ‘Not Applicable’ or ‘N/A’ Values with Zero

data-imputationmissing data

I am trying to compare user ratings of various products, the majority of which come in several standard versions. However some of the products do not come in certain versions meaning that my data set has several N/A values.

The data is continuous interval type data with a range of -100 to +100

My question is, when should these values be replaced by 0?

            V1      V2      V3      V4      V5      V6      V7      V8      V9
Product 1   2.63    -5.12   -0.41   5.29    9.89    4.16    14.73   9.06    -7.80
Product 2   0.60    0.94    4.47    N/A     0.12    21.47   N/A     -4.63   1.29
Product 3   5.53    -16.20  -19.56  N/A     2.24    N/A     15.07   -3.47   -6.93

With N/A values included, excel tells me the average user rating given to each product is

Product 1 = 3.60
Product 2 = 3.47
Product 3 = -3.33

However if I replace the values with 0 then it changes the scores:

Product 1 = 3.60
Product 2 = 2.70
Product 3 = -2.59

I am sure others have dealt with this question before but I am not sure what to do. In future I want to undertake t-tests or z-tests on the data.

Some Research

To be honest apart from a little bit here and some questions on Research Gate I cant find a lot on this topic, I suspect I am using the wrong search terms however. The below is what I have got so far

There appears to be lots of questions about how to replace N/A values in datasets but not when or whether it should be done

There are a few other simalar questions on CV but they have not received an answer, e.g. here

A comment in reply to a question here, only slightly similar question suggests setting NA values to the minimum of the range, but this would significantly change my results…

P.S. I am really not sure what to tag this question with so if anyone could apply better tags it would be much appreciated.

Thanks

Best Answer

To formalize @Whuber's comment in an answer, it seems like no, you shouldn't impute here.

From the range of available scores, I'll assume that a $0$ denotes a neutral opinion of that product/version combination. If you were to impute values of $0$ for your NAs, what you're saying is that the average user rating of that product/version combination is neutral. However, that doesn't make any sense, given that those product/version combinations don't exist at all! If a given product/version combination doesn't exist, it would be inappropriate to assign a numerical user rating to represent an opinion of a nonexistent product/version combination.

If the NAs represented missingness due to lack of data, imputation might be an option (though there are other options beyond imputation, including doing nothing). But that doesn't seem like the case here.

Best Answer

Related Solutions

Missing Data – How to Address Problems with Missing Values and Data Imputation Techniques

Solved – Is it better to replace missing values by mean or mean by class

Related Question