Solved – The use of Zero Value as Missing Values and Outliers

datasetexploratory-data-analysis

I am not sure what is the best title for this question. Sorry for that.

Let's say I analyzing a dataset of prices of houses. And one of the columns (features) is year_of_renovation. Since, not every house was ever renovated, a good portion of them has zero value in this field.

Personally, I considered the zero value to be a missing value N.A.. Is that correct?

I do so because I believe columns such this one could actually be broken down into two columns, was_renovated, and year_of_renovation. Of course we don't need to break the information into these two columns, but these could be considered two different kind of information. Right?

I guess there are two scenarios:

Scenario 1

Having zero values in year_of_renovation means the house was never renovated. Therefore zero values are not missing values, in a first thought as N.A is.

Scenario 2

Having zero value meaning that the year_of_renovation is unknown. In this scenario, it seems that zero values are inter-exchanged/means the as as N.A. values, right?


Regardless of considering zero values the same as missing values or not, I have another two questions.

  1. Would it be wrong to consider zero values as outliers?
  2. Can zero values, when used to represent missing values (such as max_int, min_int …), be also considered outliers?

Best Answer

When zero is a valid value for a variable (i.e., not missing) then use it in the analysis. However, for home price, when zero is replaced for price since it's missing, delete the zeroes, since most software will recognize the missing fields as missing. You don't ever want to (e.g.) calculate average house price by including home with price values of zero, since it will artificially lower the average - and that it not a true average based on houses that didn't have a missing price.

The other problem you will learn when zero is used for missing is that it will often result in a large spike at zero in a histogram for a variable.

In summary, if zero truly represents something, then use it. It could be an outlier quite easily. You might report e.g. average and s.d. for renovation of homes that were renovated, and then list the number of homes that weren't renovated, since all the remaining homes that were renovated will have renovation costs much larger.

Zero values that are true are usually not an outlier if the range of values falls in e.g. [0,1], [0,2] or [-2,2]. When non-zero values are large however, like home renovation costs, I would not use the zero values, and would just say for example there were 30 homes that weren't renovated (zero), and then give the average and s.d. of renovation costs for homes that were renovated for which the cost is non-zero.

Related Question