Solved – Replace missing values in a continuous variable with some predefined number, possibly a negative number

data-imputation, generalized-linear-model, generalized-additive-model, missing-data

Is it an acceptable idea to replace missing values with some arbitrary negative number like -99 in a continuous variable that contains only zeros and positive values besides NAs?

In my dataset I have continuous variables that serve as predictors in credit risk models. I've built models with binned versions of these variables, but because of the well-known disadvantages of binning I would like to build models (GAM, GLM) on the continuous values. The problem is that missing values in those variables mean something quite different from zeros: a missing value for the outstanding amount on car loans means the person doesn't have any car loan, whereas zero means the person has zero remaining debt on their car loan.
I have googled and researched, got confused and perplexed, but haven't found a solution that suits the requirements of my case.

Replacing NAs with zeros is not a good idea because persons with zero or close-to-zero balances have the lowest default rate (say 1%), whereas the default rate for persons with NAs is slightly higher than the average portfolio default rate (say 5%).
Imputation doesn't seem like a good option either: 1) the share of NAs in some variables can be as high as 50%, i.e. observations are sparse; 2) if I impute and the model is then deployed into production, values would have to be imputed (i.e. "made up") from credit reports even though those real people actually have NAs. That is data manipulation, a questionable thing to do.

What if I replace NAs with some negative value like -999 while the remaining observations stay in the range 0 to infinity? What trouble would that cause? Will it make model fitting with GAM, GLM, or the Lasso complicated and error-prone? Can those algorithms deal well with such a data pattern, i.e. continuous positive values plus one predefined negative value?
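For concreteness, here is a minimal simulated sketch of that data pattern fitted with mgcv's gam. Everything here is made up for illustration (the variable names, the ~40% missingness rate, the 1%/5% default levels chosen to echo the numbers above); the main thing to look at is how the smooth has to bridge the empty gap between the -99 sentinel and the legitimate range starting at 0.

```r
library(mgcv)

set.seed(1)
n <- 5000
balance <- rexp(n, rate = 1 / 5000)        # legitimate values: 0 to +inf
balance[runif(n) < 0.4] <- NA              # ~40% "no car loan at all"

# NAs form their own risk group (~5%); otherwise risk rises with balance (~1% near zero)
p_default <- ifelse(is.na(balance), 0.05, plogis(-4.6 + balance / 10000))
default   <- rbinom(n, 1, p_default)

x <- ifelse(is.na(balance), -99, balance)  # the proposed sentinel coding

fit <- gam(default ~ s(x), family = binomial)
plot(fit, shade = TRUE)   # note the smooth spanning the empty (-99, 0) gap
```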

UPDATE: I wrote a custom function that calculates WoE, IV, bad rate, etc. per bin on the BINNED dataset, where NAs and zeros are separate bins. For variables where the bad rate and WoE of the NA bin and the ZERO bin are similar, I replace NAs with zeros; otherwise I replace NAs with -99 (minus 99, yes, because all positive values are occupied by actual, real observations); a sketch of this rule is shown after the next paragraph. Then I trained classification models (RF, gbm, etc.) on the binned dataset, on the unbinned one (with zeros and -99s instead of NAs), and on a mixed dataset where some numeric variables, like loan amounts, are binned and others are not.
In factor variables, I also turned NAs into an explicit "MISSING" level.
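Here is a minimal sketch of the replacement rule described above, assuming a 0/1 "bad" flag. The function names, the WoE tolerance, and the bin labels are illustrative, not from my actual code, and for brevity only the WoE comparison is shown (my function also compares bad rates).

```r
woe_by_bin <- function(bin, bad) {
  # WoE per bin: log(share of goods in the bin / share of bads in the bin),
  # assuming bad is coded 0 (good) / 1 (bad)
  tab   <- table(bin, bad)
  goods <- tab[, "0"] / sum(tab[, "0"])
  bads  <- tab[, "1"] / sum(tab[, "1"])
  log(goods / bads)
}

replace_na_numeric <- function(x, bad, woe_tol = 0.1, sentinel = -99) {
  bin <- ifelse(is.na(x), "NA_BIN", ifelse(x == 0, "ZERO_BIN", "POSITIVE"))
  woe <- woe_by_bin(bin, bad)
  # If the NA bin behaves like the zero bin (similar WoE), treat NAs as zeros;
  # otherwise push them to a sentinel below the legitimate range.
  fill <- if (abs(woe["NA_BIN"] - woe["ZERO_BIN"]) < woe_tol) 0 else sentinel
  x[is.na(x)] <- fill
  x
}

# Factor variables: turn NA into an explicit "MISSING" level
na_to_missing <- function(f) {
  f <- addNA(f)
  levels(f)[is.na(levels(f))] <- "MISSING"
  f
}

# Hypothetical usage, column names made up:
# df$car_balance <- replace_na_numeric(df$car_balance, df$bad)
# df$region      <- na_to_missing(df$region)
```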

Bottom line: models on unbinned data, even with this arbitrary -99 value, seem to seriously outperform the binned ones. Kappa and Specificity are 8-10% higher; Sensitivity and Accuracy are more or less the same, above 97% anyway. This tentatively confirms (work still in progress) that binning does throw away a lot of information; there are plenty of posts on why binning is detrimental and should be avoided.

I am posting an update because this methodological issue bogged me down for quite a while. Seems like I came up with an imperfect but working solution. Comments are welcome!

P.S. In my case the missing values are clearly MNAR (missing not at random), so imputation is not an option; the NAs in my datasets are not MCAR.

Best Answer

You should use a missing data code that would mess up the results enough that one could look at them and immediately realize something is awry. This is the reason people often use 999. In many data analysis contexts, that is way out of range for a variable, so the results produced couldn't be mistaken for legitimate results. A missing data code of "0" can be insidious. And using NA can be insidious too, because it may silently cause listwise deletion, as it does in regression functions implemented in R. Listwise deletion is often the preferred approach because of its simplicity, but it's best to consciously make that decision rather than having it done silently by default.
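To make the R-specific point concrete, here is a small sketch on simulated data (illustrative numbers only) showing both behaviours: lm() silently drops rows with NA under the default na.omit, while an out-of-range sentinel produces an estimate that is obviously off.

```r
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)           # true slope = 2
d <- data.frame(x, y)
d$x[1:30] <- NA                   # 30% missing

fit <- lm(y ~ x, data = d)
nobs(fit)                         # 70: the 30 NA rows were dropped silently
# summary(fit) notes "(30 observations deleted due to missingness)"

# An out-of-range sentinel instead "sets off the alarm":
d$x_sent <- ifelse(is.na(d$x), -999, d$x)
coef(lm(y ~ x_sent, data = d))    # slope collapses toward 0, clearly wrong
```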

In your case, I don't see why you need a missing data code that blends in with the legitimate values of the predictor, so I don't see why it matters that a large negative number is inconsistent with the rest of the predictor's values. I'd say that having a missing data code that sets off an alarm if it's accidentally treated as a numerical value is a good thing.
