Solved – Imputing genuinely missing data

r, data-imputation

I am working with loan data that includes a field for "months since last delinquency". If a borrower has not been delinquent on any of their accounts in the past 7 years, this field is missing (NA). So this is a "genuine" case of missing data.

In the dataset I have, the field is missing for less than 4% of the data points (306 out of 8965), but dropping those rows would exclude exactly the "good" borrowers and bias the dataset. I also believe this field has predictive value, so I don't want to remove it.

I know tree-based models can handle missing values. In fact, I already have a model built with XGBoost and it has decent performance.

Now I want to build a simpler linear regression model (with regularization) to make the case that the XGBoost model is worth its complexity. This requires me to impute the missing values.

What value can I use to impute the missing values? Setting it to 84 (the number of months in 7 years) seems to make some sense, but that would imply the borrower was last delinquent 84 months ago, which is not true. I am also wary of imputing something very large (like 999), since those points may then have high leverage.

Here is the summary of the data (in R code):

> nrow(loans)
[1] 8965

> summary(loans$MONTHS_SINCE_DEL)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   1.000   5.058   3.000  81.000     306 

How does one deal with this problem in practice, when working with models that cannot handle missing values?

Best Answer

One possible solution: discretize the "months since last delinquency" variable into categories such as:

  1. Never delinquent (based on NAs)
  2. Last delinquent 0 to 12 months ago
  3. Last delinquent 13 to 34 months ago
  4. And so on...

Or use some modification of this, and include dummy variables in the regression to estimate a parameter for each category. It's worth experimenting with the category definitions to find appropriate cut points.
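A minimal sketch of this in R, using a toy vector in place of your `loans$MONTHS_SINCE_DEL` (the cut points here are illustrative only, not recommendations):

```r
# Toy data standing in for loans$MONTHS_SINCE_DEL; NAs mean "never delinquent".
months_since_del <- c(0, 1, 3, 15, 40, 81, NA, NA)

# Discretize the observed values (experiment with the break points).
del_cat <- cut(months_since_del,
               breaks = c(-Inf, 12, 36, Inf),
               labels = c("0-12", "13-36", "37+"))

# NA here genuinely means "never delinquent in the last 7 years",
# so give it its own level rather than imputing a number.
del_cat <- addNA(del_cat)
levels(del_cat)[is.na(levels(del_cat))] <- "never"
del_cat <- relevel(del_cat, ref = "never")  # "never" as the baseline category

table(del_cat)
```

With "never" as the reference level, the dummy coefficients in a later regression are each category's effect relative to the never-delinquent group.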

Pros:

  • It incorporates all the available information without imputation.
  • It may capture different effect sizes of time since delinquency (rather than assuming a single linear slope).

Cons:

  • It requires more parameters to be estimated in your model.
  • Coefficients for dummy variables won't have the standard linear-coefficient interpretation, but they are still relatively straightforward to interpret.
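To make the dummy-variable fit concrete, here is a sketch with simulated data (the category names and response are placeholders). The plain `lm` fit illustrates the encoding; the same design matrix `X` can be passed to `glmnet::cv.glmnet` for the regularized version you have in mind:

```r
set.seed(1)
n <- 500

# Simulated category factor with "never" as the reference level.
del_cat <- factor(sample(c("never", "0-12", "13-36", "37+"), n, replace = TRUE),
                  levels = c("never", "0-12", "13-36", "37+"))
y <- rnorm(n)  # stand-in response

# One dummy column per non-reference level (intercept column dropped).
X <- model.matrix(~ del_cat)[, -1]

# Unregularized fit for illustration; for ridge/lasso use
# glmnet::cv.glmnet(X, y, alpha = 0) on the same X.
fit <- lm(y ~ del_cat)
coef(fit)  # intercept plus one coefficient per non-"never" category
```

Each dummy coefficient estimates the mean shift in the response for that delinquency bucket relative to the never-delinquent baseline.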

Edit: You do not have missing data, so imputation is not justified at a theoretical level. At a practical level, if you were to impute the NAs, the imputed values would have to be based on the borrowers whose field is fully observed. That would assign the "non-delinquents" values somewhere in the observed range of 0 to 81 months, so the imputed data would essentially say that these people are delinquents too. This is untrue and will bias your model. Additionally, your model would only be applicable to "delinquents", since it would only contain data for "delinquents" (imputed or otherwise).
