Solved – How to improve rare event binary classification performance

boosting, machine learning, rare-events

I am building a binary classification model in R to predict patient admissions for respiratory issues. Each row in my data set is a patient record. The dependent variable is admitted or not (1 or 0), and the features include age, gender, weather information, and air quality information. All variables are numeric. The data set contains 70,000 records with an admission rate of around 3%.

I searched online for techniques to deal with rare event problems, such as using xgboost, or resampling the data set and combining it with standard ML algorithms.

I am comparing the results of three algorithms:
1. xgboost. My code is:

   library(xgboost)

   # dtrain is assumed to be an xgb.DMatrix built from the numeric training
   # features, with label = training_df$resp_admit
   # weight the positive class by the ratio of negatives to positives
   pos_weight <- sum(training_df$resp_admit == 0) / sum(training_df$resp_admit == 1)

   xgb_mod <- xgboost(data = dtrain,
                      eta = 0.01,
                      max_depth = 9, nrounds = 3000, nthread = 2,
                      subsample = 0.9, colsample_bytree = 0.9,
                      eval_metric = "error", eval_metric = "auc",
                      objective = "binary:logistic",
                      max_delta_step = 6,
                      scale_pos_weight = pos_weight,
                      verbose = 1)

2. Logistic regression and a decision tree, each combined with resampling (ovun.sample from the ROSE package with method = "both"); a sketch of this step follows below.
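Roughly, the resampling step looks like this (a sketch using ovun.sample from the ROSE package; the copy of the training data, the p = 0.5 balance, and the model formulas are illustrative rather than my exact settings):

   library(ROSE)    # provides ovun.sample
   library(rpart)

   # work on a copy so the numeric label used by xgboost stays untouched
   train_glm <- training_df
   train_glm$resp_admit <- factor(train_glm$resp_admit, levels = c(0, 1))

   # rebalance with a mix of over- and under-sampling (p = 0.5 is illustrative)
   balanced <- ovun.sample(resp_admit ~ ., data = train_glm,
                           method = "both", p = 0.5, seed = 1)$data

   # logistic regression and decision tree fit on the rebalanced data
   lr_mod   <- glm(resp_admit ~ ., data = balanced, family = binomial)
   tree_mod <- rpart(resp_admit ~ ., data = balanced, method = "class")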

However, even with parameter tuning and resampling, the classifiers' performance is still poor:

   Metric              Logistic regression   xgboost
   Sensitivity         0.0240480962          0.34482759
   Specificity         0.9703949693          0.61143868
   Pos Pred Value      0.0278422274          0.02945508
   Neg Pred Value      0.9657548696          0.96465116
   Precision           0.0278422274          0.02945508
   Recall              0.0240480962          0.34482759
   F1                  0.0258064516          0.05427408
   Balanced Accuracy   0.4972215327          0.47813313
   AUC                 0.4901495211          0.45190509
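For reference, the metrics above are confusion-matrix-style summaries; a minimal sketch of how they can be computed for the xgboost model (this assumes a held-out xgb.DMatrix `dtest` with matching labels in `test_df$resp_admit`, names not shown above, and a 0.5 probability cutoff):

   library(caret)   # confusionMatrix
   library(pROC)    # auc

   pred_prob  <- predict(xgb_mod, dtest)               # predicted probabilities
   pred_class <- factor(as.integer(pred_prob > 0.5), levels = c(0, 1))
   truth      <- factor(test_df$resp_admit, levels = c(0, 1))

   confusionMatrix(pred_class, truth, positive = "1")  # sensitivity, PPV, F1, ...
   auc(response = truth, predictor = pred_prob)        # AUC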

I am new to rare event problems. My questions are:

  1. I know the performance standard differs from problem to problem, but is there an overall benchmark? For example, should recall or precision be at least 0.5?
  2. What other techniques can I try to improve the performance?
  3. Did I miss anything I should pay attention to when building the model?
  4. I picked some basic features that I thought were important in order to build a simple model. Should I also do feature engineering? Does this step matter a lot for model performance?

The following is how I constructed my data set.

  1. From all the patient admission records, I picked out the ones admitted with respiratory issues. Each of these records is associated with an admission date, and one patient may have multiple records at different admission dates. (Patients with multiple admissions account for about 15% of all patients.) I labeled the dependent variable for these records as 1.
  2. I grouped the above records by year-month. For each day in a month on which a patient was not admitted, I replicated the patient info and labeled the record 0. I did this for all patients falling in that month, and repeated the procedure for each year-month.
    The reason I didn't generate 0 records across the whole time period is that doing so would bring the event rate down to around 0.1%.
  3. I combined all the 1 and 0 records and left-joined the weather and air quality info by date; a sketch of this construction follows below.
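A rough sketch of this construction using dplyr/tidyr (the data frame and column names `resp_adm`, `weather_aq`, `patient_id`, `admit_date` are illustrative placeholders, not my actual names):

   library(dplyr)
   library(tidyr)
   library(purrr)
   library(lubridate)

   # resp_adm: one row per respiratory admission (patient info + admit_date),
   # labelled 1; ym marks the year-month of the admission
   resp_adm <- resp_adm %>%
     mutate(resp_admit = 1,
            ym = floor_date(admit_date, "month"))

   # for each year-month, pair every patient admitted that month with every
   # day of that month; days without an admission become 0 records
   expanded <- resp_adm %>%
     distinct(ym, patient_id, age, gender) %>%
     mutate(date = map(ym, ~ seq(.x, .x %m+% months(1) - days(1), by = "day"))) %>%
     unnest(date) %>%
     left_join(resp_adm %>% select(patient_id, date = admit_date, resp_admit),
               by = c("patient_id", "date")) %>%
     mutate(resp_admit = coalesce(resp_admit, 0))

   # attach weather and air quality info by date
   model_df <- expanded %>% left_join(weather_aq, by = "date")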

I am also concerned about the way I constructed my data set; discussion of this is welcome.

The final goal of this model is to predict respiratory admission risk for a list of patients, based on patient info and the current day's weather and air quality info.

Best Answer

Casting this as a classification problem was a major misstep. This is inherently a "tendency estimation", i.e., probability estimation problem. That is what logistic regression is all about. And you've chosen improper accuracy scores - scores that are optimized by choosing the wrong features and giving them the wrong weights. For details see http://www.fharrell.com/2017/01/classification-vs-prediction.html and http://www.fharrell.com/2017/03/damage-caused-by-classification.html
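For example, a minimal sketch of the probability-estimation approach (assuming the question's training_df/test_df split with the numeric 0/1 outcome resp_admit): fit the logistic model on the original, un-resampled data, keep the predicted probabilities, and judge them with a proper scoring rule such as the Brier score, with the c-index/AUC as a discrimination summary, rather than with thresholded sensitivity/specificity:

   # logistic regression as a probability model, fit on the original
   # (un-resampled) data; training_df / test_df are assumed splits
   lr_mod <- glm(resp_admit ~ ., data = training_df, family = binomial)

   # predicted admission probabilities, not 0/1 classifications
   p_hat <- predict(lr_mod, newdata = test_df, type = "response")

   # proper scoring rule: Brier score (lower is better)
   brier <- mean((p_hat - test_df$resp_admit)^2)

   # discrimination summarised with the c-statistic / AUC
   library(pROC)
   c_stat <- auc(response = test_df$resp_admit, predictor = p_hat)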
