Solved – Very Confused: Getting AUC of 1 and 100% accuracy for classification task

Tags: logistic, machine-learning, python, scikit-learn, separation

I am building a healthcare readmission model; it is a binary classification task. I have around 90K observations with close to 500 features. All but 9-10 of the features are binary.

I did 5-fold cross-validation and ran three algorithms: stochastic gradient descent SVM, vanilla logistic regression, and logistic regression with elastic net regularization.

There is a big class imbalance problem, around 8:1 in favor of the negative class. Here are the counts of the dependent variable values:

model_data1.readmission.value_counts()
Out[101]:
0    80571
1     9717
dtype: int64

Because of the class imbalance, I used the class_weight parameter in the sklearn algorithms and weighted it 1:8 in favor of the positive class. Here is one example:

sgd_lr=SGDClassifier(loss='log',penalty='elasticnet',alpha=0.002,l1_ratio=0.70,class_weight={0:1,1:8})

And here are the diagnostic statistics for all three algorithms:

('The mean accuracy of Stochastic Gradient Descent SVM on CV data is:', 0.99944621379697762)
('The mean accuracy of Logistic regression on CV data is:', 1.0)
('The mean accuracy of Stochastic Gradient Descent Logistic on CV data is:', 0.99968355807253673)
('The accuracy of SGD SVM on test data is:', 0.99929855650311961)
Classification Metrics for
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     24134
          1       0.99      1.00      1.00      2953

avg / total       1.00      1.00      1.00     27087

Confusion matrix
[[24117    17]
 [    2  2951]]
('The accuracy of Logistic with Elastic Net on test data is:', 0.99963081921216823)
Classification Metrics for
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     24134
          1       1.00      1.00      1.00      2953

avg / total       1.00      1.00      1.00     27087

Confusion matrix
[[24127     7]
 [    3  2950]]
('The accuracy of Logistic Regression on test data is:', 1.0)
Classification Metrics for
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     24134
          1       1.00      1.00      1.00      2953

avg / total       1.00      1.00      1.00     27087

Confusion matrix
[[24134     0]
 [    0  2953]]
[ROC curve for Stochastic Gradient Descent SVM]

[ROC curve for Stochastic Gradient Descent Logistic Regression with Elastic Net]

[ROC curve for Stochastic Gradient Descent Logistic Regression with L1]

I am not sure this can be true. One cannot get 100% accuracy on test data in a real machine learning exercise; it is unbelievable. Please advise: what is going wrong here?

Thanks

Best Answer

This situation can occur in logistic regression with a large number of candidate predictors, even with tens of thousands of cases as you have here. With 500 binary predictor variables you have $2^{500} \approx 3.3 \times 10^{150}$ possible combinations of predictor values, roughly the square of the number of atoms in the observable universe (about $10^{80}$). You have found a particular combination of predictors that completely distinguishes the two groups in this particular data set. This result is unlikely to generalize to new cases, as you recognize.
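The separation phenomenon is easy to reproduce on toy data: with more predictors than observations, the classes are almost surely linearly separable even when the features carry no signal at all. A minimal sketch (sizes shrunk for speed; nothing here uses the original data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 500                      # more predictors than observations
X = rng.integers(0, 2, size=(n, p))  # random binary features, no real signal
y = rng.integers(0, 2, size=n)       # labels generated independently of X

# Weak regularization (large C) lets the fit exploit the separation.
clf = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
print(clf.score(X, y))               # training accuracy 1.0: perfect separation
```

With $n \le p$ points in general position, a separating hyperplane always exists, so the in-sample fit is perfect; a held-out test set would score near 0.5. The questioner's case differs in that the separation survived on the test set, which points toward leakage rather than pure overfitting.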

For ideas about how to proceed, follow the hauck-donner-effect tag on this site; the questions and answers collected there are good places to start.

In response to comment:

If you set aside separate training and test sets, either you got really lucky in finding a reproducible linear separation in your training set, or some predictor is acting as a proxy for readmission. You are unlikely, unfortunately, to have solved the vexing problem of predicting readmission, where an AUC of 0.8 might be considered spectacular.

Make sure that the features you are examining are only those with data available at the time of the prior discharge (before the readmission/no-readmission cutoff time). With 500 features (a lot for clinical data) it is possible that one slipped through that only takes a particular value for readmitted patients. Follow the suggestions by @spdrnl in the comments to examine the individual predictors and their combinations in more detail.
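One way to act on that advice is to scan each predictor's standalone AUC against the outcome: a feature that nearly perfectly ranks the outcome on its own is a likely proxy or leak. A hedged sketch on synthetic data (the column names are made up, and the leaky column is planted deliberately for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    'age_over_65':   rng.integers(0, 2, n),   # genuine pre-discharge feature
    'diabetes_flag': rng.integers(0, 2, n),   # genuine pre-discharge feature
})
df['readmission'] = rng.integers(0, 2, n)
# Plant a leak: a field that is only populated for readmitted patients.
df['followup_scheduled'] = df['readmission']

y = df.pop('readmission')
# Standalone AUC per feature; take max(a, 1-a) so inverted proxies also flag.
aucs = {col: roc_auc_score(y, df[col]) for col in df.columns}
suspects = [c for c, a in aucs.items() if max(a, 1 - a) > 0.95]
print(suspects)   # only the planted leak exceeds the threshold
```

Any real feature flagged this way deserves a hard look at when its value becomes known relative to the readmission event.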