Solved – cross-validation method issues when evaluating a biased data set

cross-validation, machine-learning, precision-recall, python, scikit-learn

I think that when cross-validation calculates precision and recall (see my code below), it reports an average of the per-class precision and recall. Does that mean that if one class is predicted with high precision/recall and the other class with relatively low precision/recall, the final numbers can still look fine?

Suppose a two-class classification problem where one class has more than 95% of the labelled data and the other class has only 5%. The two classes are heavily imbalanced.

I am doing cross-validation to evaluate different classifiers. I found that if a classifier simply predicts the majority (95%) class, it is hard to tell from the averaged precision/recall that its predictions on the other class are poor, because that class accounts for only 5% of the labelled data.

Here are the methods/metrics (using precision/recall) I am using. I am wondering whether there are better metrics or methods for evaluation that take the minority 5% class into account. I am assigning a weight to the minority 5% class, but I am asking here for a more systematic way to measure performance on an imbalanced data set.

If there is a solution in scikit-learn, that would be great.

BTW, do you think averaging the precision/recall of the two classes is OK even for imbalanced data, given that in model training we treat the two classes with equal weight?

Using scikit-learn + Python 2.7.

from sklearn import cross_validation  # pre-0.18 API, matching Python 2.7

# 'weighted' averaging weights each class by its support, so the 95% class dominates the score.
scores = cross_validation.cross_val_score(bdt, X, Y, cv=10, scoring='recall_weighted')
print("Recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_validation.cross_val_score(bdt, X, Y, cv=10, scoring='precision_weighted')
print("Precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

            precision    recall  f1-score   support
         0       0.95      0.99      0.97       941
         1       0.45      0.10      0.16        51
avg / total       0.93      0.95      0.93       992
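For reference, the per-class table above is the output format of sklearn.metrics.classification_report. A minimal sketch of producing it on a held-out split, assuming bdt, X, Y as in the code above and the older pre-0.18 cross_validation module:

from sklearn.cross_validation import train_test_split  # pre-0.18 location of train_test_split
from sklearn.metrics import classification_report

# Hold out 20% of the data, fit the classifier, and print per-class precision/recall/F1.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
bdt.fit(X_train, Y_train)
print(classification_report(Y_test, bdt.predict(X_test)))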

Best Answer

Almost all practical two-class problems are unbalanced, and predicting the '1' (minority) class is almost always what matters most, e.g., fraud detection.

To evaluate these models, precision, recall, and the F1 score (https://en.wikipedia.org/wiki/F1_score) are widely used. Recall and precision are not used in isolation; you typically fix one and optimise the other. (Please note that all of these are low for the minority class in your output.)
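In scikit-learn you can also score the minority class directly instead of taking a weighted average, so the 95% class cannot mask poor performance. A sketch using the same cross_validation module as in the question, assuming the 5% class is labelled 1:

from sklearn import cross_validation
from sklearn.metrics import f1_score, make_scorer

# Cross-validated F1 of the minority class only (label 1).
minority_f1 = make_scorer(f1_score, pos_label=1)
scores = cross_validation.cross_val_score(bdt, X, Y, cv=10, scoring=minority_f1)
print("Minority-class F1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))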

You can also use the ROC curve and Cohen's kappa. The ROC curve is useful for understanding the classifier and for choosing the trade-off between the true-positive and false-positive rates.
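Both are available in scikit-learn: 'roc_auc' is a built-in scoring string, and kappa can be wrapped with make_scorer (cohen_kappa_score requires scikit-learn >= 0.17). A sketch, again assuming the question's bdt, X, Y:

from sklearn import cross_validation
from sklearn.metrics import cohen_kappa_score, make_scorer

# ROC AUC ranks the predicted scores, so it is not dominated by the 95% class.
auc = cross_validation.cross_val_score(bdt, X, Y, cv=10, scoring='roc_auc')
print("ROC AUC: %0.2f (+/- %0.2f)" % (auc.mean(), auc.std() * 2))

# Kappa measures agreement above what always predicting the majority class would give by chance.
kappa = cross_validation.cross_val_score(bdt, X, Y, cv=10, scoring=make_scorer(cohen_kappa_score))
print("Kappa: %0.2f (+/- %0.2f)" % (kappa.mean(), kappa.std() * 2))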

E.g., for fraud detection you would want to tag all of the frauds (the minority class) correctly, even if that means a few of the zeros are classified incorrectly. To achieve this you would lower the probability threshold along the ROC curve, although this will reduce overall accuracy.
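A sketch of shifting the decision threshold, assuming bdt exposes predict_proba (true for most scikit-learn classifiers); the 0.2 threshold is purely illustrative, and in practice you would pick it from the ROC or precision-recall curve:

from sklearn.cross_validation import train_test_split
from sklearn.metrics import recall_score, precision_score

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
bdt.fit(X_train, Y_train)

# Probability of the minority class (column 1) on held-out data.
proba = bdt.predict_proba(X_test)[:, 1]

# Lower the threshold from the default 0.5 to catch more of the rare class.
threshold = 0.2
Y_pred = (proba >= threshold).astype(int)
print("Minority recall: %0.2f, precision: %0.2f" % (
    recall_score(Y_test, Y_pred, pos_label=1),
    precision_score(Y_test, Y_pred, pos_label=1)))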

Some quick references: http://www.biostat.umn.edu/~susant/SPRING09PH6415/Lesson4c.pdf and http://mchp-appserv.cpe.umanitoba.ca/viewConcept.php?printer=Y&conceptID=1047

The bigger issue is handling the imbalanced data set itself, which can be dealt with in a few different ways, e.g., class weighting or resampling the training data; a sketch of class weighting follows.
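One systematic option in scikit-learn is class weighting: many estimators accept class_weight='balanced' (called 'auto' in older releases), which makes errors on the 5% class cost proportionally more. A sketch assuming bdt is a boosted decision tree, as its name suggests:

from sklearn import cross_validation
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Reweight classes inside the base learner inversely to their frequency.
base = DecisionTreeClassifier(max_depth=2, class_weight='balanced')
weighted_bdt = AdaBoostClassifier(base_estimator=base, n_estimators=200)

scores = cross_validation.cross_val_score(weighted_bdt, X, Y, cv=10, scoring='roc_auc')
print("ROC AUC with class weighting: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))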
