Optimizing for target metrics in Weka

Tags: binary-data, classification, metric, optimization, weka

I'm a PhD student in Information Retrieval with some limited experience in ML. We've been working on a binary classification task with Weka (which I call programmatically from Java), specifically with Random Forest.

Our results are coming out a little weird because we have an unbalanced dataset (85/15ish). We're getting very high % correct, but precision and recall are very low for our target class (the 15% one).
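To make the mismatch concrete, here is a minimal Java sketch that computes the metrics from a confusion matrix. The counts are invented for illustration (850 majority and 150 target instances, roughly the 85/15 split above); they are not results from the actual runs, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Illustration only: hypothetical confusion-matrix counts for an 85/15 dataset.
final class ImbalanceMetrics {
    // Fraction of all instances classified correctly ("% correct").
    static double accuracy(int tp, int fp, int tn, int fn) {
        return (double) (tp + tn) / (tp + fp + tn + fn);
    }

    // Of the instances predicted as the target class, how many really are.
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Of the real target-class instances, how many were found.
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Assumed counts: 150 target instances (30 found, 120 missed),
        // 850 majority instances (20 wrongly flagged, 830 correct).
        int tp = 30, fn = 120, fp = 20, tn = 830;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.println(String.format(Locale.ROOT,
                "accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f",
                accuracy(tp, fp, tn, fn), p, r, f1(p, r)));
    }
}
```

With these counts, accuracy comes out at 0.86 while recall on the target class is only 0.20. A model that always predicts the majority class would already score 0.85, so the high % correct says very little.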

My new understanding is that % correct is really not the right metric to be looking at. The professor I work with said (and I quote): "You are measuring accuracy with 'percent correct'. This is so rarely done in machine learning papers these days that I just blew right past it." He also referenced a paper explaining why I shouldn't use % correct as Accuracy [1].

In our case, we are interested in precision and recall to some extent, but the professor I'm working with (he's an ML expert) explained that we can and should use AUC-ROC to compare the runs because it is not sensitive to class balance. Once he walked me through it in depth, it made sense, and I was able to get the AUC values out of the Weka results. They are decent though not spectacular (in the 0.75 neighborhood).
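One way to see the class-balance argument is the ranking interpretation of AUC: it is the probability that a randomly chosen target instance gets a higher score than a randomly chosen non-target instance. The Java sketch below computes AUC that way by brute force. The scores are made up, and the names (`AucSketch`, `auc`, `repeat`) are hypothetical helpers, not part of Weka.

```java
// Illustration only: AUC as the fraction of (positive, negative) pairs
// in which the positive instance receives the higher score.
final class AucSketch {
    // Ties between a positive and a negative score count as half a win.
    static double auc(double[] posScores, double[] negScores) {
        double wins = 0.0;
        for (double p : posScores) {
            for (double n : negScores) {
                if (p > n) {
                    wins += 1.0;
                } else if (p == n) {
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) posScores.length * negScores.length);
    }

    // Copies the array `times` times to mimic a more imbalanced dataset
    // with the same score distribution.
    static double[] repeat(double[] values, int times) {
        double[] out = new double[values.length * times];
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i % values.length];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] pos = {0.9, 0.6, 0.4};        // assumed scores, target class
        double[] neg = {0.7, 0.5, 0.3, 0.2};   // assumed scores, other class
        System.out.println(auc(pos, neg));             // 3 vs 4 instances
        System.out.println(auc(pos, repeat(neg, 5)));  // 3 vs 20 instances
    }
}
```

Both calls give the same value, because multiplying the number of non-target instances without changing how they are scored leaves the pairwise win rate unchanged. This only illustrates the idea under that assumption; on real data the score distributions can shift when the sample changes.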

I'm used to IR systems in which you can tune for various metrics, e.g. Precision, F values, MAP, etc. However, as far as I can tell, Weka always trains its classifier models to optimize for % correct. So even though I am interested in another metric, e.g., Precision or F1, I can't for the life of me figure out how to encourage Weka to train its model to focus on optimizing for anything other than % correct (say, F1).

I've combed the Weka docs and Googled the heck out of it (incl. site search here on CrossValidated) but couldn't find anything.

Is that possible? I would really appreciate any insight into whether it's possible at all, whether it's just not implemented in Weka, or whether there's some reason why it shouldn't be done. Or perhaps there's something I'm missing because I'm calling Weka from Java rather than using the GUI.

[1] Provost, F. J., Fawcett, T., & Kohavi, R. (1998, July). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445-453).
http://eecs.wsu.edu/~holder/courses/cse6363/spr04/pubs/Provost98.pdf

Best Answer

My cursory search did not find this option either. As you describe the problem, you want to use:

  1. An imbalanced dataset (85:15).
  2. Random Forest.
  3. ROC and AUC-based loss definitions.
  4. Weka.

Let's try to relax one condition at a time.

Here are some possible alternatives:

  1. Intentionally rebalance the data: take all the instances from the 15% label and sample a similar number from the other label. Say you have 850 yellow instances and 150 blue: take all the blue instances and sample 150 yellow ones. Then train a random forest using Weka. You can use bootstrap resampling if you want to diversify the data.
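  2. Use a cost-sensitive classifier, and set the cost of false negatives higher than the cost of false positives. Weka supports cost-sensitive classification (for example, through its CostSensitiveClassifier meta-classifier).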
  3. Use a different loss function. Like you, I could not find how to do this for the current framework/algorithm combination.
  4. Use a different algorithm. SGD in Weka can use different loss functions.
  5. Use a different ML framework. scikit-learn seems more flexible, but I am unsure whether its implementation of random forest allows for ROC curve-based loss.
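
As a rough illustration of alternatives 1 and 2, here is a plain-Java sketch that does not call Weka at all. It shows the sampling step (keep every minority instance, draw an equal number from the majority) and the usual minimum-expected-cost decision rule behind cost-sensitive prediction. The class name, method names, and cost values are assumptions made for the example; in practice you would feed the balanced sample, or the costs, to whatever learner you use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustration only: hypothetical helpers, not Weka API.
final class ImbalanceOptions {
    // Alternative 1: keep all minority instances and add a random sample
    // of the majority class of the same size.
    static <T> List<T> undersample(List<T> majority, List<T> minority, long seed) {
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed));
        int keep = Math.min(minority.size(), shuffled.size());
        List<T> balanced = new ArrayList<>(minority);
        balanced.addAll(shuffled.subList(0, keep));
        return balanced;
    }

    // Alternative 2: given the predicted probability of the target class,
    // choose the label with the lower expected misclassification cost.
    static boolean predictTarget(double pTarget,
                                 double costFalseNegative,
                                 double costFalsePositive) {
        double costIfPredictOther = pTarget * costFalseNegative;
        double costIfPredictTarget = (1.0 - pTarget) * costFalsePositive;
        return costIfPredictTarget < costIfPredictOther;
    }

    public static void main(String[] args) {
        List<String> yellow = new ArrayList<>(Collections.nCopies(850, "yellow"));
        List<String> blue = new ArrayList<>(Collections.nCopies(150, "blue"));
        // 150 blue + 150 sampled yellow
        System.out.println(undersample(yellow, blue, 42L).size());

        // Assumed costs: missing a target instance is 5x worse than a false alarm.
        System.out.println(predictTarget(0.3, 5.0, 1.0)); // expected cost 0.7 vs 1.5
        System.out.println(predictTarget(0.3, 1.0, 1.0)); // expected cost 0.7 vs 0.3
    }
}
```

With a 5:1 cost ratio the rule is equivalent to predicting the target class whenever its probability exceeds 1/6, instead of the default 1/2. That trades precision for recall, so the ratio is something to tune against the metric you actually care about.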