My cursory search did not find this option either.
As you describe the problem, you want to use:
- An imbalanced dataset (85:15).
- Random Forest.
- ROC and AUC-based loss definitions.
- Weka.
Let's try to relax one condition at a time.
Here are some possible alternatives:
- Intentionally skew the data: take all the instances from the 15% label and sample a similar number from the other label. Say you have 850 yellow instances and 150 blue: take all 150 blue instances, sample 150 yellow ones, and train a random forest on the result in Weka. You can use bootstrap resampling if you want to diversify the data (a minimal sketch of this follows the list).
- Use a cost-sensitive classifier and assign a higher cost to false negatives; see cost-sensitive classification in Weka.
- Use a different loss function. Like you, I could not find how to do this for the current framework/algorithm combination.
- Use a different algorithm. SGD in Weka can use different loss functions.
- Use a different ML framework. scikit-learn seems more flexible, but I am unsure whether its implementation of random forest allows for an ROC-curve-based loss (a sketch of its cost-weighting and loss options also follows the list).
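A minimal numpy sketch of the first bullet (the 1000-row feature matrix and the yellow/blue counts are made up to match the example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose X, y hold 850 "yellow" (0) and 150 "blue" (1) instances.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 850 + [1] * 150)

blue = np.flatnonzero(y == 1)                 # keep every minority instance
yellow = rng.choice(np.flatnonzero(y == 0),   # sample an equal number from
                    size=blue.size,           # the majority class
                    replace=False)            # (replace=True for a bootstrap)
keep = rng.permutation(np.concatenate([blue, yellow]))
X_bal, y_bal = X[keep], y[keep]               # balanced 150:150 training set
```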
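And a scikit-learn sketch of the cost-sensitive and alternative-loss bullets (the weights and the synthetic 85:15 data are illustrative; note that `class_weight` reweights errors rather than taking an explicit cost matrix the way Weka's wrapper does):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.15).astype(int)     # ~85:15 imbalance, synthetic

# Cost-sensitive analogue: make errors on the rare class 5x as costly.
rf = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=0)
rf.fit(X, y)

# Swappable loss: SGDClassifier exposes the loss as a parameter
# ("log_loss", "hinge", "modified_huber", ...; recent scikit-learn).
sgd = SGDClassifier(loss="log_loss", class_weight="balanced").fit(X, y)
```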
There are actually multiple ways to do this.
Remember that the AUC is a normalized form of the Mann-Whitney U statistic, that is, the sum of the ranks of the scores belonging to one of the classes. This means that maximizing the AUC amounts to ordering all scores $s_1,\ldots,s_N$ so that the scores in one class rank above those in the other.
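A quick numerical check of that identity, assuming recent SciPy and scikit-learn (the scores below are synthetic):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
s0 = rng.normal(0.0, 1.0, 300)                # scores in class 0
s1 = rng.normal(0.5, 1.0, 200)                # scores in class 1

u = mannwhitneyu(s1, s0).statistic            # rank-sum (Mann-Whitney U)
print(u / (len(s0) * len(s1)))                # normalized U ...
print(roc_auc_score([0] * 300 + [1] * 200,    # ... equals the AUC
                    np.concatenate([s0, s1])))
```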
This can be framed, for example, as a large linear program that is intractable to solve exactly but can be attacked heuristically with appropriate relaxations. The method that interests me more, though, is to find approximate gradients of the AUC so that we can optimize with stochastic gradient descent.
There's plenty to read about this; here is a naive approach:
Writing $[\cdot]$ for the Iverson bracket, another way to state the sought ordering of the scores is that
$[s_i\leq s_j]=1$ for all $i,j$ where the responses are $y_i=0$ and $y_j=1$.
So if the scores are a function of inputs and parameters, $s_i = f(x_i,\theta)$, we want to maximize
$$M^*=\max_\theta \sum_{i,j}[s_i\leq s_j]$$
Consider the relaxation $\tanh(\alpha(s_j-s_i)) \leq [s_i\leq s_j]$, which holds for any steepness $\alpha>0$ (larger $\alpha$ approximates the step more closely).
So $$M^*\geq \sum_{i,j}\tanh(\alpha(s_j-s_i))$$ with a right-hand side that is differentiable in $\theta$.
We can then sample $i$ from one class and $j$ from the other to obtain stochastic contributions $\nabla_\theta \tanh(\alpha(s_j-s_i))$ to the full gradient.
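A minimal numpy sketch of this pairwise SGD on a linear scorer $s = x^\top\theta$ (the two-Gaussian data, $\alpha$, step size, and batch size are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian classes in 2-D.
n = 500
X0 = rng.normal(0.0, 1.0, size=(n, 2))        # class y=0
X1 = rng.normal(0.7, 1.0, size=(n, 2))        # class y=1

theta = np.zeros(2)
alpha, lr, batch = 2.0, 0.1, 64

for _ in range(2000):
    # Sample a mini-batch of pairs: i from class 0, j from class 1.
    i = rng.integers(0, n, size=batch)
    j = rng.integers(0, n, size=batch)
    d = (X1[j] - X0[i]) @ theta               # s_j - s_i for each pair
    # Ascend the surrogate: grad tanh(a*d) = a * (1 - tanh(a*d)^2) * (x_j - x_i)
    w = alpha * (1.0 - np.tanh(alpha * d) ** 2)
    theta += lr * (w[:, None] * (X1[j] - X0[i])).mean(axis=0)

# Evaluate the resulting AUC directly from its rank definition.
s0, s1 = X0 @ theta, X1 @ theta
print("empirical AUC:", (s0[:, None] < s1[None, :]).mean())
```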
Best Answer
As you mention, AUC is a rank statistic (i.e. scale invariant) and log loss is a calibration statistic. One may trivially construct a model which has the same AUC as some other model but fails to minimize log loss relative to it, simply by rescaling the predicted values. Consider:
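A minimal sketch of such a construction (the probabilities are made up; halving them is one monotone rescaling that preserves every rank):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8])
p_scaled = p / 2                              # same ordering, worse calibration

print(roc_auc_score(y, p), roc_auc_score(y, p_scaled))  # identical AUCs
print(log_loss(y, p), log_loss(y, p_scaled))            # different log losses
```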
So we cannot say that a model maximizing AUC also minimizes log loss. Whether a model minimizing log loss corresponds to a maximized AUC depends heavily on the context: class separability, model bias, and so on. In practice one might see a weak relationship, but in general they are simply different objectives. Consider the following example, which grows the class separability (the effect size of our predictor):
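A sketch of such a simulation, assuming scikit-learn (the Gaussian class-conditional model and the grid of effect sizes are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(1)
n = 2000
for delta in [0.25, 0.5, 1.0, 2.0, 4.0]:      # growing class separability
    x = np.concatenate([rng.normal(0, 1, n),
                        rng.normal(delta, 1, n)])[:, None]
    y = np.concatenate([np.zeros(n), np.ones(n)])
    p = LogisticRegression().fit(x, y).predict_proba(x)[:, 1]
    print(f"delta={delta:4.2f}  AUC={roc_auc_score(y, p):.3f}  "
          f"log loss={log_loss(y, p):.3f}")
```

Here both metrics improve together as the separability grows, but at different rates, which is the sense in which the relationship depends on context.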