At face value I would recommend using patternnet, as it gives you better out-of-sample performance; the results from newff seem suspiciously good, leading me to believe some over-fitting occurs. On that matter, check the following link: Improve Neural Network Generalization and Avoid Overfitting.
To comment on the different results: for newff, Levenberg-Marquardt backpropagation is utilized, while for patternnet, scaled conjugate gradient backpropagation is used. In general, different optimization procedures are not guaranteed to arrive at the same result even if they had the same target function to optimize against. In your case, though, you are also using different target functions (mse and crossentropy respectively). It would probably be more alarming if you did get the same results, as you are fitting different criteria. :)
Having said that, using newff seems a bit odd. It has been considered obsolete since R2010b, and you are recommended (by the docs) to use feedforwardnet instead. Try using feedforwardnet first, and then decide on which procedure you will ultimately use. As it stands, it seems like you are comparing the performance of a function (newff) that people have not worked on for at least 4 years (if not more) against the performance of a function (patternnet) that is actively developed. It is not really surprising that the latter does a better job.
"A precision-recall curve is considered better than an ROC curve when testing a classifier on a dataset with a class imbalance."
This statement is plain wrong.
In general, statements like "X is better than Y" should be taken with a grain of salt. It usually depends on the use case, what your target is, etc. However, the statement above is more wrong than that. Let's take a look.
The PR curve plots the following quantities:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Notice how True Negatives (TN) are absent from these equations?
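To make this concrete, here is a tiny sketch (plain Python, with hypothetical confusion-matrix counts): you can change TN by orders of magnitude and neither metric moves.

```python
# Hypothetical confusion-matrix counts.
TP, FP, FN = 80, 20, 10

for TN in (10, 10_000):           # vary True Negatives wildly
    precision = TP / (TP + FP)    # 0.8 regardless of TN
    recall = TP / (TP + FN)       # ~0.889 regardless of TN
    print(TN, precision, recall)
```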
PR curves are useful when positive examples are rare. If your dataset is imbalanced with rare negatives, you should absolutely not use the PR curve.
As you noticed in your experiment, and as you have correctly reasoned,
ROC curves are insensitive to class imbalance. This means that if you balance your data (i.e. with resampling), the ROC curve does not change (assuming you don't re-train your model).
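You can verify this yourself with a quick sketch (Python with scikit-learn assumed; the model and data are made up): score a model once, then rebalance the evaluation set by subsampling negatives. The ROC AUC stays essentially the same, while the PR summary (average precision) shifts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

# Heavily imbalanced data, scored by a fixed (not re-trained) model.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01],
                           random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Balance the evaluation set by subsampling the negatives.
neg, pos = np.where(y == 0)[0], np.where(y == 1)[0]
keep = np.concatenate([pos, np.random.default_rng(0).choice(
    neg, size=len(pos), replace=False)])

for name, idx in [("imbalanced", np.arange(len(y))), ("balanced", keep)]:
    print(name,
          roc_auc_score(y[idx], scores[idx]),            # ~unchanged
          average_precision_score(y[idx], scores[idx]))  # changes a lot
```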
When the dataset contains only a few positive examples, you have a new problem to care about: positive predictive value (= PPV = Precision). Specifically: given that an observation is classified as positive, what is the probability that it is really a True Positive? The answer to this question can be surprisingly low when positive examples are rare, even for a classifier whose ROC curve looks good.
PPV and NPV (the complement: given that an observation is classified as negative, what is the probability that it is a true negative?) are usually not an issue in balanced datasets, as they track the usual sensitivity and specificity. PPV and NPV only become critical in imbalanced datasets, because these two measures are sensitive to class imbalance, unlike ROC curves. So ROC curves can obscure models with poor PPV and NPV, which can be an issue in the case of imbalance. PR curves will immediately highlight models with poor PPV, and totally disregard NPV.
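A quick worked example (all numbers assumed) shows how dramatic this gets: a classifier with 90% sensitivity and 90% specificity has a PPV of 90% at 50% prevalence, but only about 8% when positives make up 1% of the data.

```python
# Assumed operating point: sensitivity = specificity = 0.9.
sens, spec = 0.9, 0.9

for prev in (0.5, 0.01):  # prevalence of the positive class
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    print(prev, round(ppv, 3), round(npv, 3))
# 0.50 -> PPV = 0.900, NPV = 0.900
# 0.01 -> PPV = 0.083, NPV = 0.999
```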
So in the end, it is up to you to choose which tool to use. Don't use PR curves just because you have imbalance. Are positive examples rare? Are they rare specifically in your sample (but not in the general population)? Then stick with ROC curves. Are they rare in the general population? Then you should consider precision and look at the PR curve too. Are negative cases rare? Then stick with ROC curves.
"G-mean" in itself does not refer to something other than the result of: $g=\sqrt{x\cdot y}$ when dealing with two variables $x$ and $y$. Therefore, unless formally defined I would be careful to interpreter what a particular author refers at.
That said, imbalanced-learn's geometric_mean_score() does the right calculation based on the reference they used. Kubat & Matwin (1997) Addressing the curse of imbalanced training sets: one-sided selection define the geometric mean $g$ based on the "accuracy on positive examples" and the "accuracy on negative examples", which are respectively the metrics Sensitivity (True Positive Rate - TPR) and Specificity (True Negative Rate - TNR). Therefore, the geometric_mean_score() function is correct; it reproduces the methodology presented by the references it cites.

Sensitivity and Specificity are informative metrics on how likely we are to detect instances from the Positive and Negative class respectively in our hold-out test sample. In that sense, Specificity is essentially our Sensitivity for detecting Negative class examples. This is further emphasised when looking at the multi-class version of the G-mean, where we compute the $n$-th root of the product of the Sensitivity for each class. In the case where $n=2$, assuming we have classes A and B with class A as the "Positive" one and class B as the "Negative" one, the Sensitivity of class B is just the Specificity of the binary classification. In the case where $n>2$, we cannot refer to a "Positive" and a "Negative" class (aside from the context of one-vs-rest classification), so we just use the product of the per-class Sensitivity scores, i.e. $\sqrt[n]{x_1 \cdot x_2 \cdot \dots \cdot x_n}$ where $x_i$ here refers to the Recall score of the $i$-th class.

Let me stress that Sensitivity and Specificity are metrics that dichotomise our outputs and should, at first instance, be avoided when optimising classifier performance. A more detailed discussion as to why metrics like Sensitivity and Accuracy that inherently dichotomise our outputs are often suboptimal can be found here: Why is accuracy not the best measure for assessing classification models?
Further commentary: I think some of the confusion about how this "g-mean" is defined stems from the fact that the $F_1$ score is defined in terms of Precision (Positive Predictive Value - PPV) and Recall (TPR) and is the harmonic mean ($h = \frac{2 \cdot x \cdot y}{x+y}$) of the two. Some people might use the geometric mean $g$ instead of the harmonic mean $h$, thinking it is just another reformulation, without realising that they are redefining an existing metric. Please note that the geometric mean of Precision and Recall is not inherently wrong; it is just not what F-scores refer to, nor what the papers cited by imbalanced-learn use.
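For illustration, with assumed values for Precision and Recall, the two means clearly disagree:

```python
# Assumed example values for Precision and Recall.
precision, recall = 0.8, 0.5

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.615
g = (precision * recall) ** 0.5                     # geometric mean, ~0.632
print(f1, g)  # g is a legitimate number, but it is not the F1 score
```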