Solved – Deep Neural Networks: AUROC Values Consistently = 0.5 Even Though RMS Error on Test Set ~10%

Tags: neural-networks, roc

I am new to neural networks, but I have built a multi-class classifier using the FANN neural network package.

My multi-class classifier, regardless of the network hyperparameters, consistently gives an error of around 10% on my test set (it changes by only 2-3 decimal places past the 10% depending on the network configuration, and these changes are completely non-deterministic). I am using k-fold cross-validation with about 5000 events in total and 5 folds.

Furthermore, I have built functionality for creating ROC curves, one for each class, under the rule that if that class's NN output value > some threshold T, the sample is predicted to be an example of that class, and not otherwise. For some reason, I am almost always getting an AUROC of 0.5 for each class's ROC curve. Sometimes I get a slightly larger value for some class with some set of network hyperparameters (never above 0.6, though), but again this is non-deterministic: if I run the network again with the same hyperparameters, I get different AUROC values for the different classes.
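For reference, a per-class (one-vs-rest) ROC curve is built by sweeping the threshold over all observed scores rather than fixing a single T. Here is a minimal sketch of that, assuming the per-class network outputs are collected into a NumPy array `scores` of shape (n_samples, n_classes) and the integer labels into `y`; these names and the random placeholder data are illustrative, not from my actual setup:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# scores: (n_samples, n_classes) raw NN outputs; y: integer class labels
rng = np.random.default_rng(0)
n_samples, n_classes = 5000, 3
y = rng.integers(0, n_classes, size=n_samples)
scores = rng.random((n_samples, n_classes))          # placeholder for the FANN outputs

for c in range(n_classes):
    y_binary = (y == c).astype(int)                  # one-vs-rest labels for class c
    # roc_curve sweeps every distinct score as a threshold, not a single fixed T
    fpr, tpr, thresholds = roc_curve(y_binary, scores[:, c])
    auc = roc_auc_score(y_binary, scores[:, c])
    print(f"class {c}: AUROC = {auc:.3f}")           # ~0.5 here because the scores are random
```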

What exactly does this mean?

In obtaining my data, I am applying a pre-selection (e.g. dimension x > 5), which reduces my total sample count from around 50,000 to 5000. Could it be that the classifier is having trouble classifying the samples that pass the pre-selection?

Thanks.

Best Answer

This is almost certainly because of how you've mis-defined AUROC. Fixing a single threshold and then assessing performance on that basis is an improper scoring method, because it uses only partial information about the model outputs; a genuine ROC curve is traced out by sweeping the threshold over all possible values. I don't know the multi-class AUROC literature well, but that is where you'll want to read to make progress. Methods like cross-entropy and the Brier score generalize more readily to the multi-class case, so you have options there as well, and there are variants of multi-class ROC that are insensitive to class imbalance.
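As a concrete illustration of those multi-class alternatives, here is a hedged sketch of computing cross-entropy (log loss) and a multi-class Brier score directly from predicted probabilities. The array names and the softmax normalization are assumptions, since the question doesn't show how the FANN outputs are normalized:

```python
import numpy as np
from sklearn.metrics import log_loss

def multiclass_brier(y_true, probs):
    """Mean squared difference between predicted probabilities and one-hot labels."""
    n_classes = probs.shape[1]
    one_hot = np.eye(n_classes)[y_true]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))

# Assumed setup: raw NN outputs converted to probabilities with a softmax.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=1000)
raw = rng.normal(size=(1000, 3))
probs = np.exp(raw) / np.exp(raw).sum(axis=1, keepdims=True)

print("cross-entropy:", log_loss(y, probs))
print("Brier score:  ", multiclass_brier(y, probs))
```

Both scores use the full predicted probability vector for every sample, so they reward well-calibrated confidence rather than just which side of a threshold a prediction lands on.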

Proper versus improper scoring rules are discussed all over our archives, so you can find those discussions using the [search] feature; Frank Harrell's posts are particularly good. You should always use the full information in the model outputs, i.e. do not threshold them. Thresholds create vexing scenarios where the decision rule $\hat{y}>0.5$ treats $\hat{y}_1=0.99$ the same as $\hat{y}_2=0.51$, even though the model is obviously much more confident about the first instance.
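To make that last point concrete, here is a tiny worked example (binary case, values chosen only for illustration): a thresholded rule scores both predictions identically, while a proper score such as the Brier score rewards the more confident correct prediction:

```python
import numpy as np

y_true = np.array([1, 1])            # both instances are actually positive
y_hat = np.array([0.99, 0.51])       # the model is far more confident about the first

# Thresholded decision rule: both predictions count as the same "correct" answer
accuracy = np.mean((y_hat > 0.5) == y_true)

# Brier score (lower is better): penalises the hesitant 0.51 far more than the 0.99
brier = (y_hat - y_true) ** 2

print("accuracy:", accuracy)          # 1.0 for both, no distinction
print("per-instance Brier:", brier)   # [0.0001, 0.2401]
```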