Solved – Reporting of Neural Network Accuracy for Academic Publications

academia, classification, conv-neural-network, machine-learning, neural-networks

I'm an academic researcher working with Convolutional Neural Networks, particularly for image classification. In academic publications, a typical metric for evaluating the performance of a recognition pipeline is the classification accuracy. What I am wondering is at exactly what point during the training stage this number is taken.

For example, in my experiments I train the network with backpropagation and reduce the learning rate over time. To do this, I observe the testing accuracy and reduce the rate by a certain amount whenever this accuracy is no longer increasing. However, what I notice is that once the system has converged and I continue to train with minibatches, the overall testing accuracy still varies by around 1% after each minibatch, even though the average testing error over all minibatches stays constant.
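
To make the setup concrete, here is a rough sketch of the schedule I am describing, in plain Python; `train_one_epoch` and `evaluate` are placeholders for my actual training and evaluation routines, not calls from any particular library:

    def train_with_plateau_schedule(train_one_epoch, evaluate,
                                    lr=0.01, factor=0.1, patience=5,
                                    max_epochs=200):
        # train_one_epoch(lr) runs one pass over the training data;
        # evaluate() returns the accuracy on the held-out set.
        best_acc = 0.0
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            train_one_epoch(lr)
            acc = evaluate()
            if acc > best_acc:
                best_acc = acc
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
            # Accuracy is no longer increasing: shrink the learning rate.
            if epochs_without_improvement >= patience:
                lr *= factor
                epochs_without_improvement = 0
        return best_acc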

So, my questions are:

  1. When reporting the testing accuracy in an academic publication, is it acceptable to simply take the highest accuracy over all these minibatches? Or should something more representative be used, such as an average over all minibatches?

  2. Sometimes, the testing accuracy actually begins to fall as further training is carried out, due to overfitting. Is it acceptable to stop the training at this point and report this peak testing accuracy, given that the testing dataset is distinct from the training dataset (i.e. the set I use for validation is not a subset of the training data)?

Best Answer

Before I answer your question, let me say that in general it is considered good practice to have a validation set that is completely distinct from your test set. You may or may not be aware of this, but you seem to gloss over it in your question, so I want to make it explicit.
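
To make that split concrete, here is a minimal sketch of the protocol I mean: early stopping is driven by the validation set, and the test set is evaluated exactly once at the end. The routines passed in (`train_one_epoch`, `accuracy_on`, `get_weights`, `set_weights`) are hypothetical stand-ins for your own pipeline, not any specific library's API:

    def train_with_early_stopping(train_one_epoch, accuracy_on,
                                  get_weights, set_weights,
                                  patience=10, max_epochs=200):
        # accuracy_on('val') / accuracy_on('test') return accuracy on the
        # named split; get_weights / set_weights snapshot and restore the model.
        best_val_acc = 0.0
        best_weights = get_weights()
        epochs_since_best = 0
        for epoch in range(max_epochs):
            train_one_epoch()
            val_acc = accuracy_on('val')  # validation set, never the test set
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_weights = get_weights()
                epochs_since_best = 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:  # accuracy falling: stop
                    break
        set_weights(best_weights)
        # The test set is touched exactly once, after every decision is made.
        return accuracy_on('test')

This way the test accuracy you report is not the quantity you used to decide when to stop training, which addresses your second question.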

The answer to your question is an average across all mini-batches of your test set, as the test set is supposed to give an unbiased estimate of how the network will perform in the wild. Another way of stating this is that the test set (since you haven't tuned your hyper-parameters to do well on it) should reflect how well your network generalizes, which is the goal of any network.
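
In code, that averaging amounts to something like the following sketch: accumulate correct predictions over the entire test set and divide once, rather than quoting the single best mini-batch. Here `predict` is a hypothetical stand-in for your network's forward pass:

    def test_accuracy(predict, test_batches):
        # predict(images) returns one predicted label per image;
        # test_batches yields (images, labels) pairs covering the whole test set.
        correct = 0
        total = 0
        for images, labels in test_batches:
            predictions = predict(images)
            correct += sum(int(p == y) for p, y in zip(predictions, labels))
            total += len(labels)
        # Counting examples rather than batches means a smaller final batch is
        # not over-weighted, and no single "lucky" mini-batch is singled out.
        return correct / total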

It would likely be considered deceptive academic practice to cherry-pick the best mini-batch when publishing results, and in general any published result should be reproducible by other researchers. If you artificially inflate your accuracy by choosing the best mini-batch, you make replication of your results difficult and invite other researchers to question the validity of your claims.

Examples of this can be seen in other areas of science: many researchers have swiftly ruined their careers by publishing results that other researchers could not reproduce.

While I can't explicitly tell you whether or not this is typical of academic publications in this field of study, it is definitely unethical, probably immoral, and, as you have already stated yourself, not good practice.

It is likely that as this field matures, more attention will be paid to replication of results, making the ability to replicate results vital not only to the integrity of the researcher but also to whether the research is generally accepted.

Please note that reproducibility is a standard in all scientific endeavors, and computer scientists are not the only ones facing this issue. For instance, I just ran across this BBC article about reproducibility, which (tangentially) addresses some of the points you have brought up in this question, as well as the broader issue of, and need for, reproducibility in general.