Machine Learning – Why Use Loss Functions to Estimate Models Instead of Accuracy Metrics?

loss-functions, machine-learning, metric

When building a learning algorithm, we aim to maximize a given evaluation metric (say, accuracy), yet the algorithm optimizes a different loss function during training (say, MSE or cross-entropy).

Why aren't the evaluation metrics used directly as loss functions for the learning algorithm? Wouldn't we then be optimizing the very metric we are interested in?

Is there something I am missing?

Best Answer

It's a good question. Generally, I would argue that you should try to optimise a loss function which corresponds to the evaluation metric you care most about.

You might however want to know about other evaluation metrics.

For example, when doing classification, I'm of the opinion that you would need to give me a pretty good reason not to be optimising the cross-entropy. That said, cross-entropy is not a very intuitive metric, so once you've finished training you might also want to know your classification accuracy, to get a feel for whether the model is actually going to be of any real-world use (it might be the best possible model, with a better cross-entropy than everybody else's, yet still lack the accuracy to be useful in practice).
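
To make that concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset, neither of which is from the question): the model is fit by minimising the cross-entropy (log-loss), and accuracy is only computed afterwards as the more interpretable evaluation metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression optimises the cross-entropy (log-loss) during fitting.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The loss we actually optimised...
print("test log-loss:", log_loss(y_test, model.predict_proba(X_test)))
# ...and the more intuitive metric we report to judge real-world usefulness.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```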

Another argument, which I'm less familiar with, applies mainly to tree-based (or other greedy) algorithms: whether using certain losses means you make better splits early on, allowing you to better optimise the metric you care about globally. For example, people tend to use Gini impurity or information entropy (note, not cross-entropy) when deciding on the best split in a decision tree. The only arguments I've ever heard for this are not very convincing, and are basically arguments for not using accuracy but using cross-entropy instead (things around class imbalance, perhaps). I can think of two reasons you might use Gini when ultimately trying to get the best cross-entropy (a small comparison is sketched after this list):

  1. Something to do with local learning and greedy decision-making, as alluded to above (not convinced by this, I must add).

  2. Something to do with the actual computational implementation. In theory, a decision tree evaluates every possible split at every node and finds the best according to your criterion, but in reality, as I understand it, implementations do not do this and instead use approximate algorithms, which I suspect leverage properties of the split criterion.
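
A rough sketch of that distinction (again assuming scikit-learn and a synthetic dataset, which are not from the question): the split criterion (Gini vs. entropy) is a local objective applied greedily at each node, while the metric we ultimately care about (here log-loss and accuracy) is evaluated on the fitted tree as a whole.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same tree, two different local split criteria; the global metrics are
# computed afterwards on the held-out set.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=0)
    tree.fit(X_train, y_train)
    print(
        criterion,
        "log-loss:", round(log_loss(y_test, tree.predict_proba(X_test)), 4),
        "accuracy:", round(accuracy_score(y_test, tree.predict(X_test)), 4),
    )
```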

In summary, the main reason you would have multiple evaluation metrics is to understand what your model is doing. There might also be reasons, related to finding the best solution by approximate methods, why you want to maximise metric A in order to obtain a solution that comes close to maximising metric B.