Say that I am building a linear regression model for predicting some values $y_1,…,y_n$.
If the data contains a few extreme outliers in the response, or even just one, the MSE-fitted equation can be pulled arbitrarily far away from the MAE-fitted one.
Consider the simplest regression model (just an intercept, $\alpha$), and the following data:
0.0003 0.0001 0.0002 0.0004 50000 0.0002 0.0004 0.0003 0.0001 0.0003
The MAD solution (the median) is $\alpha = 0.0003$. The MSE solution (the mean) is $\alpha = 5000.00023$.
The MAD of the minimum-MAD solution is about 0.0001. The MAD of the minimum-MSE solution is about 5000. You can potentially do very badly if you use MSE when the criterion is MAD.
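The example above can be sketched numerically in a few lines of pure Python: the MAE-optimal intercept is the median, the MSE-optimal intercept is the mean, and we then score both by the median absolute deviation (MAD) of their residuals.

```python
# Sketch of the intercept-only example: median vs mean as fitted intercepts,
# each evaluated by the median absolute deviation (MAD) of its residuals.
from statistics import mean, median

y = [0.0003, 0.0001, 0.0002, 0.0004, 50000,
     0.0002, 0.0004, 0.0003, 0.0001, 0.0003]

alpha_mad = median(y)   # 0.0003: minimises absolute error, ignores the outlier
alpha_mse = mean(y)     # 5000.00023: dragged far away by the single outlier

def mad(data, alpha):
    """Median absolute deviation of residuals around the intercept alpha."""
    return median(abs(v - alpha) for v in data)

# mad(y, alpha_mad) is about 0.0001; mad(y, alpha_mse) is about 5000.
```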
This is a good question that unfortunately went unanswered for a long time. It seems a partial answer was given here just a couple of months after the question was asked, basically arguing that correlation is useful when the outputs are very noisy, and MSE otherwise. I think first of all we should look at the formulas for both.
$$MSE(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^n(y_i - \hat{y_i})^2$$
$$R(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \bar{y})(\hat{y_i} - \hat{\bar{y}})}
{\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2} \sqrt{\sum ^n _{i=1}(\hat{y_i} - \hat{\bar{y}})^2}} $$
A few things to note. In the case of linear regression we know that $\hat{\bar{y}} = \bar{y}$ because of the unbiasedness of the regressor, so the formula simplifies a little, but in general we can't make this assumption about ML algorithms. Perhaps more broadly, it is interesting to think of the scatter plot in $\mathbb{R}^2$ of $\{ y_i, \hat{y_i}\}$: correlation tells us how strong the linear relationship between the two is in this plot, while MSE tells us how far from the diagonal each point is. Looking at the counterexamples on the Wikipedia page, you can see there are many relationships between the two that won't be represented.
I think correlation generally tells similar things to $R^2$, but with directionality, so correlation is somewhat more descriptive in that case. In another interpretation, $R^2$ doesn't rely on the linearity assumption and merely tells us the percentage of variation in $y$ that's explained by our model. In other words, it compares the model's prediction to the naive prediction of guessing the mean for every point. The formula for $R^2$ is:
$$R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i-\hat{y_i})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}$$
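With both formulas on the table, here is a quick numeric check on made-up data (pure Python, no libraries assumed): for an OLS fit with an intercept, $R^2 = 1 - SSE/SST$ coincides with the squared correlation between $y$ and $\hat{y}$.

```python
# Numeric check: for OLS with intercept, 1 - SSE/SST equals pearson(y, yhat)^2.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

# Fit simple OLS: yhat = a + b * x
xbar, ybar = mean(x), mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

def pearson(u, v):
    """Sample Pearson correlation R(u, v)."""
    ub, vb = mean(u), mean(v)
    num = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
    den = (sum((ui - ub) ** 2 for ui in u)
           * sum((vi - vb) ** 2 for vi in v)) ** 0.5
    return num / den

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst          # coefficient of determination
r = pearson(y, yhat)        # correlation between targets and predictions
# For this OLS fit, r ** 2 and r2 agree up to floating point.
```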
So how does $R$ compare to $R^2$? Well, it turns out that $R$ is more immune to rescaling of one of the inputs. This has to do with the fact that $R^2$ is homogeneous of degree 0 only in both inputs jointly, whereas $R$ is homogeneous of degree 0 in either input separately. It's a little less clear what this might imply in terms of machine learning, but it might mean that the model class of $\hat{y}$ can be a bit more flexible under correlation. That said, under some additional assumptions the two measures are equal, and you can read more about it here: http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/.
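The homogeneity point can be sketched on toy data: rescaling only the predictions leaves the correlation $R$ unchanged, while $R^2 = 1 - SSE/SST$ changes drastically.

```python
# Sketch: scale one input only; R is invariant, R^2 (as 1 - SSE/SST) is not.
from statistics import mean

def pearson(u, v):
    ub, vb = mean(u), mean(v)
    num = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
    den = (sum((ui - ub) ** 2 for ui in u)
           * sum((vi - vb) ** 2 for vi in v)) ** 0.5
    return num / den

def r_squared(y, yhat):
    ybar = mean(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

y = [1.0, 2.0, 3.0, 4.0, 5.0]
yhat = [1.1, 1.9, 3.2, 3.9, 5.1]
scaled = [2 * v for v in yhat]   # rescale the predictions only

# pearson(y, yhat) and pearson(y, scaled) are identical;
# r_squared(y, yhat) is near 1 while r_squared(y, scaled) goes negative.
```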
Finally, a last important thing to note is that $R$ and $R^2$ do not measure goodness of fit around the $y=\hat{y}$ line. It is possible (although odd) for a predictor to be linearly shifted away from the $y=\hat{y}$ line while still having a correlation of one (and hence a squared correlation of one), but the predictions would still be "bad". In this case, MSE would be more informative than $R^2$ in finding the better predictor, but perhaps this is more of a pathological case than an issue with using $R$ and $R^2$ as metrics.
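This pathological case is easy to reproduce with made-up numbers: predictions shifted by a constant correlate perfectly with the targets, yet MSE immediately exposes them.

```python
# Sketch: a constant shift gives correlation 1 with y but a large MSE.
from statistics import mean

def pearson(u, v):
    ub, vb = mean(u), mean(v)
    num = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
    den = (sum((ui - ub) ** 2 for ui in u)
           * sum((vi - vb) ** 2 for vi in v)) ** 0.5
    return num / den

y = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 5 for v in y]     # linearly shifted off the y = yhat line

r = pearson(y, shifted)                               # 1.0 (up to rounding)
mse = mean((a - b) ** 2 for a, b in zip(y, shifted))  # 25.0
```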
Best Answer
It's a good question. Generally, I would argue that you should try to optimise a loss function which corresponds to the evaluation metric you care most about.
You might however want to know about other evaluation metrics.
For example, when doing classification, I'm of the opinion that you would need to give me a pretty good reason to not be optimising the cross-entropy. That said, the cross-entropy is not a very intuitive metric, so you might, once you've finished training, also want to know how good your classification accuracy is, to get a feel for whether your model is actually going to be of any real world use (it might be the best possible model and have a better cross-entropy than everybody else's, but still have insufficient accuracy to be of use in the real world).
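To make the accuracy-versus-cross-entropy point concrete, here is an illustration on made-up probabilities: two hypothetical classifiers with identical accuracy but very different cross-entropy, which is why tracking both is useful.

```python
# Sketch: same decisions at threshold 0.5 (same accuracy), different log loss.
from math import log

y_true = [1, 1, 0, 0]
p_a = [0.9, 0.8, 0.2, 0.1]    # confident and correct
p_b = [0.6, 0.55, 0.45, 0.4]  # same decisions, much less confident

def cross_entropy(y, p):
    """Average binary cross-entropy (log loss)."""
    return -sum(yi * log(pi) + (1 - yi) * log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

def accuracy(y, p, threshold=0.5):
    return sum((pi >= threshold) == bool(yi) for yi, pi in zip(y, p)) / len(y)

# accuracy is 1.0 for both models,
# but cross_entropy(y_true, p_a) < cross_entropy(y_true, p_b).
```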
Another argument I'm less familiar with arises mainly in tree-based (or other greedy) algorithms: whether using certain losses means you make better splits early on, allowing you to better optimise the metric you care about globally. For example, people tend to use Gini or information entropy (note, not cross-entropy) when deciding on the best split in a decision tree. The only arguments I've ever heard for this are not very convincing, and are basically arguments for not using accuracy but using cross-entropy instead (things around class imbalance, maybe). I can think of two reasons you might use Gini when trying to get the best cross-entropy:
1. Something to do with local learning and greedy decision-making, as alluded to above (not convinced by this, I must add).
2. Something to do with the actual computational implementation. In theory, a decision tree evaluates every possible split at every node and finds the best according to your criterion, but in reality, as I understand it, it does not do this and uses approximate algorithms, which I suspect leverage properties of your loss criterion.
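For reference, the two split criteria mentioned above are both simple functions of the class proportions; here is a sketch on hypothetical class counts.

```python
# Sketch: Gini impurity and information entropy for a candidate tree split.
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def weighted_impurity(children, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * impurity(child) for child in children)

# Parent node with 10 positives and 10 negatives, candidate split (8,2)/(2,8):
split = [(8, 2), (2, 8)]
g = weighted_impurity(split, gini)     # 0.32, down from gini((10, 10)) = 0.5
h = weighted_impurity(split, entropy)  # ~0.72, down from entropy((10, 10)) = 1.0
```

On most candidate splits the two criteria rank the options the same way, which is part of why the choice between them rarely changes the resulting tree much.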
In summary, the main reason you would have multiple evaluation metrics is to understand what your model is doing. There might also be reasons related to finding the best solution by approximate methods, which mean you want to maximise metric A in order to get a solution which comes close to maximising metric B.