Solved – Advantages of taking the logarithm when minimizing the negative likelihood

gradient descent, logarithm, machine learning, maximum likelihood, optimization

In regression/classification problems, we are often interested in minimizing a cost function with respect to the parameters of the model. In many cases, the cost function is the negative likelihood. To minimize it, it is standard to minimize the negative log-likelihood instead. Since the log is monotonically increasing, the two functions have the same minimizers, so the final result of the minimization is the same. Taking the log has the advantage of reducing numerical problems, as it transforms products into sums, but are there any other advantages?
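To make the numerical point concrete, here is a minimal NumPy sketch (an arbitrary toy setup with 1,000 standard normal observations): the product of the density values underflows to zero in double precision, while the sum of their logs is perfectly well behaved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: 1,000 i.i.d. draws from a standard normal.
x = rng.normal(size=1000)

# Every density value is well below 1, so the product of all
# contributions underflows to 0.0 in double precision ...
densities = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
likelihood = np.prod(densities)             # 0.0 (underflow)

# ... while the sum of the logs stays easily representable.
log_likelihood = np.sum(np.log(densities))  # roughly -1.4e3

print(likelihood, log_likelihood)
```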

Question: If we were not concerned by any numerical instabilities, would gradient descent work better when minimizing the negative log-likelihood than when minimizing the negative likelihood?

Obviously the gradient steps are different. If $f$ is the negative likelihood and $g = -\log(-f)$ the negative log-likelihood, the two steps would be:

$$\Delta w = -\lambda \frac{df(w)}{dw}$$

$$\Delta w = \lambda \frac{df(w)}{dw}\times\frac{1}{f(w)}$$
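As a rough illustration of how the two updates compare, here is a toy sketch (made-up Bernoulli data, an arbitrary learning rate, finite-difference gradients): the log-likelihood step is the likelihood step rescaled by $1/(-f(w))$, i.e. divided by the typically tiny likelihood value, which is also why the raw-likelihood gradient tends to be vanishingly small.

```python
import numpy as np

# Hypothetical toy data: coin flips modelled as Bernoulli(w).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def neg_likelihood(w):
    # f(w) = -prod_i w^{x_i} (1 - w)^{1 - x_i}
    return -np.prod(w**x * (1 - w)**(1 - x))

def neg_log_likelihood(w):
    # g(w) = -log(-f(w)) = -sum_i [x_i log(w) + (1 - x_i) log(1 - w)]
    return -np.sum(x * np.log(w) + (1 - x) * np.log(1 - w))

def num_grad(fun, w, eps=1e-6):
    # Central finite difference, good enough for a 1-D illustration.
    return (fun(w + eps) - fun(w - eps)) / (2 * eps)

w, lam = 0.5, 0.1
step_f = -lam * num_grad(neg_likelihood, w)      # step on the negative likelihood
step_g = -lam * num_grad(neg_log_likelihood, w)  # step on the negative log-likelihood

# step_g equals step_f divided by -f(w), i.e. by the (tiny) likelihood
# value, so it is far larger in magnitude while pointing the same way.
print(step_f, step_g, step_f / -neg_likelihood(w))
```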

Best Answer

Numerical stability is by far the most important reason for using the log-likelihood instead of the likelihood. That reason alone is more than enough to choose the log-likelihood over the likelihood. Another reason that jumps to mind is that if there is an analytical solution then it is often much easier to find with the log-likelihood.
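As a standard textbook illustration of that point (a generic example, not tied to any particular model in the question): for $n$ i.i.d. Bernoulli($p$) observations with $k$ successes, the log-likelihood gives the closed-form estimator in one line,

$$\ell(p) = k\log p + (n-k)\log(1-p), \qquad \frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Longrightarrow\; \hat{p} = \frac{k}{n}.$$

Differentiating the raw likelihood $p^{k}(1-p)^{n-k}$ reaches the same answer, but only after applying the product rule and factoring the result.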

The likelihood function is typically a product of likelihood contributions by each observation. Taking the derivative of that will quickly lead to an unmanageable number of cross-product terms due to the product rule. In principle it is possible, but I don't want to be the person to keep track of all those terms.

The log-likelihood transforms that product of individual contributions to a sum of contributions, which is much more manageable due to the sum rule.
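Written out for $n$ observation-level contributions $L_i(\theta)$, the contrast is

$$\frac{d}{d\theta}\prod_{i=1}^{n} L_i(\theta) = \sum_{j=1}^{n} \frac{dL_j(\theta)}{d\theta}\prod_{i\neq j} L_i(\theta)$$

versus

$$\frac{d}{d\theta}\sum_{i=1}^{n}\log L_i(\theta) = \sum_{i=1}^{n} \frac{1}{L_i(\theta)}\frac{dL_i(\theta)}{d\theta}.$$

The first expression has $n$ terms, each itself a product of $n-1$ factors; the second is a plain sum of per-observation terms.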
