You already know a lot; let me add two observations.
First, take linear regression. Minimizing the squared error turns out to be equivalent to maximizing the likelihood when the errors are assumed to be normally distributed. Loosely speaking, minimizing the squared error is the intuitive method, while maximizing the likelihood is a more formal approach that permits proofs using, for example, properties of the normal distribution. The two outcomes can coincide.
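To make that concrete, here is a small sketch (Python, with simulated data; `numpy` and `scipy` assumed available) showing that, with normal errors of known variance, the least-squares fit and the maximum-likelihood fit coincide:

```python
# Sketch (not from the original answer): with normal errors of known
# variance, least squares and maximum likelihood give the same fit.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Least squares via the closed-form polynomial fit
b_ls, a_ls = np.polyfit(x, y, 1)   # returns [slope, intercept]

# Maximum likelihood: maximize the Gaussian log-likelihood,
# i.e. minimize its negative
def neg_loglik(params):
    a, b = params
    return -np.sum(norm.logpdf(y, loc=a + b * x, scale=1.0))

res = minimize(neg_loglik, x0=[0.0, 0.0])
a_ml, b_ml = res.x

print(a_ls, b_ls)
print(a_ml, b_ml)   # essentially identical to the least-squares fit
```

The negative log-likelihood here is, up to a constant, exactly half the sum of squared residuals, so both routes find the same minimizer.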
Second, the choice between minimizing and maximizing is, as far as I know, largely arbitrary: minimizing the negative of a function is the same as maximizing the function itself. That so many optimization routines are written in minimization mode is essentially convention; for reasons of parsimony and readability it has become the standard.
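A minimal sketch of that second point, assuming `scipy` is available: to maximize a function with a routine written in minimization mode, you simply hand it the negative.

```python
# Minimal sketch: maximizing f is the same as minimizing -f.
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-(x - 3.0) ** 2)   # maximum at x = 3

res = minimize_scalar(lambda x: -f(x))  # standard routines minimize,
                                        # so we pass them the negative
print(res.x)  # close to 3.0
```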
As mentioned in the comments, the reason is that the cost functions in question might not have any zeroes at all, in which case Newton's method, which searches for a zero, will fail to locate the minimum.
I have created a visualization to show this:
As you can see, the method is not converging at all for this particular case.
The code used to create this is stored here.
For convenience, the relevant portion of the code is reproduced below:
newton.m:
% Dummy statement to avoid writing function in the first line and making it a 'function file' instead of a 'script file'
1;
% The function to find zeroes of.
% The function is specifically chosen to not have any zeroes
% so as to show the weakness of Newton's method.
function y = f(x)
y = (x - 5).^2 + 5;
endfunction
% The derivative of f(x)
function y = fd(x)
y = 2 * (x - 5);
endfunction
% Initial guess
x0 = 1.5;
% Max number of iterations
itermax = 20;
% Convergence measure, initialized above the tolerance so the loop runs at least once
% (note: this shadows Octave's built-in `eps`)
eps = 1;
% A vector for storing the history of the approximate roots
xvals = x0;
% Number of iterations done
itercount = 0;
% Required for plotting f(x) vs x
x = linspace(0, 10, 100);
% Create a figure whose output is not rendered on the screen
% Not working currently; supposedly a bug in Octave
% A workaround is to use gnuplot instead of qt - `graphics_toolkit gnuplot`
% but this is very slow.
% Uncomment the following to activate the feature once the bug is fixed
% figure('Visible','off');
% The main loop
while eps >= 1e-5 && itercount <= itermax
% x1 = New value of root
% x0 = Current value of root
x1 = x0 - f(x0) / fd(x0);
% Plot f(x)
% Plot the tangent at x0: the line through [x0, f(x0)] and [x1, 0]
% Plot a vertical line from [x1, 0] to [x1, f(x1)]
% Plot a vertical line from [x0, 0] to [x0, f(x0)]
plot(x, f(x), ";f(x);", [x0 x1], [f(x0) 0], "-r;tangent;", [x1 x1], [0 f(x1)], ":r", [x0 x0], [0 f(x0)], ":r");
title('f(x) = (x-5)^2 + 5');
% Set limits for the axes shown in the plots
xlim([0 10]);
ylim([0 30]);
% Label the two consecutive approximations on the X-axis
text(x0, -2, sprintf('x%d', itercount), 'color', 'red');
text(x1, -2, sprintf('x%d', itercount+1), 'color', 'red');
% Print the plot to a file
filename = sprintf('output/%05d.jpg', itercount);
print(filename)
% Append the new approximation to the history vector
xvals = [xvals; x1];
% Calculate the epsilon value
eps = abs(x1-x0);
x0 = x1;
itercount = itercount+1;
end
% Print the result of the iteration
xvals
f_zero = f(xvals(end))
eps
itercount
Best Answer
Numerical stability is by far the most important reason for using the log-likelihood instead of the likelihood. That reason alone is more than enough to choose the log-likelihood over the likelihood. Another reason that jumps to mind is that if there is an analytical solution then it is often much easier to find with the log-likelihood.
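A quick sketch of the numerical-stability point (Python, simulated data; `numpy` and `scipy` assumed): the raw likelihood of even a moderately sized sample underflows to zero in double precision, while the log-likelihood remains a perfectly ordinary number.

```python
# Sketch: the likelihood of 1000 observations is a product of 1000
# values each well below 1, which underflows double precision;
# the log-likelihood is the corresponding sum and stays finite.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

likelihood = np.prod(norm.pdf(data))   # underflows to 0.0
loglik = np.sum(norm.logpdf(data))     # a finite, usable number

print(likelihood)  # 0.0
print(loglik)
```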
The likelihood function is typically a product of likelihood contributions by each observation. Taking the derivative of that will quickly lead to an unmanageable number of cross-product terms due to the product rule. In principle it is possible, but I don't want to be the person to keep track of all those terms.
The log-likelihood transforms that product of individual contributions to a sum of contributions, which is much more manageable due to the sum rule.
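As a sketch of how the sum form makes analytical work easier (Python, simulated data): for a normal model with known variance, the score is a sum of per-observation terms, and setting that sum to zero yields the sample mean in closed form; a numerical maximizer of the log-likelihood agrees.

```python
# Sketch: the derivative of sum_i log f(x_i; mu) with respect to mu is
# sum_i (x_i - mu), and setting it to zero gives mu = mean(x).
# A numerical maximizer of the log-likelihood recovers the same value.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, size=200)

neg_loglik = lambda mu: -np.sum(norm.logpdf(data, loc=mu))
mu_hat = minimize_scalar(neg_loglik).x

print(mu_hat, data.mean())  # the two agree
```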