You already know a lot; let me add two observations.
First, take linear regression. Minimizing the squared error turns out to be equivalent to maximizing the likelihood when the errors are assumed to be normally distributed. Loosely speaking, minimizing the squared error is the intuitive method, while maximizing the likelihood is a more formal approach that permits proofs using, for example, properties of the normal distribution. The two outcomes can coincide.
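To make that concrete, here is a small sketch (Python, with simulated data; `numpy` and `scipy` assumed available) showing that, with normal errors of known variance, the least-squares fit and the maximum-likelihood fit coincide:

```python
# Sketch (not from the original answer): with normal errors of known
# variance, least squares and maximum likelihood give the same fit.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Least squares via the closed-form polynomial fit
b_ls, a_ls = np.polyfit(x, y, 1)   # returns [slope, intercept]

# Maximum likelihood: maximize the Gaussian log-likelihood,
# i.e. minimize its negative
def neg_loglik(params):
    a, b = params
    return -np.sum(norm.logpdf(y, loc=a + b * x, scale=1.0))

res = minimize(neg_loglik, x0=[0.0, 0.0])
a_ml, b_ml = res.x

print(a_ls, b_ls)
print(a_ml, b_ml)   # essentially identical to the least-squares fit
```

The negative log-likelihood here is, up to a constant, exactly half the sum of squared residuals, so both routes find the same minimizer.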
Second, the choice between minimizing and maximizing is, as far as I know, largely arbitrary: minimizing the negative of a function is the same as maximizing the function itself. That so many optimization routines are written in minimization mode is essentially convention; for reasons of parsimony and readability it has become the standard.
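A minimal sketch of that second point, assuming `scipy` is available: to maximize a function with a routine written in minimization mode, you simply hand it the negative.

```python
# Minimal sketch: maximizing f is the same as minimizing -f.
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-(x - 3.0) ** 2)   # maximum at x = 3

res = minimize_scalar(lambda x: -f(x))  # standard routines minimize,
                                        # so we pass them the negative
print(res.x)  # close to 3.0
```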
As mentioned in the comments, the reason is that the cost functions in question might not have any zeroes at all, in which case Newton's method, which searches for a zero, will fail to locate the minimum.
I have created a visualization to show this:
As you can see, the method is not converging at all for this particular case.
The code used to create this is stored here.
For convenience, the relevant portion of the code is reproduced below:
newton.m:
% Dummy statement to avoid writing function in the first line and making it a 'function file' instead of a 'script file'
1;
% The function to find zeroes of.
% The function is specifically chosen to not have any zeroes
% so as to show the weakness of Newton's method.
function y = f(x)
y = (x - 5).^2 + 5;
endfunction
% The derivative of f(x)
function y = fd(x)
y = 2 * (x - 5);
endfunction
% Initial guess
x0 = 1.5;
% Max number of iterations
itermax = 20;
% Convergence measure, initialized above the tolerance so the loop runs at least once
% (note: this shadows Octave's built-in `eps`)
eps = 1;
% A vector for storing the history of the approximate roots
xvals = x0;
% Number of iterations done
itercount = 0;
% Required for plotting f(x) vs x
x = linspace(0, 10, 100);
% Create a figure whose output is not rendered on the screen
% Not working currently; supposedly a bug in Octave
% A workaround is to use gnuplot instead of qt - `graphics_toolkit gnuplot`
% but this is very slow.
% Uncomment the following to activate the feature once the bug is fixed
% figure('Visible','off');
% The main loop
while eps >= 1e-5 && itercount <= itermax
% x1 = New value of root
% x0 = Current value of root
x1 = x0 - f(x0) / fd(x0);
% Plot f(x)
% Plot the tangent at x0: the line through [x0, f(x0)] and [x1, 0]
% Plot a vertical line from [x1, 0] to [x1, f(x1)]
% Plot a vertical line from [x0, 0] to [x0, f(x0)]
plot(x, f(x), ";f(x);", [x0 x1], [f(x0) 0], "-r;tangent;", [x1 x1], [0 f(x1)], ":r", [x0 x0], [0 f(x0)], ":r");
title('f(x) = (x-5)^2 + 5');
% Set limits for the axes shown in the plots
xlim([0 10]);
ylim([0 30]);
% Label the two consecutive approximations on the X-axis
text(x0, -2, sprintf('x%d', itercount), 'color', 'red');
text(x1, -2, sprintf('x%d', itercount+1), 'color', 'red');
% Print the plot to a file
filename = sprintf('output/%05d.jpg', itercount);
print(filename)
% Append the new approximation to the history vector
xvals = [xvals; x1];
% Calculate the epsilon value
eps = abs(x1-x0);
x0 = x1;
itercount = itercount+1;
end
% Print the result of the iteration
xvals
f_zero = f(xvals(end))
eps
itercount
Best Answer
Numerical stability is by far the most important reason for using the log-likelihood instead of the likelihood. That reason alone is more than enough to choose the log-likelihood over the likelihood. Another reason that jumps to mind is that if there is an analytical solution then it is often much easier to find with the log-likelihood.
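A quick sketch of the numerical-stability point (Python, simulated data; `numpy` and `scipy` assumed): the raw likelihood of even a moderately sized sample underflows to zero in double precision, while the log-likelihood remains a perfectly ordinary number.

```python
# Sketch: the likelihood of 1000 observations is a product of 1000
# values each well below 1, which underflows double precision;
# the log-likelihood is the corresponding sum and stays finite.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

likelihood = np.prod(norm.pdf(data))   # underflows to 0.0
loglik = np.sum(norm.logpdf(data))     # a finite, usable number

print(likelihood)  # 0.0
print(loglik)
```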
The likelihood function is typically a product of likelihood contributions by each observation. Taking the derivative of that will quickly lead to an unmanageable number of cross-product terms due to the product rule. In principle it is possible, but I don't want to be the person to keep track of all those terms.
The log-likelihood transforms that product of individual contributions to a sum of contributions, which is much more manageable due to the sum rule.
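As a sketch of how the sum form makes analytical work easier (Python, simulated data): for a normal model with known variance, the score is a sum of per-observation terms, and setting that sum to zero yields the sample mean in closed form; a numerical maximizer of the log-likelihood agrees.

```python
# Sketch: the derivative of sum_i log f(x_i; mu) with respect to mu is
# sum_i (x_i - mu), and setting it to zero gives mu = mean(x).
# A numerical maximizer of the log-likelihood recovers the same value.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, size=200)

neg_loglik = lambda mu: -np.sum(norm.logpdf(data, loc=mu))
mu_hat = minimize_scalar(neg_loglik).x

print(mu_hat, data.mean())  # the two agree
```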