Solved – Theoretical motivation for using log-likelihood vs likelihood

bayesian · likelihood · probability

I'm trying to understand at a deeper level the ubiquity of the log-likelihood (and perhaps more generally the log-probability) in statistics and probability theory. Log-probabilities show up all over the place: we usually work with the log-likelihood for analysis (e.g. for maximization), the Fisher information is defined in terms of the second derivative of the log-likelihood, entropy is an expected log-probability, Kullback-Leibler divergence involves log-probabilities, the expected deviance is an expected log-likelihood, etc.

Now, I appreciate the many practical and convenient reasons. Many common and useful pdfs come from exponential families, which leads to elegantly simplified terms when log-transformed. Sums are easier to work with than products (especially for differentiation). Log-probabilities have a great floating-point advantage over raw probabilities. Log-transforming a pdf often converts a non-concave function into a concave one. But what is the theoretical reason/justification/motivation for log-probabilities?

As an example of my perplexity, consider the Fisher information (FI). The usual explanation for intuiting the FI is that the second derivative of the log-likelihood tells us how "peaked" the log-likelihood is: a highly peaked log-likelihood means the MLE is well-specified and we are relatively sure of its value, while a nearly flat log-likelihood (low curvature) means many different parameter values are nearly as good (in terms of the log-likelihood) as the MLE, so our MLE is more uncertain.

This is all well and good, but isn't it more natural to just look at the curvature of the likelihood function itself (NOT log-transformed)? At first glance the emphasis on the log transform seems arbitrary and wrong. Surely we are more interested in the curvature of the actual likelihood function. What was Fisher's motivation for working with the score function and the Hessian of the log-likelihood instead?

Is the answer simply that, in the end, we get nice results from the log-likelihood asymptotically? E.g., the Cramér–Rao bound and asymptotic normality of the MLE/posterior. Or is there a deeper reason?

Best Answer

It's really just a matter of convenience for the log-likelihood, nothing more.

I mean the convenience of sums vs. products: $\ln (\prod_i x_i) =\sum_i\ln x_i$. Sums are easier to deal with in many respects, such as differentiation or integration. What I'm trying to say is that this convenience is not limited to exponential families.

When you deal with a random sample, the likelihood has the form $\mathrm{L}=\prod_i p_i$, so the log-likelihood breaks this product into a sum, which is easier to manipulate and analyze. It also helps that all we care about is the location of the maximum, not the value at the maximum, so we can apply any monotone transformation such as the logarithm.
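Here is a minimal sketch of that point (my own illustration, with made-up data, not part of the original answer): for an i.i.d. normal sample the likelihood is a product of densities while the log-likelihood is a sum, and since the log is monotone both are maximized at the same parameter value.

% Minimal sketch: product-likelihood vs sum-of-log-likelihood for an i.i.d.
% normal sample with unknown mean mu (sigma fixed at 1); data are made up.
rng(1);                                   % reproducibility
x    = 3 + randn(50, 1);                  % sample of size 50, true mean 3
mu   = linspace(2, 4, 401);               % grid of candidate means
npdf = @(x, m) exp(-(x - m).^2 / 2) / sqrt(2*pi);   % N(m,1) density

L  = arrayfun(@(m) prod(npdf(x, m)), mu);      % likelihood: product of densities
lL = arrayfun(@(m) sum(log(npdf(x, m))), mu);  % log-likelihood: sum of logs

[~, i1] = max(L);  [~, i2] = max(lL);
fprintf('argmax L = %.3f, argmax logL = %.3f, sample mean = %.3f\n', ...
        mu(i1), mu(i2), mean(x));
% The log is monotone, so both criteria peak at the same grid point
% (essentially the sample mean). With a much larger sample, prod(...)
% underflows to 0 in double precision while the sum of logs stays well
% behaved.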

On the curvature intuition: in the end it's basically the same thing as the second derivative of the log-likelihood.

UPDATE: This is what I meant about the curvature. If you have a function $y=f(x)$, then its curvature is (see equation (14) on Wolfram): $$\kappa=\frac{f''(x)}{(1+f'(x)^2)^{3/2}}$$

The second derivative of the log-likelihood (differentiate $(\ln f)' = f'/f$ once more): $$A=(\ln f(x))''=\frac{f''(x)}{f(x)}-\left(\frac{f'(x)}{f(x)}\right)^2$$

At the point of the maximum, the first derivative is obviously zero, so we get: $$\kappa_{max}=f''(x_{max})=Af(x_{max})$$ Hence my quip that the curvature of the likelihood and the second derivative of the log-likelihood are, in a sense, the same thing.
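A quick numerical sanity check of this identity (my own addition, using the same bell-shaped curve $f(x)=a^2/(a^2+x^2)$ as in the demonstration below):

% Check that kappa_max = A * f(x_max) for the bell-shaped curve
% f(x) = a^2/(a^2 + x^2) used in the demonstration below (a is arbitrary).
a  = 1.5;
f  = @(x) a.^2 ./ (a.^2 + x.^2);
x0 = 0;  h = 1e-4;                        % maximum of f is at x0 = 0
fpp   = (f(x0+h) - 2*f(x0) + f(x0-h)) / h^2;               % f''(x_max)
kappa = fpp;                              % f'(x_max) = 0, so kappa = f''
A = (log(f(x0+h)) - 2*log(f(x0)) + log(f(x0-h))) / h^2;    % (log f)''(x_max)
fprintf('kappa_max = %.4f,  A*f(x_max) = %.4f\n', kappa, A*f(x0));
% Both print -0.8889, i.e. -2/a^2: the curvature at the peak equals the
% second derivative of the log-likelihood scaled by the likelihood value.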

On the other hand, if the first derivative of the likelihood is small not only at, but also around, the point of the maximum, i.e. the likelihood function is flat, then we get: $$\kappa\approx f''(x)\approx A f(x)$$ A flat likelihood is not a good thing for us, because it makes finding the maximum more difficult numerically, and the maximum likelihood is not that much better than other points around it, i.e. the parameter estimation errors are high.

And again, we still have the curvature and second derivative relationship. So why didn't Fisher look at the curvature of the likelihood function? I think it's for the same reason of convenience: it's easier to manipulate the log-likelihood because of sums instead of products, so he could study the curvature of the likelihood by analyzing the second derivative of the log-likelihood. Although the equation for the curvature, $\kappa_{max}=f''(x_{max})$, looks very simple, in actuality you're taking the second derivative of a product, which is messier than a sum of second derivatives.
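To spell out the "messier" point (my own algebra, not from the original answer): for a sample likelihood $L(\theta)=\prod_i p_i(\theta)$, the log turns the second derivative into a plain sum, $$\frac{d^2}{d\theta^2}\log L(\theta)=\sum_i\frac{d^2}{d\theta^2}\log p_i(\theta),$$ whereas differentiating the product directly gives $$\frac{d^2}{d\theta^2}\prod_i p_i(\theta)=\sum_i p_i''(\theta)\prod_{j\neq i}p_j(\theta)+\sum_i\sum_{j\neq i}p_i'(\theta)\,p_j'(\theta)\prod_{k\neq i,j}p_k(\theta).$$ Both carry the same information at the maximum, but only the first stays a simple sum as the sample grows.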

UPDATE 2:

Here's a demonstration. I drew a (completely made-up) likelihood function, its (a) curvature and (b) the 2nd derivative of its log. On the left side you see a narrow likelihood and on the right side a wide one. You can see how at the point of maximum likelihood (a) and (b) converge, as they should. More importantly, though, you can study the width (or flatness) of the likelihood function by examining the 2nd derivative of its log. As I wrote earlier, the latter is technically simpler to analyze than the former.

Not surprisingly, a shallower 2nd derivative of the log-likelihood signals a flatter likelihood function around its maximum, which is undesirable because it leads to larger parameter estimation errors.

[Figure: the made-up likelihood, its curvature, and the 2nd derivative of its log; left panel: "Narrow Likelihood" (a = 1), right panel: "Wide Likelihood" (a = 2)]

MATLAB code in case you want to reproduce the plots:

% Made-up bell-shaped "likelihood", its curvature, and the 2nd derivative
% of its log (all as functions of x, with width parameter a).
f    = @(x,a) a.^2./(a.^2 + x.^2);                                  % likelihood
c    = @(x,a) (-2*a.^2.*(a.^2 - 3*x.^2)./(a.^2 + x.^2).^3) ./ ...
              (4*a.^4.*x.^2./(a.^2 + x.^2).^4 + 1).^(3/2);          % curvature kappa
ll2d = @(x,a) 2*(x.^2 - a.^2)./(a.^2 + x.^2).^2;                    % (log f)''

h = 0.1;
x = -10:h:10;          % grid over the parameter

% narrow peak (a = 1)
figure
subplot(1,2,1)
a = 1;
y = f(x,a);
plot(x, y, 'LineWidth', 2)
hold on
plot(x, c(x,a), 'LineWidth', 2)        % curvature of the likelihood
plot(x, ll2d(x,a), 'LineWidth', 2)     % 2nd derivative of its log
title 'Narrow Likelihood'
ylim([-2 1])

% wide peak (a = 2)
subplot(1,2,2)
a = 2;
y = f(x,a);
plot(x, y, 'LineWidth', 2)
hold on
plot(x, c(x,a), 'LineWidth', 2)
plot(x, ll2d(x,a), 'LineWidth', 2)
title 'Wide Likelihood'
legend('likelihood','curvature','2nd derivative LogL','Location','best')
ylim([-2 1])

UPDATE 3:

In the code above I plugged an arbitrary bell-shaped function into the curvature equation, then calculated the second derivative of its log. I didn't re-scale anything; the values come straight from the equations, to show the equivalence I mentioned earlier.

Here's the very first paper on likelihood that Fisher published while still at university: "On an Absolute Criterion for Fitting Frequency Curves", Messenger of Mathematics, 41: 155–160 (1912).

As I was insisting all along, he doesn't mention any "deeper" connection of log-probabilities to entropy and other fancy subjects, nor does he introduce his information measure yet. He simply writes the equation $\log P'=\sum_1^n\log p$ on p.54, then proceeds to talk about maximizing the probabilities. In my opinion, this shows that he was using the logarithm simply as a convenient method of analyzing the joint probabilities themselves. It is especially useful in continuous curve fitting, for which he gives an obvious formula on p.55: $$\log P=\int_{-\infty}^\infty\log f\,dx$$ Good luck analyzing this likelihood (or probability $P$, as Fisher calls it) without the log!

One thing to note when reading the paper: he was only starting his work on maximum likelihood estimation and did more of it over the subsequent 10 years, so even the term MLE hadn't been coined yet, as far as I know.
