Solved – Confusion about concept of likelihood vs. probability

Tags: definition, likelihood, probability

I've been recently trying to wrap my head around the concept of likelihood, and have made some good progress, but there is one thing that is bugging me, and I think this issue is what makes the concept so tricky to grasp for others.

I've read a few threads here, in particular this one and this one.

I'll first verify my understanding of the concept (ignoring the distinction between discrete and continuous cases), and then attempt to articulate my issue/confusion:

Most of the discussion about likelihood occurs in the context of a random variable $X$, whose distribution is modeled by a function, with a set of parameters ($\theta$).

We can then conceptualize the conditional probability of any particular set of observations $x$ as $P(x|\theta)$.

So far, so good.

Here's where the crucial move occurs:

Likelihood is introduced as a concept that is conditional upon $x$, and is defined as $L(\theta|x) = P(x|\theta)$, and is thus a function of $\theta$.

There is logically nothing wrong with this move. Likelihood is simply an "inverse" concept with respect to conditional probability.

However, there seems to be something of a disingenuous sleight of hand here: on a purely colloquial level, likelihood, i.e. how likely something is, is about as far away from an inverse concept of probability (i.e. how probable something is), as can be. They're synonyms!

Let's see how this plays out if we take the definition of likelihood seriously. I will not frame the following example in terms of random variables and parameters, but simply in terms of probabilities and conditions (after all, it seems to me that the fundamental meaning of likelihood lies in how it inverts the probability and the condition).

Suppose that, on any given day, there is a 20% chance that Graham uses an umbrella if it rains. That is, given the condition of rain, the probability of Graham using an umbrella is 0.2.

Now suppose, on a particular day, we observe Graham using an umbrella. According to the above definition, the likelihood of there being rain, given our observation, is 0.2.

Importantly, even if it rains every day, the likelihood of there being rain, given the observation of an umbrella, is still 0.2, since it's equivalent to the probability of Graham using an umbrella given the event of rain.

It seems to be an abuse of language to refer to the likelihood of rain to be 0.2, as the mind almost inevitably interprets this to be "it is unlikely to be raining, given Graham's umbrella usage" (which indeed is the inverse fallacy).
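The gap between the two quantities can be made concrete with Bayes' rule. A minimal sketch, taking the extreme case above where it rains every day (prior $P(\text{rain}) = 1$); the value for $P(\text{umbrella}|\text{no rain})$ is irrelevant here because the prior on "no rain" is zero:

```python
# Bayes' rule for the umbrella example, assuming it rains every day.
p_umb_given_rain = 0.2   # given in the question
p_rain = 1.0             # "even if it rains every day"
p_umb_given_dry = 0.0    # irrelevant here, since P(no rain) = 0

p_umb = p_umb_given_rain * p_rain + p_umb_given_dry * (1 - p_rain)
posterior = p_umb_given_rain * p_rain / p_umb
likelihood = p_umb_given_rain  # L(rain | umbrella) = P(umbrella | rain)

print(likelihood)  # 0.2 -- unchanged by the prior
print(posterior)   # 1.0 -- the probability of rain given the umbrella
```

The likelihood stays at 0.2 no matter what, while the probability of rain given the umbrella is 1 — which is exactly why reading 0.2 as "the chance it is raining" misleads.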

Thylacoleo gave an interesting definition of likelihood:

Likelihood is a measure of the extent to which a sample provides support for particular values of a parameter in a parametric model.

In the rain and umbrella example, this would mean that, on a scale from 0 to 10, the fact that Graham is using an umbrella provides about 2 units of support for the hypothesis that it is raining.

But if you really think about it, and cut through the semantic acrobatics here, this seems equivalent to saying:

Given that Graham is using an umbrella, there is a 20% chance that it is raining.

And that is of course the inverse fallacy.

I have a feeling I'm missing something. Perhaps it has to do with the fact that I haven't defined the probability of Graham using an umbrella when it doesn't rain, and this missing information is responsible for the apparent paradox that arises when I try to use Thylacoleo's definition.

Let's see what happens when I try to fill that gap.

Suppose there is a 20% chance of Graham using an umbrella if it rains, and a 0% chance of him using an umbrella if it doesn't rain.

We observe him using an umbrella, and, again, state that the likelihood of there being rain, given this observation, is 0.2.

Applying Thylacoleo's definition:

On a scale from 0 to 10, the fact that Graham is using an umbrella provides about 2 units of support for the hypothesis that it is raining.

Again, isn't the above paragraph simply the inverse fallacy, paraphrased in plain English?
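To make the gap explicit, here is a quick Bayes' rule computation for this filled-in example (the prior $P(\text{rain}) = 0.5$ is an arbitrary illustrative choice; any prior strictly between 0 and 1 gives the same posterior here):

```python
# Filled-in umbrella example: P(umbrella | rain) = 0.2,
# P(umbrella | no rain) = 0. Prior P(rain) = 0.5 is an assumed value.
p_umb_rain, p_umb_dry, p_rain = 0.2, 0.0, 0.5

posterior = (p_umb_rain * p_rain) / (
    p_umb_rain * p_rain + p_umb_dry * (1 - p_rain)
)

print(posterior)   # 1.0: seeing the umbrella makes rain certain
print(p_umb_rain)  # 0.2: the likelihood of rain is still 0.2
```

So the posterior probability of rain is 1, while the likelihood is still 0.2 — the two numbers are plainly not the same quantity.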

Best Answer

As @glen_b pointed out, likelihood is not an inverse probability, as $\theta$ is not a random variable. However, you are correct in that it is a measure of evidential support. One caveat is that, unlike probability, it is not an absolute measure of support (a likelihood of 1, 10, or 1000 has no intrinsic meaning), but a relative measure of support. Generally, this is encoded by forming the likelihood ratio (LR):

$$ LR(\theta) := \frac{L(\theta;x)}{L(\hat{\theta}_{\mathrm{MLE}};x)} $$
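As a concrete sketch (the data here are illustrative, not from the question): for a binomial sample of 7 heads in 10 flips, the likelihood ratio normalizes the likelihood curve by its maximum at the MLE $\hat\theta = 0.7$:

```python
# Normalized likelihood (LR) for a binomial sample: 7 heads in 10 flips.
n, k = 10, 7

def lik(theta):
    # Likelihood up to a constant: theta^k * (1 - theta)^(n - k)
    return theta ** k * (1 - theta) ** (n - k)

theta_mle = k / n  # the MLE maximizes lik, so LR(theta_mle) = 1

for theta in (0.3, 0.5, 0.7):
    lr = lik(theta) / lik(theta_mle)
    print(theta, round(lr, 3))  # e.g. LR(0.5) is roughly 0.44
```

The raw values of `lik` mean nothing by themselves; only the ratios between candidate values of $\theta$ carry information, which is the "relative measure of support" point above.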

This ratio always lies between 0 and 1. It is an improvement over the unnormalized likelihood, but we still aren't quite there. It turns out that, for example, $LR = 0.15$ is not by itself a useful measure either, since its interpretation depends on the dimension of $\theta$. If $\theta$ is a scalar, then $P(LR < 0.15) \to 0.05$ as $n \to \infty$ (a consequence of Wilks' theorem), so the LR can be used in a probabilistic framework in much the same way as any other test statistic.
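That calibration can be checked numerically. By Wilks' theorem, $-2\log LR$ is asymptotically chi-square with 1 df for a scalar parameter, and the 1-df chi-square survival function reduces to $1 - \mathrm{erf}(\sqrt{x/2})$, so the standard library suffices:

```python
import math

def chi2_sf_1df(x):
    # Survival function of chi-square with 1 df:
    # P(X > x) = 2 * (1 - Phi(sqrt(x))) = 1 - erf(sqrt(x / 2))
    return 1.0 - math.erf(math.sqrt(x / 2.0))

# Asymptotic P(LR < 0.15) for a scalar parameter, via Wilks' theorem
tail = chi2_sf_1df(-2.0 * math.log(0.15))
print(round(tail, 3))  # approximately 0.051, i.e. roughly the 5% level
```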

However, it can also be used as a purely subjective measure of which parameter values we consider "plausible" given the data (read: evidence). Under this non-probabilistic interpretation, we would say that any scalar $\theta$ resulting in $LR < 0.15$ is "implausible" or "unlikely". Now, what if we wanted to port this same subjective assessment to a vector parameter, say $(\theta_1,\theta_2)$? Unfortunately, we cannot continue to use $0.15$ as our cutoff for "unlikely". (Well, of course you can, but then your inferences in the higher dimension will not be compatible with inferences in the lower dimension. This is a subtle point; a good article on it was written by one of the strongest proponents of likelihood inference, JK Lindsey. See here.) Essentially, compatible inference can be implemented by raising the scalar likelihood cutoff to the power of the parameter's dimension. For example, if our parameter has dimension 2, then a cutoff compatible with $0.15$ would be $0.15^2$.
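A quick numeric illustration of why the cutoff must change with dimension. Via Wilks' theorem, $-2\log LR$ is asymptotically chi-square with df equal to $\dim(\theta)$, and for 2 df the chi-square survival function is simply $e^{-x/2}$ (stdlib only; no scipy needed):

```python
import math

# How the same LR cutoff calibrates differently across dimensions.
c = 0.15                 # the scalar "unlikely" cutoff from above
x = -2.0 * math.log(c)   # value of -2 log LR at the cutoff

tail_1d = 1.0 - math.erf(math.sqrt(x / 2.0))  # chi-square(1) survival
tail_2d = math.exp(-x / 2.0)                  # chi-square(2) survival

print(round(tail_1d, 3))  # roughly 0.051: near the 5% level for scalar theta
print(round(tail_2d, 3))  # 0.15: the same cutoff is far looser in 2-d
print(round(c ** 2, 4))   # 0.0225: the dimension-compatible cutoff from the text
```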


The above is a very abridged description of modern likelihood. I think your confusion is shown by the following statement you made:

Given that Graham is using an umbrella, there is a 20% chance that it is raining.

This is actually not what a 20% likelihood would tell you. What you stated above is a Bayesian posterior probability, $P(\textrm{Raining}|\textrm{Umbrella})$; what the likelihood is saying is quite the opposite:

$$L(\textrm{Raining}|\textrm{Umbrella}) = P(\textrm{Umbrella}|\textrm{Raining})$$

As you correctly pointed out, a prior probability (and a normalizing constant) is required to turn a likelihood into a probability.
