Understanding the Likelihood

Tags: maximum-likelihood, probability, probability-distributions, probability-theory, statistics

I am trying to understand the meaning of the likelihood function, or of the value of the likelihood itself. Let us suppose we are tossing coins and we have a sample of data given as

$$\hat{x} = \{H,H,T,T,T,T\}$$

In this case the likelihood function can be written as

$$L(\theta|\hat{x}) = C(6,2)\theta^2(1-\theta)^4 = 15\theta^2(1-\theta)^4$$

The function looks like this

[Figure: plot of $L(\theta|\hat{x}) = 15\theta^2(1-\theta)^4$ over $\theta \in [0,1]$, with its peak at $\theta = 1/3$.]

Now, my question is: what does this graph tell us? We can see that $L(\theta|\hat{x})$ is maximized at $\theta \approx 0.333$.
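As a sanity check, here is a short numerical sketch (using NumPy; the grid resolution is an arbitrary choice) that evaluates the likelihood on a grid and reports its maximizer:

```python
import numpy as np

# Likelihood of the sequence {H,H,T,T,T,T}:
# L(theta | x) = C(6,2) * theta^2 * (1 - theta)^4
theta = np.linspace(0.0, 1.0, 1001)
L = 15 * theta**2 * (1 - theta) ** 4

# Grid maximizer; should be close to 1/3
print(theta[np.argmax(L)])  # ~0.333
```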

In general it seems to me that the likelihood is a way of obtaining the parameters that best describe the distribution of the given data. If we change the parameters that describe the distribution, we change the likelihood.

We know that for a fair coin $\theta_{True} = 0.5$. So my question is: does the likelihood mean that, for this data ($\hat{x}$), the best parameter describing the fairness of the coin is $0.3333$ (i.e., where the likelihood is maximized)? We know that if the coin is truly fair, a large sample will recover the true parameter (i.e., as $N \rightarrow \infty$, $\hat{\theta} \rightarrow \theta_{True}$).

In general, I am trying to understand the meaning of the likelihood. Is the likelihood a way to obtain the best $\theta$ parameter describing the distribution for the given data, even though, depending on the data, that $\theta$ may not be the true one (i.e., $\hat{\theta} \neq \theta_{True}$ for the given data)?
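To check the large-sample claim, I tried a small simulation sketch (the seed and the sample sizes are arbitrary choices). Since the likelihood peaks at the observed fraction of heads, I simply track that fraction as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.5  # a truly fair coin

# The likelihood is maximized at the observed fraction of heads,
# so watch that fraction approach theta_true as N grows.
for n in [6, 100, 10_000, 1_000_000]:
    tosses = rng.binomial(1, theta_true, size=n)  # 1 = heads, 0 = tails
    print(n, tosses.mean())
```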

(For a Gaussian, $\theta = \{\mu,\sigma\}$; for a Poisson distribution, $\theta = \lambda$; and for a binomial distribution, $\theta = p$, the probability of success in one trial.)

Best Answer

Generally speaking, the aim of statistical inference is to construct a model that best describes the properties of some data. The true underlying model is generally unknown; otherwise we would not have to do this job at all.

In the parametric approach, the likelihood function $L(\theta\mid\vec{x})$ measures how plausible a parameter value $\theta$ is given the data set $\vec{x}=(x_1,\ldots,x_n)$. For instance, in your setting, you assumed each toss $X_i\sim\operatorname{Bernoulli}(\theta)$ (coding $H$ as $1$ and $T$ as $0$). Based on this assumption, the likelihood function is constructed as follows: $$L(\theta\mid\vec{x})=\prod_{i=1}^{n}{\mathbb{P}(x_i\mid\theta)}=\theta^{x_1+\cdots+x_n}(1-\theta)^{n-(x_1+\cdots+x_n)}.$$ For your data set, the likelihood function is $$L(\theta\mid\vec{x})=\theta^2(1-\theta)^4.$$ (Your version carries the extra factor $C(6,2)=15$ because it counts every ordering of two heads and four tails; a constant factor changes nothing in what follows.)

Based on this likelihood function, we then look for the value of $\theta$, say $\hat{\theta}$, that maximizes it; this is precisely the definition of the MLE of $\theta$. The principle follows from the intuition that the larger $L(\theta\mid\vec{x})$ is, the better such a $\theta$ fits the data. A short calculation (see below) shows that the MLE of $\theta$ is given by $$\hat{\theta}=\bar{x}=\frac{x_1+\cdots+x_n}{n}.$$ In your example, the value is $\hat{\theta}=2/6=1/3$. This means that, based on the data set $\{H,H,T,T,T,T\}$ and the Bernoulli assumption, the most plausible estimate of $\theta$ is $1/3$. That is all this inference procedure does.
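For completeness, here is the calculation: maximize the log-likelihood by setting its derivative to zero (writing $s=x_1+\cdots+x_n$):

$$\ell(\theta)=\log L(\theta\mid\vec{x})=s\log\theta+(n-s)\log(1-\theta),\qquad \ell'(\theta)=\frac{s}{\theta}-\frac{n-s}{1-\theta}=0\;\Longrightarrow\;\hat{\theta}=\frac{s}{n}.$$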

However, the actual parameter may not be $1/3$: tossing a fair coin ($\theta=1/2$) can also yield this sequence. Yet, as you can see from the likelihood function $L(\theta\mid\vec{x})=\theta^2(1-\theta)^4$, $$L(1/2\mid\vec{x})<L(1/3\mid\vec{x}).$$ We naturally prefer $1/3$ to $1/2$ because we know nothing about the true parameter during the inference. If the coin is in fact fair, we can only say that the MLE is inaccurate here, and that is all. In other words, the MLE is not always the best choice for us.
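Concretely, plugging the two values into the likelihood:

$$L(1/2\mid\vec{x})=\left(\tfrac{1}{2}\right)^2\left(\tfrac{1}{2}\right)^4=\tfrac{1}{64}\approx 0.0156,\qquad L(1/3\mid\vec{x})=\left(\tfrac{1}{3}\right)^2\left(\tfrac{2}{3}\right)^4=\tfrac{16}{729}\approx 0.0219.$$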

Let me stress this again: the true parameter has nothing to do with this inference, so it makes no sense to plug the true parameter into the likelihood function and compare it with other values. If you knew the true parameter, why bother inferring it?

That is also why your statement "If we change the parameters that describe the distribution, we change the likelihood" is wrong. Referring to the setup above, the likelihood function does not change unless

i) data set changes or

ii) the model assumption (i.e., the underlying distribution) is altered.

As for the value of the likelihood function itself, we only care about its maximizer; in particular, multiplying the likelihood by a constant such as $C(6,2)$ changes its values but not its maximizer.
