Solved – Likelihood in Bayes' theorem and the likelihood function

likelihood, maximum likelihood

Bayesian inference is very confusing. I want to clarify the notation in Bayes' theorem and the use of the likelihood function (or is it not one?) as a part of it.

From my early frequentist days, I thought I understood the concept of maximum likelihood quite clearly.

Given some $X$ distributed with parameter $\theta$ (or, as I like to call it, an "acting" known value), the pdf can be written as $f(x \mid \theta)$. This can be read as $f$ being a function of $x$. The same expression $f(x \mid \theta)$ can be re-interpreted as $l(\theta \mid x)$ when we want to find the maximum likelihood estimate: $l(\theta \mid x)$ is a function of $\theta$, as opposed to $x$, and hence we can do things like differentiate with respect to $\theta$, because now $\theta$ is the variable.

I want to draw a parallel to the likelihood function used inside Bayes' theorem. Consider Bayes' theorem as below:
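
For concreteness, here is a small numerical sketch of the two readings as I picture them, assuming a Binomial model with made-up numbers (the specific values are purely illustrative):

```python
import numpy as np
from scipy import stats

n = 10  # number of Bernoulli trials (made-up for illustration)

# Reading 1: f(x | theta) as a function of x, with theta held fixed
theta_fixed = 0.3
x_values = np.arange(n + 1)
pmf_over_x = stats.binom.pmf(x_values, n, theta_fixed)  # sums to 1 over x

# Reading 2: l(theta | x) as a function of theta, with the observed x held fixed
x_observed = 7
theta_grid = np.linspace(0.001, 0.999, 999)
likelihood_over_theta = stats.binom.pmf(x_observed, n, theta_grid)

# Maximising over theta gives the MLE x/n, because theta is now the variable
theta_hat = theta_grid[np.argmax(likelihood_over_theta)]
print(pmf_over_x.sum())  # ~1.0: a genuine probability distribution in x
print(theta_hat)         # ~0.7 = x_observed / n: not a distribution in theta
```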

$$\pi(\theta \mid x)=\frac{l(x \mid \theta)\,\pi(\theta)}{l(x)}$$

The likelihood function in this case is $l(x \mid \theta)$. My question is: is the likelihood the same in both cases? If so, why, in the Bayesian case, do they write $l(x \mid \theta)$ instead of $l(\theta \mid x)$?

Another example is when I want to find the Fisher information:

$$I(\theta \mid x)=\text{E}_x\!\left[\left(\frac{\partial}{\partial\theta}\log l(x \mid \theta)\right)^{2}\right]$$

In this case, if we are differentiating with respect to $\theta$, why would you write $l(x \mid \theta)$ and not $l(\theta \mid x)$? Also, why would you write a partial derivative if $x$ is known?

I am so confused, and my guess is that some mathematical conventions have to be thought about carefully.

Best Answer

To answer your first question: yes, the likelihood function $l(x \mid \theta)$ is the same regardless of your perspective on whether $\theta$ is a random variable or a fixed quantity, that is, regardless of whether you're a Bayesian or a frequentist. Likelihood calculations always assume that some fixed parameter of a distribution generated your data, which is random. So you always want to think about it as: "What is the parameter that probably generated the data I'm seeing?" Every time you run the experiment you'll get new data and your estimate for that parameter will change. I hope that explains why it's not written as $l(\theta \mid x)$. Inferences on $\theta$ given data are posterior distribution ($\pi(\theta \mid x)$) inferences, not likelihood inferences. As you stated:

$$\pi(\theta \mid x) = \frac{l(x \mid \theta)\pi(\theta)}{l(x)}$$

where $l(x) = \int l(x \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal likelihood: the likelihood of the data weighted by the prior and integrated over every possible value of $\theta$. You're effectively removing $\theta$ from your posterior calculation by marginalizing it out. The numerator, on the other hand, is the likelihood of the data given $\theta$ weighted by your prior, $\pi(\theta)$. For more info please browse the site or ask any question you have; there is a lot one can say about the likelihood function.
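
As a hedged illustration of that last point (my own sketch, not anything from your post), here is a minimal grid approximation assuming a Binomial likelihood and a Beta(2, 2) prior with made-up numbers. It shows the marginal likelihood as the prior-weighted likelihood summed over $\theta$, and the posterior as numerator divided by that constant:

```python
import numpy as np
from scipy import stats

# Made-up data: 7 successes in n = 10 trials (illustrative only)
n, x = 10, 7

# Grid of candidate parameter values
theta = np.linspace(0.001, 0.999, 999)

# Likelihood l(x | theta) evaluated at every theta on the grid
likelihood = stats.binom.pmf(x, n, theta)

# Prior pi(theta): Beta(2, 2), chosen purely for illustration
prior = stats.beta.pdf(theta, 2, 2)

# Marginal likelihood l(x) = integral of l(x | theta) * pi(theta) d(theta),
# approximated by a Riemann sum over the grid
dtheta = theta[1] - theta[0]
marginal = np.sum(likelihood * prior) * dtheta

# Posterior pi(theta | x) = l(x | theta) * pi(theta) / l(x)
posterior = likelihood * prior / marginal

print(marginal)                     # normalising constant, free of theta
print(theta[np.argmax(posterior)])  # posterior mode, close to (x + 1) / (n + 2)
```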

For the second question, I wouldn't get too tripped up on taking the derivative with respect to a value that you are conditioning on. You aren't conditioning on a specific value for $\theta$ (although this is possible when calculating Observed Fisher's Information). You're simply indicating that $\theta$ is not a random variable - because it's not in this case. That way, when you take expectations, you treat $\theta$ as a constant (even if you don't know what that number is) and you take expectations of $x$, which follows some probability distribution (Binomial, Poisson, ...) with some fixed parameter $\theta$.

Regarding the partial derivative: again, don't worry about the notation; it could have been an ordinary derivative. The Fisher information is often written as the negative expected second derivative with respect to $\theta$ (so two derivatives), and it is a matrix when there is more than one parameter, so partial derivatives are common notation.

$$I(\theta \mid x) = -\,\text{E}_x\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log l(x \mid \theta)\right]$$
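
To make the "expectation over $x$ with $\theta$ held fixed" point concrete, here is a small numerical sketch of my own, assuming a Binomial$(n, \theta)$ model with arbitrary numbers (so the specific values are illustrative only). Both standard forms of the information agree with the closed form $n/\big(\theta(1-\theta)\big)$:

```python
import numpy as np
from scipy import stats

# Assumed model for illustration: x ~ Binomial(n, theta), with theta treated
# as a fixed (if unknown) constant; the expectations below are over x only.
n, theta = 10, 0.3
x = np.arange(n + 1)
px = stats.binom.pmf(x, n, theta)   # distribution of x at the fixed theta

# Score: d/dtheta log l(x | theta) = x/theta - (n - x)/(1 - theta)
score = x / theta - (n - x) / (1 - theta)

# Second derivative: d^2/dtheta^2 log l(x | theta)
second = -x / theta**2 - (n - x) / (1 - theta)**2

fisher_sq   = np.sum(px * score**2)   # E_x[(d/dtheta log l)^2]
fisher_hess = -np.sum(px * second)    # -E_x[d^2/dtheta^2 log l]

print(fisher_sq, fisher_hess, n / (theta * (1 - theta)))  # all ~ 47.62
```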
