We can write Bayes' theorem as
$$p(\theta \mid x) = \frac{f(x \mid \theta)\,p(\theta)}{\int f(x \mid \theta)\,p(\theta)\,d\theta}$$
where $p(\theta \mid x)$ is the posterior, $f(x \mid \theta)$ is the conditional distribution, and $p(\theta)$ is the prior,
or as
$$p(\theta \mid x) = \frac{L(\theta \mid x)\,p(\theta)}{\int L(\theta \mid x)\,p(\theta)\,d\theta}$$
where $p(\theta \mid x)$ is the posterior, $L(\theta \mid x)$ is the likelihood function, and $p(\theta)$ is the prior.
My questions are:
- Why is Bayesian analysis done using the likelihood function and not the conditional distribution?
- Can you say in words what the difference between the likelihood and the conditional distribution is? I know the likelihood is not a probability distribution and that $L(\theta \mid x) \propto f(x \mid \theta)$.
Best Answer
Suppose that you have $X_1,\dots,X_n$ random variables (whose values will be observed in your experiment) that are conditionally independent, given that $\Theta=\theta$, with conditional densities $f_{X_i\mid\Theta}(\,\cdot\mid\theta)$, for $i=1,\dots,n$. This is your (postulated) statistical (conditional) model, and the conditional densities express, for each possible value $\theta$ of the (random) parameter $\Theta$, your uncertainty about the values of the $X_i$'s, before you have access to any real data. With the help of the conditional densities you can, for example, compute conditional probabilities like $$ P\{X_1\in B_1,\dots,X_n\in B_n\mid \Theta=\theta\} = \int_{B_1\times\dots\times B_n} \prod_{i=1}^n f_{X_i\mid\Theta}(x_i\mid\theta)\,dx_1\dots dx_n \, , $$ for each $\theta$.
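Here is a minimal numerical sketch of this "pre-sample" computation. The model choice below, a conditionally i.i.d. $\mathcal{N}(\theta, 1)$ model, is my own assumption for illustration and is not fixed by the answer; the point is that conditional independence turns the $n$-dimensional integral into a product of one-dimensional interval probabilities.

```python
# Hypothetical model (an assumption for illustration, not from the answer):
# X_i | Theta = theta ~ Normal(theta, 1), conditionally i.i.d.
from scipy.stats import norm

def conditional_prob(intervals, theta):
    """P{X_1 in B_1, ..., X_n in B_n | Theta = theta} for intervals B_i = (a_i, b_i).

    By conditional independence, the joint probability factorizes into a
    product of one-dimensional interval probabilities.
    """
    p = 1.0
    for a, b in intervals:
        p *= norm.cdf(b, loc=theta) - norm.cdf(a, loc=theta)
    return p

# "Pre-sample" uncertainty about the observables, for two candidate theta values:
print(conditional_prob([(0, 1), (1, 2)], theta=0.0))
print(conditional_prob([(0, 1), (1, 2)], theta=1.0))
```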
After you have access to an actual sample $(x_1,\dots,x_n)$ of values (realizations) of the $X_i$'s that have been observed in one run of your experiment, the situation changes: there is no longer uncertainty about the observables $X_1,\dots,X_n$. Suppose that the random $\Theta$ assumes values in some parameter space $\Pi$. Now, you define, for those known (fixed) values $(x_1,\dots,x_n)$, a function $$ L_{x_1,\dots,x_n} : \Pi \to \mathbb{R} $$ by $$ L_{x_1,\dots,x_n}(\theta)=\prod_{i=1}^n f_{X_i\mid\Theta}(x_i\mid\theta) \, . $$ Note that $L_{x_1,\dots,x_n}$, known as the "likelihood function," is a function of $\theta$ alone. In this "after you have data" situation, the likelihood $L_{x_1,\dots,x_n}$ contains, for the particular conditional model that we are considering, all the information about the parameter $\Theta$ contained in this particular sample $(x_1,\dots,x_n)$. In fact, it happens that $L_{x_1,\dots,x_n}$ is a sufficient statistic for $\Theta$.
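To make the "after-sample" change of viewpoint concrete, here is a sketch under the same hypothetical Normal model (the data values are made up for illustration): the same product of densities is now evaluated at fixed observations and read as a function of $\theta$.

```python
import numpy as np
from scipy.stats import norm

x_obs = np.array([0.8, 1.3, 0.2])  # hypothetical realizations of X_1, X_2, X_3

def likelihood(theta):
    """L_{x_1,...,x_n}(theta) = prod_i f(x_i | theta), with the data held fixed."""
    return np.prod(norm.pdf(x_obs, loc=theta))

# A function of theta alone; for this Normal model it peaks at the sample mean:
for theta in (0.0, x_obs.mean(), 2.0):
    print(theta, likelihood(theta))
```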
To answer your questions: to understand the difference between the concepts of conditional density and likelihood, keep in mind their mathematical definitions (which are clearly different: they are different mathematical objects, with different properties), and also remember that the conditional density is a "pre-sample" object/concept, while the likelihood is an "after-sample" one. I hope that all this also helps you answer why Bayesian inference (using your way of putting it, which I don't think is ideal) is done "using the likelihood function and not the conditional distribution": the goal of Bayesian inference is to compute the posterior distribution, and to do so we condition on the observed (known) data.
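Continuing the same hypothetical Normal example, here is a sketch of that final step: multiply the likelihood by a prior (an assumed $\mathcal{N}(0, 2^2)$ prior, my choice for illustration) and normalize by the integral in the denominator of Bayes' theorem, approximated here on a grid.

```python
import numpy as np
from scipy.stats import norm

x_obs = np.array([0.8, 1.3, 0.2])           # same hypothetical data as above
grid = np.linspace(-4.0, 4.0, 2001)         # grid over the parameter space
dx = grid[1] - grid[0]

prior = norm.pdf(grid, loc=0.0, scale=2.0)  # p(theta): assumed N(0, 2^2) prior
lik = np.array([np.prod(norm.pdf(x_obs, loc=t)) for t in grid])

unnorm = lik * prior                        # L(theta | x) p(theta)
posterior = unnorm / (unnorm.sum() * dx)    # numerical stand-in for the integral

# Sanity check: for this Normal-Normal model the posterior is conjugate,
# N(mu_n, s_n^2) with s_n^2 = 1 / (n + 1/tau^2) and mu_n = s_n^2 * sum(x_i).
n, tau2 = len(x_obs), 4.0
s_n2 = 1.0 / (n + 1.0 / tau2)
mu_n = s_n2 * x_obs.sum()
print(grid[np.argmax(posterior)], mu_n)     # posterior mode matches conjugate mean
```

Note that the code normalizes the very same product of likelihood and prior that appears in the numerator of Bayes' theorem; conditioning on the observed data is exactly what fixes `x_obs` and leaves $\theta$ as the only free variable.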