I understand that likelihood differs from a probability distribution because likelihood describes the probability of certain parameter values given the data that you've observed (it's essentially a distribution that describes observed data), while a probability distribution describes the probability of observing certain values given constant parameter values. But what is a marginal likelihood, and how does it relate to posterior distributions? Preferably, explain this with little or no probability notation so that the explanation is more intuitive. Any examples would be great as well.
Solved – In the most basic sense, what is marginal likelihood
bayesian, likelihood, probability
Related Solutions
The answer depends on whether you are dealing with discrete or continuous random variables. So, I will split my answer accordingly. I will assume that you want some technical details and not necessarily an explanation in plain English.
Discrete Random Variables
Suppose that you have a stochastic process that takes discrete values (e.g., the outcomes of tossing a coin 10 times, or the number of customers who arrive at a store in 10 minutes). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., that the probability of the coin landing heads is $p$ and that the coin tosses are independent).
Denote the observed outcomes by $O$ and the set of parameters that describe the stochastic process as $\theta$. Thus, when we speak of probability we want to calculate $P(O|\theta)$. In other words, given specific values for $\theta$, $P(O|\theta)$ is the probability that we would observe the outcomes represented by $O$.
However, when we model a real life stochastic process, we often do not know $\theta$. We simply observe $O$ and the goal then is to arrive at an estimate for $\theta$ that would be a plausible choice given the observed outcomes $O$. We know that given a value of $\theta$ the probability of observing $O$ is $P(O|\theta)$. Thus, a 'natural' estimation process is to choose that value of $\theta$ that would maximize the probability that we would actually observe $O$. In other words, we find the parameter values $\theta$ that maximize the following function:
$L(\theta|O) = P(O|\theta)$
$L(\theta|O)$ is called the likelihood function. Notice that by definition the likelihood function is conditioned on the observed $O$ and that it is a function of the unknown parameters $\theta$.
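As a minimal sketch of the discrete case, the coin example can be worked numerically. The data (7 heads in 10 tosses), the function name `binomial_likelihood`, and the grid search are illustrative assumptions, not part of the original answer; the grid search simply stands in for any maximization method.

```python
from math import comb

def binomial_likelihood(p, heads, tosses):
    """P(O | p): probability of `heads` heads in `tosses` independent tosses."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)

# Hypothetical observed outcomes O: 7 heads in 10 tosses.
heads, tosses = 7, 10

# Treat the same formula as a function of p (the likelihood) and
# evaluate it on a grid of candidate parameter values.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=lambda p: binomial_likelihood(p, heads, tosses))

print(p_hat)  # the grid point with the largest likelihood
```

Under these assumptions the maximizer coincides with the sample proportion `heads / tosses`, which is what one would expect for a binomial model.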
Continuous Random Variables
In the continuous case the situation is similar, with one important difference. We can no longer talk about the probability that we observed $O$ given $\theta$, because in the continuous case the probability of observing any exact value is zero, i.e., $P(O|\theta) = 0$. Without getting into technicalities, the basic idea is as follows:
Denote the probability density function (pdf) associated with the outcomes $O$ as: $f(O|\theta)$. Thus, in the continuous case we estimate $\theta$ given observed outcomes $O$ by maximizing the following function:
$L(\theta|O) = f(O|\theta)$
In this situation, we cannot technically assert that we are finding the parameter value that maximizes the probability of observing $O$, because what we maximize is the probability density evaluated at the observed outcomes $O$, and a density value is not itself a probability.
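A small sketch of the continuous case, assuming a normal model with known standard deviation and an unknown mean. The data values, the helper names `normal_pdf` and `log_likelihood`, and the use of the log-likelihood (which has the same maximizer as the likelihood) are all illustrative assumptions.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density f(x | mu, sigma) of a normal distribution."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def log_likelihood(mu, data, sigma=1.0):
    """Log of L(mu | data) = product of f(x | mu, sigma) over independent observations."""
    return sum(log(normal_pdf(x, mu, sigma)) for x in data)

# Hypothetical continuous observations O.
data = [4.8, 5.1, 5.3, 4.9, 5.4]

# Maximize over a grid of candidate means between 0 and 10.
grid = [i / 100 for i in range(1001)]
mu_hat = max(grid, key=lambda mu: log_likelihood(mu, data))

# A density value can exceed 1, so it cannot be a probability.
peak_density = normal_pdf(0.0, 0.0, 0.1)

print(mu_hat, peak_density)
```

Here the maximizing grid point matches the sample mean of the hypothetical data, and `peak_density` is larger than 1, which illustrates why the maximized quantity is a density rather than a probability.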
From a technical point of view, here is the argument:
For densities (the argument is analogous in the discrete case), Bayes' theorem reads $$ \pi \left( \theta |y\right) =\frac{f\left( y|\theta \right) \pi \left(\theta \right) }{f(y)} $$

The normalizing constant $f(y)$ is obtained by writing the marginal density of $y$ as the joint density with $\theta$ integrated out, and then factoring the joint density as conditional times marginal: \begin{align*} f(y)&=\int f\left( y,\theta \right) d\theta\\ &=\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta \end{align*}

This constant ensures that the posterior integrates to 1, because \begin{align*} \int \pi \left( \theta |y\right) d\theta&=\int\frac{f\left( y|\theta \right) \pi \left(\theta \right) }{\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta}d\theta\\ &=\frac{\int f\left( y|\theta \right) \pi \left(\theta \right) d\theta}{\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta}\\ &=1, \end{align*} where the denominator can be taken out of the outer integral because $\theta$ has already been integrated out there, so it no longer depends on $\theta$.
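The normalizing constant $f(y)$ above is the marginal likelihood, and it can be approximated numerically in a simple case. The sketch below assumes a coin with unknown heads probability $\theta$, a uniform prior on $[0, 1]$, hypothetical data of 7 heads in 10 tosses, and a midpoint-rule integral; the function names are illustrative.

```python
from math import comb

def likelihood(theta, heads=7, tosses=10):
    """f(y | theta) for hypothetical data: 7 heads in 10 tosses."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)

def prior(theta):
    """pi(theta): uniform prior density on [0, 1] (an assumption)."""
    return 1.0

# Midpoint rule on [0, 1]: average likelihood, weighted by the prior.
n = 10_000
width = 1.0 / n
midpoints = [(i + 0.5) * width for i in range(n)]

# Marginal likelihood f(y) = integral of f(y | theta) * pi(theta) d(theta).
marginal = sum(likelihood(t) * prior(t) for t in midpoints) * width

def posterior(theta):
    """pi(theta | y) = f(y | theta) * pi(theta) / f(y)."""
    return likelihood(theta) * prior(theta) / marginal

# Dividing by the marginal likelihood makes the posterior integrate to 1.
posterior_total = sum(posterior(t) for t in midpoints) * width

print(marginal, posterior_total)
```

For this particular model the integral has the closed form $1/11$, so `marginal` should be close to 0.0909. Read intuitively, the marginal likelihood is the probability of the data averaged over all parameter values, with each value weighted by how plausible the prior considers it.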
Best Answer
In Bayesian statistics, the marginal likelihood $$m(x) = \int_\Theta f(x|\theta)\pi(\theta)\,\text d\theta$$ where
is a misnomer in that
Other names for $m(x)$ are evidence, prior predictive, partition function. It has however several important roles:
See also
Normalizing constant in Bayes theorem
Normalizing constant irrelevant in Bayes theorem?
Intuition of Bayesian normalizing constant