Priors vs Likelihood – Understanding the Difference

Tags: likelihood, prior, terminology

I have been reading some statistics papers and I keep confusing the terms "prior" and "likelihood".

Would it be possible to explain the difference between the two terms? I am interested in both a "down-to-earth" explanation with examples and the mathematical and statistical details.

Thanks.

Best Answer

The likelihood relates your data to a set of parameters. It is typically written as $P(D | \theta)$ (or $\mathcal{L}(\theta | D)$, because the likelihood can be viewed as a function of the parameters, holding the data constant),

where $\theta$ contains all of the parameters of the model. For example, suppose we have a bunch of iid data $X = \{x_1, ..., x_n\}$ and we want to see how well a Normal distribution fits it. Then $\theta = \{\mu, \sigma\}$ and $P(D | \theta) = \prod_i \mathcal{N}(x_i; \mu, \sigma)$. One approach to fitting this model is maximum likelihood, which is exactly what it sounds like: we take the likelihood function and find the parameter values that maximize it, keeping the observed data constant. This is usually done by computing the derivative of the likelihood w.r.t. each parameter, setting it to 0, and solving (side note: it is common to first take the logarithm of the likelihood, which turns the product into a sum and makes the derivatives easier to work with).
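As a concrete sketch of this Normal example in Python (the data, true parameter values, and variable names below are made up for illustration), we can maximize the likelihood numerically and compare against the well-known closed-form answers:

```python
import numpy as np
from scipy import stats, optimize

# Simulated iid data (hypothetical: true mu = 2.0, sigma = 1.5).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)

# Negative log-likelihood of the Normal model, viewed as a function of
# theta = (mu, log_sigma); parameterizing by log(sigma) keeps sigma > 0.
def neg_log_lik(theta):
    mu, log_sigma = theta
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Maximizing the likelihood = minimizing the negative log-likelihood.
result = optimize.minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# The closed-form MLEs (sample mean, biased sample std) should agree
# with the numerical optimum up to optimizer tolerance.
print(mu_hat, sigma_hat)
print(x.mean(), x.std())  # np.std uses ddof=0 by default, i.e. the MLE
```

For the Normal case you would use the closed-form answers directly in practice; the numerical route is shown because it is what the "take derivatives, set to 0, solve" recipe generalizes to when no closed form exists.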

Alternatively, we could take a Bayesian approach: assign a prior probability distribution over the parameters and compute the posterior distribution, $P(\theta | D) \propto P(D | \theta) P(\theta)$. In this case we treat the parameters as random variables and thus must define a distribution over their possible values. The prior distribution can encode any prior knowledge we have about the parameters. For instance, we may have a good idea of the plausible range of $\mu$ and could assign a prior distribution that pulls the estimate of $\mu$ toward those values.
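To make the prior's effect concrete, here is a minimal sketch (all numbers are hypothetical) of the conjugate Normal-Normal case: $\sigma$ is assumed known and the prior on $\mu$ is itself Normal, so the posterior over $\mu$ has a closed form:

```python
import numpy as np

# Hypothetical setup: sigma is known; the prior encodes a belief that
# mu lies near 0, with prior standard deviation tau0.
rng = np.random.default_rng(1)
sigma = 1.5                        # known observation noise
x = rng.normal(loc=2.0, scale=sigma, size=20)

mu0, tau0 = 0.0, 1.0               # prior: mu ~ N(mu0, tau0^2)

# Conjugate Normal-Normal update: the posterior over mu is also Normal.
# Precisions (1/variance) add, and the posterior mean is a
# precision-weighted average of the prior mean and the data.
n = len(x)
post_prec = 1.0 / tau0**2 + n / sigma**2
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

# With little data the prior pulls the estimate toward mu0; as n grows,
# the posterior mean approaches the sample mean.
print(post_mean, np.sqrt(post_var), x.mean())
```

The posterior mean sitting between the prior mean and the sample mean is exactly the "pull toward the prior" described above, and the strength of the pull is controlled by how much data you have relative to how tight the prior is.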

To recap:

- Likelihood: $P(D | \theta)$, which links the data to the parameters.
- Prior: $P(\theta)$, a distribution over possible parameter values (used in Bayesian analysis).