The answer depends on whether you are dealing with discrete or continuous random variables. So, I will split my answer accordingly. I will assume that you want some technical details and not necessarily an explanation in plain English.
Discrete Random Variables
Suppose that you have a stochastic process that takes discrete values (e.g., the outcomes of tossing a coin 10 times, or the number of customers who arrive at a store in 10 minutes). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., that the probability of the coin landing heads is $p$ and that the tosses are independent).
Denote the observed outcomes by $O$ and the set of parameters that describe the stochastic process as $\theta$. Thus, when we speak of probability we want to calculate $P(O|\theta)$. In other words, given specific values for $\theta$, $P(O|\theta)$ is the probability that we would observe the outcomes represented by $O$.
However, when we model a real-life stochastic process, we often do not know $\theta$. We simply observe $O$, and the goal then is to arrive at an estimate for $\theta$ that would be a plausible choice given the observed outcomes $O$. We know that given a value of $\theta$ the probability of observing $O$ is $P(O|\theta)$. Thus, a 'natural' estimation process is to choose the value of $\theta$ that would maximize the probability that we would actually observe $O$. In other words, we find the parameter values $\theta$ that maximize the following function:
$L(\theta|O) = P(O|\theta)$
$L(\theta|O)$ is called the likelihood function. Notice that by definition the likelihood function is conditioned on the observed $O$ and that it is a function of the unknown parameters $\theta$.
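As a concrete sketch of this idea (the data here are hypothetical, not from the answer), we can evaluate the coin-toss likelihood $L(p|O)$ on a grid of candidate values for $p$ and keep the maximizer:

```python
from math import comb

# Hypothetical data: 7 heads observed in 10 independent tosses.
n, heads = 10, 7

def likelihood(p):
    # L(p | O) = P(O | p) under a binomial model: C(n, k) p^k (1 - p)^(n - k)
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

# Scan a grid of candidate parameter values and keep the maximizer.
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=likelihood)
print(best_p)  # 0.7, i.e., heads / n
```

The grid maximizer lands exactly on the sample proportion $7/10$, which is the closed-form maximum likelihood estimate for this model.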
Continuous Random Variables
In the continuous case the situation is similar with one important difference. We can no longer talk about the probability that we observed $O$ given $\theta$ because in the continuous case $P(O|\theta) = 0$. Without getting into technicalities, the basic idea is as follows:
Denote the probability density function (pdf) associated with the outcomes $O$ as: $f(O|\theta)$. Thus, in the continuous case we estimate $\theta$ given observed outcomes $O$ by maximizing the following function:
$L(\theta|O) = f(O|\theta)$
In this situation, we cannot technically claim that we are finding the parameter value that maximizes the probability of observing $O$; rather, we maximize the pdf evaluated at the observed outcomes $O$.
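To illustrate the continuous case (with hypothetical data and a Normal model chosen purely for illustration), the likelihood is the pdf evaluated at the observed points, which is a density product rather than a probability:

```python
import math

# Hypothetical continuous observations.
O = [4.8, 5.1, 5.3, 4.9]

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(mu, sigma):
    # L(theta | O) = f(O | theta): the product of densities at the observed points.
    prod = 1.0
    for x in O:
        prod *= normal_pdf(x, mu, sigma)
    return prod

# A density, unlike a probability, can exceed 1 at a point:
print(normal_pdf(5.0, 5.0, 0.05))  # about 7.98
# Parameters near the data yield a higher likelihood than distant ones:
print(likelihood(5.0, 0.2) > likelihood(10.0, 0.2))  # True
```

The first print makes the point in the text concrete: $f(O|\theta)$ values are not probabilities, so "maximizing the probability of $O$" is not literally what happens here.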
There are three types of statisticians:
- those that (prefer to) work with real data,
- those that (prefer to) work with simulated data,
- those that (prefer to) work with the symbol $X$.
Math stat types would be (3). Typically, type (1) statisticians have some prefix attached to make clear the source of the data they work with (biostatistics, econometrics, psychometrics, ...), because these fields have implicit shared assumptions about the data they use and some commonly accepted ordering of the plausibility of these assumptions.
Best Answer
The likelihood relates your data to a set of parameters. It is typically written as $P(D | \theta)$ (or $\mathcal{L}(\theta | D)$, since the likelihood can be viewed as a function of the parameters, holding the data constant),
where $\theta$ contains all of the parameters necessary for the model. For example, suppose we have a bunch of iid data $D = \{x_1, ..., x_n\}$ and we want to see how well it fits a Normal distribution. Then $\theta = \{\mu, \sigma\}$ and $P(D | \theta) = \prod_i \mathcal{N}(x_i; \mu, \sigma)$. One approach to fitting this model is maximum likelihood, which is exactly what it sounds like: we take the likelihood function and attempt to maximize it by changing the parameter settings (keeping the observed data constant). This is usually done by computing the derivative of the likelihood w.r.t. each parameter, setting it to 0, and solving (side note: it is common to first take the logarithm of the likelihood function to make the derivatives easier to work with).
Alternatively, we could take a Bayesian approach: assign a prior probability distribution over the parameters and compute the posterior distribution to fit them: $P(\theta | D) \propto P(D | \theta) P(\theta)$. In this case we treat the parameters as random variables and thus must define a distribution over their possible values. The prior distribution can encode any prior knowledge we may have about the parameters. For instance, we may have a good idea of the plausible range for $\mu$ and could thus assign a prior distribution that pulls the estimate of $\mu$ toward these values.
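A minimal sketch of this Bayesian update, assuming (as a simplification not made in the text) that $\sigma$ is known and the prior on $\mu$ is Normal, so the posterior has a conjugate closed form; all numbers are hypothetical:

```python
# Conjugate Normal-Normal update for mu, with sigma treated as known.
sigma = 1.0           # assumed-known observation noise
m0, s0 = 0.0, 10.0    # prior: mu ~ N(m0, s0^2), deliberately weak (large s0)

X = [2.1, 1.9, 2.4, 2.0, 2.6]
n = len(X)
xbar = sum(X) / n

# P(mu | D) is proportional to P(D | mu) P(mu) and is again Normal;
# precisions (inverse variances) add, and the posterior mean is a
# precision-weighted average of the prior mean and the sample mean.
post_prec = 1 / s0**2 + n / sigma**2
post_mean = (m0 / s0**2 + n * xbar / sigma**2) / post_prec

print(post_mean)  # ~2.196: pulled very slightly from xbar = 2.2 toward m0 = 0
```

Because the prior is weak, the posterior mean sits almost on top of the maximum likelihood estimate $\bar{x}$; a tighter prior (smaller `s0`) would pull it further toward `m0`.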
To recap:
- Likelihood: $P(D | \theta)$, the function linking data to parameters.
- Prior: $P(\theta)$, a distribution over possible parameter values (used in Bayesian analysis).