I think this is largely unnecessary hair-splitting.
Conditional probability $P(x\mid y)\equiv P(X=x \mid Y=y)$ of $x$ given $y$ is defined for two random variables $X$ and $Y$ taking values $x$ and $y$. But we can also talk about probability $P(x\mid\theta)$ of $x$ given $\theta$ where $\theta$ is not a random variable but a parameter.
Note that in both cases the same term "given" and the same notation $P(\cdot\mid\cdot)$ can be used. There is no need to invent different notations. Moreover, what is called "parameter" and what is called "random variable" can depend on your philosophy, but the math does not change.
The first quote from Wikipedia states that $\mathcal{L}(\theta \mid x) = P(x \mid \theta)$ by definition. Here it is assumed that $\theta$ is a parameter. The second quote says that $\mathcal{L}(\theta \mid x)$ is not a conditional probability. This means that it is not a conditional probability of $\theta$ given $x$; and indeed it cannot be, because $\theta$ is assumed to be a parameter here.
In the context of Bayes' theorem $$P(a\mid b)=\frac{P(b\mid a)P(a)}{P(b)},$$ both $a$ and $b$ are random variables. But we can still call $P(b\mid a)$ "likelihood" (of $a$), and now it is also a bona fide conditional probability (of $b$). This terminology is standard in Bayesian statistics. Nobody says it is something "similar" to the likelihood; people simply call it the likelihood.
Note 1: In the last paragraph, $P(b\mid a)$ is obviously a conditional probability of $b$. As a likelihood $\mathcal L(a\mid b)$ it is seen as a function of $a$; but it is not a probability distribution (or conditional probability) of $a$! Its integral over $a$ does not necessarily equal $1$. (Whereas its integral over $b$ does.)
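To make Note 1 concrete, here is a small numerical sketch (my own illustration; the Bernoulli model and the particular sample are assumptions for the example, not part of the argument above): for a fixed parameter the probabilities of the data sum to $1$, but the same expression integrated over the parameter does not.

```python
def bernoulli_pmf(x, p):
    """P(x | p) for a single Bernoulli trial, x in {0, 1}."""
    return p if x == 1 else 1 - p

# For fixed p, summing over the data x gives 1 (a genuine distribution in x):
p = 0.3
total_over_x = sum(bernoulli_pmf(x, p) for x in (0, 1))  # = 1.0

# Viewed as a likelihood L(p | x), the same expression integrated over p
# need not give 1.  Midpoint Riemann sum for the sample x = (1, 1, 0, 1, 1):
sample = (1, 1, 0, 1, 1)

def likelihood(p):
    out = 1.0
    for x in sample:
        out *= bernoulli_pmf(x, p)
    return out

m = 100_000
integral_over_p = sum(likelihood((i + 0.5) / m) for i in range(m)) / m
# integral_over_p is about 1/30, far from 1
```

The integral over the parameter happens to be $1/30$ here, which is exactly the normalizing constant that reappears in the Bernoulli example at the end of this thread.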
Note 2: Sometimes likelihood is defined up to an arbitrary proportionality constant, as emphasized by @MichaelLew (because most of the time people are interested in likelihood ratios). This can be useful, but is not always done and is not essential.
See also What is the difference between "likelihood" and "probability"? and in particular @whuber's answer there.
I fully agree with @Tim's answer in this thread too (+1).
One precise formulation of Bayes' Theorem is the following, taken verbatim from Schervish's Theory of Statistics (1995).
The conditional distribution of $\Theta$ given $X=x$ is called the posterior distribution of $\Theta$.
The next theorem shows us how to calculate the posterior distribution of a parameter in the case in which there is a measure $\nu$ such that each $P_\theta \ll \nu$.
Theorem 1.31 (Bayes' theorem).
Suppose that $X$ has a parametric family $\mathcal{P}_0$ of distributions with parameter space $\Omega$.
Suppose that $P_\theta \ll \nu$ for all $\theta \in \Omega$, and let $f_{X\mid\Theta}(x\mid\theta)$ be the conditional density (with respect to $\nu$) of $X$ given $\Theta = \theta$.
Let $\mu_\Theta$ be the prior distribution of $\Theta$.
Let $\mu_{\Theta\mid X}(\cdot \mid x)$ denote the conditional distribution of $\Theta$ given $X = x$.
Then $\mu_{\Theta\mid X} \ll \mu_\Theta$, a.s. with respect to the marginal of $X$, and the Radon-Nikodym derivative is
$$
\tag{1}
\label{1}
\frac{d\mu_{\Theta\mid X}}{d\mu_\Theta}(\theta \mid x)
= \frac{f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)}
$$
for those $x$ such that the denominator is neither $0$ nor infinite.
The prior predictive probability of the set of $x$ values such that the denominator is $0$ or infinite is $0$, hence the posterior can be defined arbitrarily for such $x$ values.
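To see what the theorem says in the simplest possible case, here is a discrete sanity check (my own toy example, not from Schervish): when $\Omega$ is finite and both $\mu_\Theta$ and $\nu$ are supported on finitely many points, the Radon-Nikodym derivative $d\mu_{\Theta\mid X}/d\mu_\Theta$ is just the ratio of posterior to prior mass, and \eqref{1} reduces to elementary Bayes' rule.

```python
# Omega = {0.2, 0.7} with counting measure; nu = counting measure on {0, 1}.
prior = {0.2: 0.6, 0.7: 0.4}   # mu_Theta

def f(x, theta):
    """Bernoulli(theta) density of X w.r.t. counting measure nu."""
    return theta if x == 1 else 1 - theta

x = 1  # observed data

# Denominator of (1): the prior predictive density of x.
denom = sum(f(x, t) * prior[t] for t in prior)

# Posterior computed directly from the joint distribution:
joint = {t: f(x, t) * prior[t] for t in prior}
posterior = {t: joint[t] / sum(joint.values()) for t in prior}

# Formula (1): the posterior/prior mass ratio equals f(x | t) / denom:
for t in prior:
    assert abs(posterior[t] / prior[t] - f(x, t) / denom) < 1e-12
```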
Edit 1.
The setup for this theorem is as follows:
- There is some underlying probability space $(S, \mathcal{S}, \Pr)$ with respect to which all probabilities are computed.
- There is a standard Borel space $(\mathcal{X}, \mathcal{B})$ (the sample space) and a measurable map $X : S \to \mathcal{X}$ (the sample or data).
- There is a standard Borel space $(\Omega, \tau)$ (the parameter space) and a measurable map $\Theta : S \to \Omega$ (the parameter).
- The distribution of $\Theta$ is $\mu_\Theta$ (the prior distribution); this is the probability measure on $(\Omega, \tau)$ given by $\mu_\Theta(A) = \Pr(\Theta \in A)$ for all $A \in \tau$.
- The distribution of $X$ is $\mu_X$ (the marginal distribution mentioned in the theorem); this is the probability measure on $(\mathcal{X}, \mathcal{B})$ given by $\mu_X(B) = \Pr(X \in B)$ for all $B \in \mathcal{B}$.
- There is a probability kernel $P : \Omega \times \mathcal{B} \to [0, 1]$, denoted $(\theta, B) \mapsto P_\theta(B)$, which represents the conditional distribution of $X$ given $\Theta$. This means that
- for each $B \in \mathcal{B}$, the map $\theta \mapsto P_\theta(B)$ from $\Omega$ into $[0, 1]$ is measurable,
- $P_\theta$ is a probability measure on $(\mathcal{X}, \mathcal{B})$ for each $\theta \in \Omega$, and
- for all $A \in \tau$ and $B \in \mathcal{B}$,
$$
\Pr(\Theta \in A, X \in B) = \int_A P_\theta(B) \, d\mu_\Theta(\theta).
$$
This is the parametric family $\mathcal{P}_0 = \{P_\theta : \theta \in \Omega\}$ of distributions of $X$ given $\Theta$ referred to in the theorem.
- We assume that there exists a measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ such that $P_\theta \ll \nu$ for all $\theta \in \Omega$, and we choose a version $f_{X\mid\Theta}(\cdot\mid\theta)$ of the Radon-Nikodym derivative $d P_\theta / d \nu$ (strictly speaking, the guaranteed existence of this Radon-Nikodym derivative might require $\nu$ to be $\sigma$-finite).
This means that
$$
P_\theta(B) = \int_B f_{X\mid\Theta}(x \mid \theta) \, d\nu(x)
$$
for all $B \in \mathcal{B}$.
It follows that
$$
\Pr(\Theta \in A, X \in B)
= \int_A \int_B f_{X \mid \Theta}(x \mid \theta) \, d\nu(x) \, d\mu_\Theta(\theta)
$$
for all $A \in \tau$ and $B \in \mathcal{B}$. We may assume without loss of generality (e.g., see exercise 9 in Chapter 1 of Schervish's book) that the map $(x, \theta) \mapsto f_{X\mid \Theta}(x\mid\theta)$ of $\mathcal{X}\times\Omega$ into $[0, \infty]$ is measurable. Then by Tonelli's theorem we can change the order of integration:
$$
\Pr(\Theta \in A, X \in B)
= \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x)
$$
for all $A \in \tau$ and $B \in \mathcal{B}$.
In particular, the marginal probability of a set $B \in \mathcal{B}$ is
$$
\mu_X(B) = \Pr(X \in B)
= \int_B \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x),
$$
which shows that $\mu_X \ll \nu$, with Radon-Nikodym derivative
$$
\frac{d\mu_X}{d\nu}(x)
= \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta).
$$
- There exists a probability kernel $\mu_{\Theta \mid X} : \mathcal{X} \times \tau \to [0, 1]$, denoted $(x, A) \mapsto \mu_{\Theta \mid X}(A \mid x)$, which represents the conditional distribution of $\Theta$ given $X$ (i.e., the posterior distribution).
This means that
- for each $A \in \tau$, the map $x \mapsto \mu_{\Theta \mid X}(A \mid x)$ from $\mathcal{X}$ into $[0, 1]$ is measurable,
- $\mu_{\Theta \mid X}(\cdot \mid x)$ is a probability measure on $(\Omega, \tau)$ for each $x \in \mathcal{X}$, and
- for all $A \in \tau$ and $B \in \mathcal{B}$,
$$
\Pr(\Theta \in A, X \in B) = \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x).
$$
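The two kernel identities in this setup can be checked numerically in a finite toy model (my own construction, not part of Schervish's setup): the joint probability $\Pr(\Theta \in A, X \in B)$ can be computed either by integrating $P_\theta(B)$ against the prior or by integrating $\mu_{\Theta\mid X}(A \mid x)$ against the marginal of $X$, and the two agree.

```python
thetas = (0.2, 0.7)
prior = {0.2: 0.6, 0.7: 0.4}   # mu_Theta
xs = (0, 1)

def P(theta, x):
    """Density of P_theta w.r.t. counting measure nu on {0, 1}."""
    return theta if x == 1 else 1 - theta

# Joint pmf, marginal of X, and the posterior kernel mu_{Theta|X}(. | x):
joint = {(t, x): prior[t] * P(t, x) for t in thetas for x in xs}
mu_X = {x: sum(joint[(t, x)] for t in thetas) for x in xs}
post = {(t, x): joint[(t, x)] / mu_X[x] for t in thetas for x in xs}

A, B = {0.7}, {1}  # events in the parameter space and sample space

# Pr(Theta in A, X in B) via the kernel P_theta against the prior...
via_prior = sum(P(t, x) * prior[t] for t in A for x in B)
# ...and via the posterior kernel against the marginal of X:
via_marginal = sum(post[(t, x)] * mu_X[x] for t in A for x in B)
```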
Edit 2.
Given the setup above, the proof of Bayes' theorem is relatively straightforward.
Proof.
Following Schervish, let
$$
C_0 = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = 0\right\}
$$
and
$$
C_\infty = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \infty\right\}
$$
(these are the sets of potentially problematic $x$ values for the denominator of the right-hand side of \eqref{1}).
We have
$$
\mu_X(C_0)
= \Pr(X \in C_0)
= \int_{C_0} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = 0,
$$
and
$$
\mu_X(C_\infty)
= \int_{C_\infty} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x)
= \begin{cases}
\infty, & \text{if $\nu(C_\infty) > 0$,} \\
0, & \text{if $\nu(C_\infty) = 0$.}
\end{cases}
$$
Since $\mu_X(C_\infty) = \infty$ is impossible ($\mu_X$ is a probability measure), it follows that $\nu(C_\infty) = 0$, whence $\mu_X(C_\infty) = 0$ as well.
Thus, $\mu_X(C_0 \cup C_\infty) = 0$, so the set of all $x \in \mathcal{X}$ such that the denominator of the right-hand side of \eqref{1} is zero or infinite has zero marginal probability.
Next, observe that if $A \in \tau$ and $B \in \mathcal{B}$, then
$$
\Pr(\Theta \in A, X \in B)
= \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x)
$$
and simultaneously
$$
\begin{aligned}
\Pr(\Theta \in A, X \in B)
&= \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x) \\
&= \int_B \left( \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \right) \, d\nu(x).
\end{aligned}
$$
It follows that
$$
\mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)
= \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta)
$$
for all $A \in \tau$ and $\nu$-a.e. $x \in \mathcal{X}$, and hence
$$
\mu_{\Theta \mid X}(A \mid x)
= \int_A \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)} \, d\mu_\Theta(\theta)
$$
for all $A \in \tau$ and $\mu_X$-a.e. $x \in \mathcal{X}$.
Thus, for $\mu_X$-a.e. $x \in \mathcal{X}$, $\mu_{\Theta\mid X}(\cdot \mid x) \ll \mu_\Theta$, and the Radon-Nikodym derivative is
$$
\frac{d\mu_{\Theta \mid X}}{d \mu_\Theta}(\theta \mid x)
= \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)},
$$
as claimed, completing the proof.
Lastly, how do we reconcile the colloquial version of Bayes' theorem found so commonly in statistics/machine learning literature, namely,
$$
\tag{2}
\label{2}
p(\theta \mid x)
= \frac{p(\theta) p(x \mid \theta)}{p(x)},
$$
with \eqref{1}?
On the one hand, the left-hand side of \eqref{2} is supposed to represent a density of the conditional distribution of $\Theta$ given $X$ with respect to some unspecified dominating measure on the parameter space.
In fact, none of the dominating measures for the four different densities in \eqref{2} (all named $p$) are explicitly mentioned.
On the other hand, the left-hand side of \eqref{1} is the density of the conditional distribution of $\Theta$ given $X$ with respect to the prior distribution.
If, in addition, the prior distribution $\mu_\Theta$ has a density $f_\Theta$ with respect to some (let's say $\sigma$-finite) measure $\lambda$ on the parameter space $\Omega$, then $\mu_{\Theta \mid X}(\cdot\mid x)$ is also absolutely continuous with respect to $\lambda$ for $\mu_X$-a.e. $x \in \mathcal{X}$. If $f_{\Theta \mid X}$ denotes a version of the Radon-Nikodym derivative $d\mu_{\Theta\mid X}/d\lambda$, then \eqref{1} yields
$$
\begin{aligned}
f_{\Theta \mid X}(\theta \mid x)
&= \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x) \\
&= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) \frac{d \mu_{\Theta}}{d\lambda}(\theta) \\
&= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) f_\Theta(\theta) \\
&= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} \\
&= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t)}.
\end{aligned}
$$
The translation between this new form and \eqref{2} is
$$
\begin{aligned}
p(\theta \mid x) &= f_{\Theta \mid X}(\theta \mid x) = \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x), &&\text{(posterior)}\\
p(\theta) &= f_\Theta(\theta) = \frac{d \mu_\Theta}{d\lambda}(\theta), &&\text{(prior)} \\
p(x \mid \theta) &= f_{X\mid\Theta}(x\mid\theta) = \frac{d P_\theta}{d\nu}(x), &&\text{(likelihood)} \\
p(x) &= \int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t). &&\text{(evidence)}
\end{aligned}
$$
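As a numerical cross-check (my own sketch; the Beta prior and the Bernoulli sample are assumptions chosen for illustration), the density form derived above agrees with the colloquial formula \eqref{2}. Here $\lambda$ is Lebesgue measure on $[0,1]$, $\nu$ is counting measure on $\{0,1\}^n$, and Beta-Bernoulli conjugacy supplies an exact posterior to compare against.

```python
import math

# Hypothetical setup: Beta(a, b) prior and an IID Bernoulli sample.
a, b = 2.0, 3.0
sample = (1, 1, 0, 1, 1)
k, n = sum(sample), len(sample)

def beta_pdf(t, a, b):
    """Density of Beta(a, b) with respect to Lebesgue measure on [0, 1]."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * t ** (a - 1) * (1 - t) ** (b - 1)

def lik(t):
    """f_{X|Theta}(x | t) for the whole sample."""
    return t ** k * (1 - t) ** (n - k)

# Evidence p(x) = integral of f_Theta(t) f_{X|Theta}(x|t) d(lambda)(t),
# approximated by a midpoint Riemann sum:
m = 20_000
grid = [(i + 0.5) / m for i in range(m)]
evidence = sum(beta_pdf(t, a, b) * lik(t) for t in grid) / m

# Posterior density at theta = 0.5 via the colloquial formula (2):
theta = 0.5
post_via_2 = beta_pdf(theta, a, b) * lik(theta) / evidence

# Conjugacy: the exact posterior is Beta(a + k, b + n - k); compare:
post_exact = beta_pdf(theta, a + k, b + n - k)
```

For these particular numbers the evidence is $12\,B(6,4) = 1/42$, and the two posterior values agree up to the quadrature error.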
Best Answer
I think maybe the best way to explain the notion of likelihood is to consider a concrete example. Suppose I have a sample of IID observations drawn from a Bernoulli distribution with unknown probability of success $p$: $X_i \sim {\rm Bernoulli}(p)$, $i = 1, \ldots, n$, so the joint probability mass function of the sample is $$\Pr[{\boldsymbol X} = \boldsymbol x \mid p] = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}.$$ This expression also characterizes the likelihood of $p$, given an observed sample $\boldsymbol x = (x_1, \ldots, x_n)$: $$L(p \mid \boldsymbol x) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}.$$ But if we think of $p$ as a random variable, this likelihood is not a density: $$\int_{p=0}^1 L(p \mid \boldsymbol x) \, dp \ne 1.$$ It is, however, proportional to a probability density, which is why we say it is a likelihood of $p$ being a particular value given the sample--it represents, in some sense, the relative plausibility of $p$ being some value for the observations we made.
For instance, suppose $n = 5$ and the sample was $\boldsymbol x = (1, 1, 0, 1, 1)$. Intuitively we would conclude that $p$ is more likely to be closer to $1$ than to $0$, because we observed more ones. Indeed, we have $$L(p \mid \boldsymbol x) = p^4 (1 - p).$$ If we plot this function on $p \in [0,1]$, we can see how the likelihood confirms our intuition. Of course, we do not know the true value of $p$--it could have been $p = 0.25$ rather than $p = 0.8$, but the likelihood function tells us that the former is much less likely than the latter. But if we want to determine a probability that $p$ lies in a certain interval, we have to normalize the likelihood: since $\int_{p=0}^1 p^4(1-p) \, dp = \frac{1}{30}$, it follows that in order to get a posterior density for $p$, we must multiply by $30$: $$f_p(p \mid \boldsymbol x) = 30p^4(1-p).$$ In fact, this posterior is a beta distribution with parameters $a = 5, b = 2$. Now the areas under the density correspond to probabilities.
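These numbers are easy to verify in a few lines (a sketch of my own, mirroring the example above): the likelihood ratio quantifies how much more plausible $p = 0.8$ is than $p = 0.25$, and the normalizing constant $1/30$ makes $30\,p^4(1-p)$ a genuine density.

```python
def L(p):
    """Likelihood L(p | x) = p^4 (1 - p) for the sample (1, 1, 0, 1, 1)."""
    return p ** 4 * (1 - p)

# Relative plausibility of p = 0.8 versus p = 0.25 (roughly 28-fold):
ratio = L(0.8) / L(0.25)

# Midpoint Riemann sum: the likelihood integrates to about 1/30 over [0, 1],
# so 30 * L(p) integrates to about 1, i.e. it is a probability density.
m = 100_000
grid = [(i + 0.5) / m for i in range(m)]
integral = sum(L(p) for p in grid) / m
posterior_mass = 30 * integral
```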
So, what we have essentially done here is applied Bayes' rule: $$f_{\boldsymbol \Theta}(\boldsymbol \theta \mid \boldsymbol x) = \frac{f_{\boldsymbol X}(\boldsymbol x \mid \boldsymbol \theta) f_{\boldsymbol \Theta}(\boldsymbol \theta)}{f_{\boldsymbol X}(\boldsymbol x)}.$$ Here, $f_{\boldsymbol \Theta}(\boldsymbol \theta)$ is a prior density on the parameter(s) $\boldsymbol \theta$; the numerator is the likelihood $L(\boldsymbol \theta \mid \boldsymbol x) = f_{\boldsymbol X}(\boldsymbol x \mid \boldsymbol \theta)$ multiplied by the prior, which equals the joint density $f_{\boldsymbol X, \boldsymbol \Theta}(\boldsymbol x, \boldsymbol \theta)$ of $(\boldsymbol X, \boldsymbol \Theta)$; and the denominator is the marginal (unconditional) density of $\boldsymbol X$, obtained by integrating the joint density with respect to $\boldsymbol \theta$--it is the normalizing constant that makes the posterior a probability density in the parameter(s). In our numerical example, we implicitly took the prior $f_{\boldsymbol \Theta}$ to be uniform on $[0,1]$. It can be shown that, for a Bernoulli sample, if the prior is ${\rm Beta}(a,b)$, then the posterior is also Beta, but with parameters $a^* = a+\sum x_i$, $b^* = b + n - \sum x_i$. We call such a prior conjugate (and refer to this as a Bernoulli-Beta conjugate pair).
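The conjugate update just described can be packaged as a tiny helper (a sketch; the function name is mine):

```python
def beta_bernoulli_update(a, b, sample):
    """Posterior Beta parameters after observing an IID Bernoulli sample:
    Beta(a, b) prior -> Beta(a + sum(x), b + n - sum(x)) posterior."""
    k, n = sum(sample), len(sample)
    return a + k, b + n - k

# The uniform prior is Beta(1, 1), so the sample (1, 1, 0, 1, 1) yields
# Beta(5, 2), i.e. the normalized likelihood 30 p^4 (1 - p) from above:
updated = beta_bernoulli_update(1, 1, (1, 1, 0, 1, 1))  # (5, 2)
```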