P(D | w) vs P(y | x, w) – likelihood notation

Tags: bayesian, maximum-likelihood, probability, probability-theory, statistics

In certain ML books/lectures/slides/notes on Bayesian inference, you often see the likelihood written as $P(\mathcal{D} \mid w) = \prod_{i=1}^{n} P(y_i \mid x_i, w)$, where $\mathcal{D} = \{(x_1, y_1),~ \ldots~,(x_n, y_n)\}$. It then follows that Bayes' rule is:

$$ P(w | \mathcal{D}) = \frac{P(\mathcal{D} | w) \cdot p(w)}{P(\mathcal{D})} $$
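To make the factorized notation concrete, here is a minimal numerical sketch, assuming a toy Gaussian model $y_i = w x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, 1)$ (the model, data, and `log_likelihood` helper are illustrative choices, not anything from a particular book):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy data: y_i = w * x_i + Gaussian noise, with true w = 2.0.
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(size=20)

def log_likelihood(w, x, y, sigma=1.0):
    """log P(D | w) = sum_i log P(y_i | x_i, w) for the Gaussian model."""
    return norm.logpdf(y, loc=w * x, scale=sigma).sum()

# The likelihood of the whole dataset factorizes over the n points,
# so the log-likelihood is a sum of per-point conditional log-densities.
print(log_likelihood(1.5, x, y), log_likelihood(2.0, x, y))
```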

I'm confused because, if you separate the covariates from the response (inputs from targets), you get the following form of Bayes' rule for a single data point:

$$ P(w \mid x_i, y_i) = \frac{P(y_i \mid x_i, w) \cdot p(w \mid x_i)}{P(y_i \mid x_i)} $$

These two equations don't match up; in particular, $p(w) \neq p(w \mid x_i)$ in general. So what is the justification for the first form, in both a frequentist sense and a Bayesian sense?

I know that Wikipedia, for example, defines the likelihood function as the joint probability of the observed data, which would explain this notation. And when performing MLE, you maximize the probability of the responses/targets jointly with the covariates/inputs, so the notation makes intuitive sense, but what is the theoretical justification?

Edit:
I suppose I'm more confused about why $P(y_i \mid x_i, w)$ yields the same likelihood function as $P(\mathcal{D} \mid w)$.

Best Answer

The likelihood in a parametric model is the joint density (or mass function) of the observed data conditional on any parameters in the model (which are treated as random in a Bayesian context).

So using $f$ to denote a density or mass function, and letting $x$ denote the joint $x_i$ data, $y$ denote the joint $y_i$ data, and $\theta$ denote parameters, I would write the likelihood as

$$f(y,x|\theta)=f(y|x,\theta)f(x|\theta).$$

Note that if the density of $x$ doesn't depend on $\theta$, maximizing the likelihood over the parameter space is equivalent to maximizing $f(y|x,\theta)$, since the density of $x$ only contributes a constant factor.
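A quick way to see this numerically: in a sketch where $f(x)$ is a fixed standard normal not involving $\theta$ (my choice for illustration), adding the constant $\log f(x)$ term shifts the log-likelihood curve but leaves the maximizer unchanged:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(size=50)            # f(x) does not involve theta
y = 2.0 * x + rng.normal(size=50)

thetas = np.linspace(0.0, 4.0, 401)

# log f(y | x, theta): the conditional log-likelihood on a grid.
cond = np.array([norm.logpdf(y, loc=t * x, scale=1.0).sum() for t in thetas])

# log f(y, x | theta) = log f(y | x, theta) + log f(x): the second term
# is constant in theta, so it shifts the curve without moving the argmax.
joint = cond + norm.logpdf(x, loc=0.0, scale=1.0).sum()

assert thetas[np.argmax(cond)] == thetas[np.argmax(joint)]
```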

As far as using Bayes' rule, we have $$f(\theta|y,x)=\frac{f(y,x|\theta)f(\theta)}{f(y,x)}=\frac{f(y|x,\theta)f(\theta|x)f(x)}{f(y,x)},$$

and often in a Bayesian context you will see this written up to proportionality, since the other terms are simply normalizing constants ($\theta$ being the random variable of interest):

$$f(\theta|y,x)\propto f(y,x|\theta)f(\theta)\propto f(y|x,\theta)f(\theta|x).$$
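This proportionality is what makes computation convenient: you evaluate the unnormalized posterior and normalize at the end. A minimal grid sketch, reusing the toy Gaussian model from above and assuming a wide $N(0, 10^2)$ prior on $\theta$ (both illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)

thetas = np.linspace(-1.0, 5.0, 601)

# Unnormalized log posterior: log f(y | x, theta) + log f(theta).
log_post = np.array(
    [norm.logpdf(y, loc=t * x, scale=1.0).sum() for t in thetas]
) + norm.logpdf(thetas, loc=0.0, scale=10.0)

# Normalize numerically -- the dropped factor f(x) / f(y, x) only
# affects this normalization step, not the shape of the posterior.
post = np.exp(log_post - log_post.max())
post /= post.sum() * (thetas[1] - thetas[0])

print(thetas[np.argmax(post)])   # posterior mode, close to the true 2.0
```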

If you are confused at any point, just write out everything in terms of joint and marginal densities or masses to convince yourself of the correct identity. Hopefully that helps.
