Solved – Defining Conditional Likelihood

likelihood, maximum likelihood

Consider a set of $m$ examples $X=\{x^{(1)},x^{(2)},\cdots, x^{(m)}\}$ drawn independently from the true but unknown data-generating distribution $p_{\text{data}}(x)$.

Let $p_{\text{model}}$ be a parametric family of probability distributions over the same space indexed by $\theta$.

The likelihood function is defined as $L(\theta|X) =p_{\text{model}}(x^{(1)}, x^{(2)},\cdots,x^{(m)};\theta)$.

Because of the independence assumption, we can write $L(\theta|X)=\prod_{i=1}^m p_{\text{model}}(x^{(i)};\theta)$.
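(For concreteness, here is a minimal numerical sketch of maximum likelihood under this factorization. It assumes a Gaussian $p_{\text{model}}$ and uses numpy/scipy; none of these specifics are part of the question itself.)

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# m i.i.d. examples from a (made-up) data-generating distribution
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=100)

# Gaussian p_model(x; theta) with theta = (mu, log_sigma).
# Independence turns the log-likelihood into a sum over examples.
def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

# Maximum likelihood estimate: minimize the negative log-likelihood.
theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0])).x
mu_hat, sigma_hat = theta_hat[0], np.exp(theta_hat[1])
print(mu_hat, sigma_hat)  # close to X.mean() and X.std()
```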

Now, suppose we are given $m$ examples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots, (x^{(m)},y^{(m)})\}$ and we want to estimate the conditional probability $P(y|x;\theta)$ using conditional maximum likelihood estimation, so that we can predict $y$ given $x$.

How do I define the conditional likelihood? The problem I'm facing is that we cannot write $L(\theta; y|x)=P(Y|X;\theta)$, since $X$ contains different $x^{(i)}$s. I think it makes sense to 'define' $L(\theta;y|x)=\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)$, although in the unconditional case this followed from the definition of likelihood together with the independence assumption, rather than being a definition. Still, I am not entirely comfortable with this. How does one define conditional likelihood?

Best Answer

Usually one assumes that there is a joint distribution $$p_{\text{data}}(y,x)$$ that specifies not only the marginal distributions of $x$ and $y$ but also their dependency (i.e. if $y_i = f(x_i)$ then we can estimate $f$ by computing the conditional expectation $E[y|X=x]$ with respect to this common distribution, and so on).

Now we do not only assume that the $x_i$ were drawn independently, but rather that the whole tuples $(y_1, x_1), \ldots, (y_n, x_n)$ were drawn independently. Caution: this in no way means that $y_i$ is independent of $x_i$; it only means that $$p(y,x) = \prod_{i=1}^n p(y_i, x_i),$$ and by using marginalization and Fubini's theorem we see that \begin{align*} p(y|x) &= \frac{p(y,x)}{p(x)} = \frac{p(y,x)}{\int p(\hat{y}, x)\, d\hat{y}} \\ &= \frac{\prod_{i=1}^n p(y_i, x_i)}{\int \cdots \int \prod_{i=1}^n p(\hat{y}_i, x_i)\, d\hat{y}_1 \cdots d\hat{y}_n} \\ &= \frac{\prod_{i=1}^n p(y_i,x_i)}{\prod_{i=1}^n \int p(\hat{y}_i, x_i)\, d\hat{y}_i} \\ &= \prod_{i=1}^n p(y_i|x_i). \end{align*}

So you can safely feel comfortable with this; it follows from the basic assumption we always make: the observed data are drawn independently.
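To make this concrete, here is a small sketch of conditional maximum likelihood in action. The logistic model for $P(y|x;\theta)$ is my own choice of example (neither the question nor the derivation fixes a model); the point is only that $\log L(\theta; y|x) = \sum_i \log P(y_i|x_i;\theta)$ is maximized directly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 2))
true_theta = np.array([1.5, -2.0])
y = rng.binomial(1, expit(x @ true_theta))  # each y_i depends on its own x_i

# Conditional log-likelihood: because the tuples (x_i, y_i) are i.i.d.,
# log L(theta; y|x) = sum_i log P(y_i | x_i; theta).
def neg_cond_log_likelihood(theta):
    p = np.clip(expit(x @ theta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

theta_hat = minimize(neg_cond_log_likelihood, x0=np.zeros(2)).x
print(theta_hat)  # roughly recovers true_theta
```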

Edit: note that usually we do not make assumptions about the joint probability $p(y,x)$ or $p(y_i, x_i)$ directly; rather, we assume that $y_i = f(x_i) + \text{'small' error}$ for a single function $f$ and then make assumptions on $f$. For example, in linear regression we assume that $$f(x) = \beta^T x.$$
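As an illustration of that last point: if the 'small' error is taken to be Gaussian, maximizing the factorized conditional likelihood $\prod_{i=1}^n P(y_i|x_i;\beta,\sigma)$ recovers the ordinary least-squares estimate of $\beta$. The sketch below assumes numpy/scipy and a two-dimensional $x$; these specifics are not from the answer itself.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 2))
true_beta = np.array([0.5, 3.0])
y = x @ true_beta + rng.normal(scale=0.3, size=n)  # y_i = beta^T x_i + 'small' error

# Gaussian-noise model: P(y_i | x_i; beta, sigma) = N(y_i; beta^T x_i, sigma^2),
# so the conditional log-likelihood is again a sum over examples.
def neg_cond_log_likelihood(params):
    beta, log_sigma = params[:2], params[2]
    return -np.sum(norm.logpdf(y, loc=x @ beta, scale=np.exp(log_sigma)))

beta_mle = minimize(neg_cond_log_likelihood, x0=np.zeros(3)).x[:2]

# Ordinary least squares gives (numerically) the same beta.
beta_ols, *_ = np.linalg.lstsq(x, y, rcond=None)
print(beta_mle, beta_ols)
```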
