Solved – Conditional Maximum Likelihood – How is the marginal probability of the inputs independent of the parameter we are estimating?

conditional probability, maximum likelihood

I am reading "A Primer in Econometric Theory" by John Stachurski, specifically the part on conditional maximum likelihood. There I have seen the same kind of maximization I have seen before in other sources: in order to estimate the parameter of a distribution, the author uses conditional maximum likelihood and does not take the marginal density of the inputs into account when maximizing the objective function. I have seen this done before, but here he openly says the marginal density is independent of the parameter we are estimating. Let me explain what I am asking:

Suppose we have inputs and outputs in our model, $x$ being the input and $y$ being the output: $x_1, x_2, \dots, x_N$ and $y_1, y_2, \dots, y_N$. They come in pairs $(x_n, y_n)$. Each input $x_n$ may be a vector, but $y_n$ is just a scalar. Our aim is to estimate $\theta$ in $p(y|x;\theta)$ in order to pin down the conditional density of $y$ given $x$.

So we maximize the following log likelihood:

$l(\theta) = \sum_{n=1}^{N}\ln p(x_n,y_n;\theta)$, where $p$ is the joint density of $(x_n,y_n)$.

Letting $\pi$ be the marginal density of $x$, we can decompose the joint density as:

$p(x,y;\theta) = p(y|x;\theta)\pi(x)$.

Here the author says the following:

"The density $\pi(x)$ is unknown but we have not parameterized it because we aren't trying to estimate it. We can now rewrite the log likelihood as

$l(\theta) = \sum_{n=1}^{N}\ln p(y_n|x_n;\theta) + \sum_{n=1}^{N}\ln \pi(x_n)$

The second term on the right-hand side is independent of $\theta$ and as such it does not affect the maximizer"

And he goes on to maximize just the first part, the conditional log likelihood.
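To see the book's point numerically, here is a minimal sketch with a simulated Gaussian linear model (my own illustrative setup, not an example from the book). It takes the claim at face value by treating the marginal of $x$ as fixed: adding the $\sum_{n}\ln\pi(x_n)$ term only shifts the objective by a constant, so both objectives return the same maximizer.

```python
import numpy as np
from scipy import stats, optimize

# Illustrative model (assumed, not from the book):
#   x ~ N(0, 1),  y | x ~ N(a + b*x, sigma^2),  theta = (a, b, sigma)
rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=N)

def conditional_loglik(params):
    # First term: sum_n ln p(y_n | x_n; theta)
    a, b, log_sigma = params
    return stats.norm.logpdf(y, loc=a + b * x, scale=np.exp(log_sigma)).sum()

def full_loglik(params):
    # Adds sum_n ln pi(x_n), with pi the standard normal marginal of x;
    # this term involves no component of theta.
    return conditional_loglik(params) + stats.norm.logpdf(x).sum()

start = np.zeros(3)
fit_cond = optimize.minimize(lambda p: -conditional_loglik(p), start)
fit_full = optimize.minimize(lambda p: -full_loglik(p), start)

print(fit_cond.x)  # the two maximizers agree: the pi(x) term is a
print(fit_full.x)  # constant shift of the log likelihood
```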

Here is my question: Is not $\pi(x)$ dependent on $\theta$ somehow? My thinking is convoluted, but both $p(y|x;\theta)$ and $\pi(x)$ are derived from the same underlying joint density: $p(x, y;\theta)$. $\pi(x)$ is just shorthand for

$\pi(x) = \int p(x,y;\theta)\,dy$, which is clearly dependent on $\theta$. Once you plug this into your maximization problem, it becomes:

$l(\theta) = \sum_{n=1}^{N}\ln p(y_n|x_n;\theta) + \sum_{n=1}^{N}\ln\int p(x_n,y;\theta)\,dy$

Now the second term is also dependent on $\theta$ and needs to be taken into account when maximizing with respect to $\theta$. Am I missing something here?

Best Answer

You don't specify exactly what model you are estimating, but I will assume that it is a classical linear regression model, as is standard for introductory econometrics explanations. This assumption matters for what follows.

Is not $\pi(x)$ dependent on $\theta$ somehow? My thinking is convoluted but both $p(y|x;\theta)$ and $\pi(x)$ are derived from the same underlying joint density: $p(x, y;\theta)$.

Your question gets to the heart of what exactly you are estimating, and why. A classical linear regression model only cares about the conditional density $p(y|x)$ because it is a discriminative model, not a generative one: it makes no attempt to model $\pi(x)$, so by construction $\pi$ carries none of the parameters in $\theta$. This is a fundamental feature of the model itself.

You're right in saying that the marginal $\pi(x)$ could also be parameterised, and some other references explicitly do this. For example, Hayashi (pp. 47–48) does something like the following: $p(y,x;\xi)=p(y|x;\theta)\pi(x;\psi)$; i.e., he distinguishes between the parameters of the marginal, $\psi$, and those of the conditional likelihood, $\theta$ (which together comprise the entire parameter set $\xi=(\theta,\psi)$). But that doesn't change the fact that the model only cares about the conditional likelihood $p(y|x;\theta)$, which is a function of the parameters $\theta$ alone.
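This factorisation also dissolves the integral in your question. Substituting $p(x,y;\theta)=p(y|x;\theta)\pi(x)$ and using the fact that a conditional density integrates to one over $y$:

$\int p(x,y;\theta)\,dy = \pi(x)\int p(y|x;\theta)\,dy = \pi(x)$

So the marginal you recover from the joint is free of $\theta$ (in Hayashi's notation it depends on $\psi$ only), and the second sum in your objective really is constant in $\theta$.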
