I am reading "A Primer in Econometric Theory" by John Stachurski, specifically the part on Conditional Maximum Likelihood. There I have seen the same kind of maximization I have seen in other sources too: in order to estimate the parameter of a distribution, the author uses conditional maximum likelihood and does not take the marginal density of the inputs into account when maximizing the objective function. I have seen this done before, but here the author openly says the marginal density is independent of the parameter we are estimating. Let me explain what I am asking:
Suppose we have inputs and outputs in our model, $x$ being the input and $y$ being the output: $x_1, x_2$…$x_N$ and $y_1, y_2$…$y_N$. They come in pairs. Each observation $x_i$ is probably a vector but $y_i$ is just a scalar. Our pair is $(x_i, y_i)$. Our aim is to estimate $\theta$ in $p(y|x;\theta)$ in order to pin down the conditional density of $y$ given $x$.
So we maximize the following log likelihood:
$l(\theta) = \sum_{n=1}^{N}\text{ln}\,p(x_n,y_n;\theta)$ where $p$ is the joint density of $(x_n,y_n)$.
Letting $\pi$ be the marginal density of $x$, we can decompose the joint density as:
$p(x,y;\theta) = p(y|x;\theta)\pi(x)$.
Here the author says the following:
"The density $\pi(x)$ is unknown but we have not parameterized it because we aren't trying to estimate it. We can now rewrite the log likelihood as
$l(\theta) = \sum_{n=1}^{N}\text{ln}\,p(y_n|x_n;\theta) + \sum_{n=1}^{N}\text{ln}\,\pi(x_n)$
The second term on the right-hand side is independent of $\theta$ and as such it does not affect the maximizer"
And he goes on just maximizing the first part, the conditional probability.
Here is my question: Is not $\pi(x)$ dependent on $\theta$ somehow? My thinking is convoluted but both $p(y|x;\theta)$ and $\pi(x)$ are derived from the same underlying joint density: $p(x, y;\theta)$. $\pi(x)$ is just a short hand for
$\pi(x) = \int p(x,y;\theta)dy$ which is clearly dependent on $\theta$. Once you plug this into your maximization problem, it becomes:
$l(\theta) = \sum_{n=1}^{N}\text{ln}\,p(y_n|x_n;\theta) + \sum_{n=1}^{N}\text{ln}\,\int p(x_n,y;\theta)dy$
Now the second term also depends on $\theta$ and needs to be taken into account when maximizing with respect to $\theta$. Am I missing something here?
Best Answer
You don't specify exactly which model you are estimating, but I will assume it is a classical linear regression model, as is standard in introductory econometrics explanations. This is an important point.
Your question gets to the heart of what exactly you are estimating, and why. A classical linear regression model only cares about the conditional probability $p(y|x)$ because it is a discriminative model. It is not a generative model and so it doesn't care about $\pi(x)$. This is a fundamental feature of the model itself.
You're right in saying that the prior $\pi(x)$ could also be parameterised, and some other references explicitly do this. For example, Hayashi pp. 47-48 does something like the following: $p(y,x;\xi)=p(y|x,\theta)\pi(x;\psi)$; i.e. it distinguishes between parameters for the prior $\psi$ and for the conditional likelihood $\theta$ (which together comprise the entire parameter set $\xi=\theta \cup \psi$). But that doesn't change the fact that the model only cares about the conditional likelihood function $p(y|x,\theta)$, which is a function only of the parameters $\theta$.
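The separation is easy to check numerically. Below is a minimal sketch (my own illustration, not from Stachurski or Hayashi), assuming a linear-Gaussian model $y = \theta x + \varepsilon$ with $\varepsilon \sim N(0,1)$ and a marginal $x \sim N(0,1)$ that involves no $\theta$. Because $\sum_n \text{ln}\,\pi(x_n)$ is then the same number for every candidate $\theta$, the full and conditional log likelihoods differ by a constant and peak at the same point:

```python
# Minimal sketch: when pi(x) has no theta in it, adding sum_n ln pi(x_n)
# to the conditional log likelihood only shifts the objective by a
# constant and cannot move the maximizer. Model and values are
# illustrative assumptions, not from the book.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
n = 500
x = rng.normal(size=n)                     # draws from pi(x) = N(0, 1)
y = theta_true * x + rng.normal(size=n)    # draws from p(y | x; theta_true)

def norm_logpdf(z, mean, sd=1.0):
    """Log density of N(mean, sd^2) evaluated at z."""
    return -0.5 * ((z - mean) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

thetas = np.linspace(0.0, 4.0, 401)

# First term: sum_n ln p(y_n | x_n; theta), one value per candidate theta
cond_ll = np.array([norm_logpdf(y, t * x).sum() for t in thetas])

# Second term: sum_n ln pi(x_n) -- the same number for every theta
marg_ll = norm_logpdf(x, 0.0).sum()

full_ll = cond_ll + marg_ll    # shifted by a constant, so...

# ...both objectives peak at exactly the same theta
theta_hat_cond = thetas[np.argmax(cond_ll)]
theta_hat_full = thetas[np.argmax(full_ll)]
print(theta_hat_cond, theta_hat_full)
```

If instead $\pi$ were parameterised as $\pi(x;\theta)$ with the *same* $\theta$, the second term would no longer be constant and the argmax could move; that is exactly the variation-free assumption Hayashi makes explicit.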