Logistic Regression – Comparing MLE vs MAP vs Conditional MLE

logistic, maximum likelihood

We have a set of i.i.d. random variables $(X_i, Y_i), \; i = 1, \ldots, n$.

We believe each pair is distributed as $P(X_i, Y_i | \theta)$, so that
$$
P(X,Y | \theta) = \prod_i P_i(X_i, Y_i | \theta)
$$
Now using Bayes' rule:
$$
P(\theta|X,Y) = \frac{P(X,Y|\theta)P(\theta)}{P(X,Y)} = \frac{P(\theta)\prod_i P_i(X_i, Y_i | \theta)}{P(X,Y)}
$$

As I understand it, MLE, MAP, and conditional MLE all attempt to find the best parameters, $\theta$, given the data: each aims to maximize the left-hand side by maximizing some subset of the terms on the right.

For MLE, we maximize the likelihood term, $\prod_i P_i(X_i, Y_i | \theta)$.

For MAP, we maximize all of the numerator, $P(\theta)\prod_i P_i(X_i, Y_i | \theta)$.

For conditional MLE (as in logistic regression), we have
$$
\frac{P(\theta)\prod_i P_i(X_i, Y_i | \theta)}{P(X,Y)} = \frac{P(\theta) \left( \prod_i P_i(Y_i | X_i, \theta) \right) \left( \prod_i P(X_i|\theta) \right) }{P(X,Y)}
$$

Conditional MLE maximizes only the $\prod_i P_i(Y_i | X_i, \theta)$ term.

Is this correct? I have seen regularized logistic regression described as additionally maximizing the prior, $P(\theta)$. Would also modeling the third factor, $\prod_i P(X_i|\theta)$, be something different entirely?

I understand that logistic regression is a discriminative model. Is that a consequence of maximizing only the conditional term? Would modeling $P(X_i|\theta)$ then give us a generative model?

Thanks for any pointers.

Best Answer

For MLE and MAP you are right. "Conditional MLE" is another way of saying "MLE in a conditional model". Logistic regression is a conditional model in the sense that $\theta$ only controls $P(Y|X)$ (it has no effect on $P(X)$). Therefore the MLE for logistic regression is a conditional MLE. If you have a model in which $\theta$ affects $P(X)$ then it is no longer a conditional model. The MLE in such a model cannot be regarded as a conditional MLE. A discriminative model is the same thing as a conditional model.
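To spell out the step behind this, here is a short sketch in the question's notation (nothing new is assumed beyond taking logs of the factorization the question already wrote down). The joint log-likelihood splits as
$$
\log P(X,Y|\theta) = \sum_i \log P(Y_i | X_i, \theta) + \sum_i \log P(X_i | \theta)
$$
If $\theta$ has no effect on $P(X)$, then $P(X_i|\theta) = P(X_i)$ and the second sum is a constant in $\theta$, so
$$
\arg\max_\theta \log P(X,Y|\theta) = \arg\max_\theta \sum_i \log P(Y_i | X_i, \theta)
$$
In that case the joint MLE and the conditional MLE coincide. Adding $\log P(\theta)$ to the right-hand objective gives the corresponding MAP estimate, which is where regularization enters.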

The Wikipedia pages for generative model and discriminative model do a reasonably good job of explaining this; however, the definitions there do not correctly handle the case when $P(X)$ exists but does not depend on $\theta$. Wikipedia would say that such a model is generative, even though it would behave in every way like a discriminative model. I would regard such a model as discriminative. For more explanation, see "Discriminative models, not discriminative training".
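As a rough numerical illustration of the case where $P(X)$ exists but does not depend on $\theta$, the sketch below compares three objectives on a grid of $\theta$ values. Everything in it is an assumption made for illustration: the synthetic 1-D data, the no-intercept logistic model, the fixed $N(0, 1)$ density for $X$, the $N(0, 0.5^2)$ prior, and the helper names `cond_loglik`, `log_px`, and `log_prior`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data (hypothetical): X ~ N(0, 1), Y | X ~ Bernoulli(sigmoid(2 * X)).
n = 200
x = rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)

def cond_loglik(theta):
    """sum_i log P(Y_i | X_i, theta) for a no-intercept logistic model."""
    z = theta * x
    # log sigmoid(z) = -log(1 + exp(-z)); log(1 - sigmoid(z)) = -log(1 + exp(z))
    return np.sum(y * -np.logaddexp(0.0, -z) + (1.0 - y) * -np.logaddexp(0.0, z))

def log_px():
    """sum_i log P(X_i) under a fixed N(0, 1) density; theta does not appear."""
    return np.sum(-0.5 * x**2 - 0.5 * np.log(2.0 * np.pi))

def log_prior(theta, sigma=0.5):
    """log of an assumed N(0, sigma^2) prior on theta, up to a constant."""
    return -0.5 * (theta / sigma) ** 2

grid = np.linspace(-5.0, 5.0, 2001)
cond = np.array([cond_loglik(t) for t in grid])
joint = cond + log_px()        # joint log-likelihood when P(X) ignores theta
post = cond + log_prior(grid)  # unnormalized log-posterior (MAP objective)

theta_cond = grid[np.argmax(cond)]    # conditional MLE
theta_joint = grid[np.argmax(joint)]  # joint MLE with theta-free P(X)
theta_map = grid[np.argmax(post)]     # MAP estimate

print("conditional MLE:", theta_cond)
print("joint MLE:      ", theta_joint)
print("MAP:            ", theta_map)
```

Because `log_px()` is a constant offset, the joint and conditional objectives should peak at the same grid point, while the prior term pulls the MAP estimate toward zero. A grid search is used here only to keep the comparison transparent; in practice one would use a proper optimizer.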
