Solved – Why maximum likelihood estimation uses the product of pdfs rather than cdfs

cumulative distribution function, density function, maximum likelihood

I'm learning logistic regression and got confused when I saw the equation in the textbook. I knew that for a continuous distribution, the pdf $f(x)$ evaluated at a single point is not a probability; to get a probability one should use the cumulative distribution function $F(x)$. Thus, since we're maximizing a probability, shouldn't we use the product of cdfs rather than pdfs on the right side of the MLE equation? Thank you!

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i \mid \theta)$$
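As a concrete check of this equation, here is a minimal sketch, assuming a hypothetical Normal($\mu$, 1) model and simulated data (both are illustrative choices, not from the textbook): maximizing the product of pdfs, done in practice as a sum of log-pdfs, recovers a sensible estimate.

```python
# Minimal MLE sketch: maximize the sum of log-pdfs (the log of the
# product of pdfs) over a grid of candidate means.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)  # hypothetical sample

def log_likelihood(mu):
    # sum of log f(x_i | mu): the log of the product of pdfs
    return norm.logpdf(x, loc=mu, scale=1.0).sum()

grid = np.linspace(-1, 5, 601)
mu_hat = grid[np.argmax([log_likelihood(m) for m in grid])]
print(mu_hat)    # close to 2.0, and to the analytic MLE below
print(x.mean())  # for this model the MLE is the sample mean
```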

UPDATE, and further questions:

This question brings up an interesting point about why we don't often use the probability integral transform, i.e. the fact that $Y=F(X)\sim U(0,1)$, and then try to minimise the KL divergence between $Y$ and $U$:

$$\text{KL}(Y \,\|\, U) = \int_0^1 f_Y(y) \ln f_Y(y) \,\text{d}y$$

Typically we have easy access to the form of $f$ (the original pdf), but $F$ might be less tractable, and $f_Y$ is basically something we would need to estimate, e.g. using empirical CDFs based on the samples $F(X_i),\ i=1,\dots,n$. The question is: are the two formulations (the usual MLE and the KL version above) very different in their results?
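For what it's worth, here is a rough numeric sketch of that comparison, assuming a Normal($\mu$, 1) model, simulated data, and a histogram plug-in estimate of $f_Y$ (all illustrative choices, not a recommended estimator):

```python
# Transform the data by the model CDF, Y_i = F(X_i | mu), estimate
# KL(Y || U) = \int f_Y ln f_Y dy with a histogram, and compare the
# minimizer with the usual MLE.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=2000)  # hypothetical sample

def kl_to_uniform(mu, bins=20):
    y = norm.cdf(x, loc=mu, scale=1.0)  # PIT values in (0, 1)
    p, _ = np.histogram(y, bins=bins, range=(0, 1), density=True)
    p = p[p > 0]                        # avoid log(0)
    width = 1.0 / bins
    # plug-in estimate of \int f_Y ln f_Y dy
    return np.sum(p * np.log(p)) * width

grid = np.linspace(0, 4, 401)
mu_kl = grid[np.argmin([kl_to_uniform(m) for m in grid])]
mu_mle = x.mean()        # analytic MLE for this model
print(mu_kl, mu_mle)     # typically close, though not identical
```

In this toy setting the two estimates usually land near each other, but the KL version inherits the bias and tuning choices (bin count, sample size) of the density estimate for $f_Y$.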

Best Answer

How can a CDF be used to rank two possible parametrizations of a model? A CDF value is a cumulative probability: it can only tell us the probability of obtaining the observed result or a smaller value under a given probability model. If we chose $\theta$ to predict the smallest possible outcomes, the CDF would be nearly 1 at every observation, and this parametrization would be the most "likely" in the sense of: "yup, if the mean height were truly -99, I am very confident that repeating my sample would produce values smaller than the ones I observed".
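A small numeric sketch of this degeneracy, under an assumed Normal($\mu$, 1) model with simulated data: the "CDF score" $\sum_i \log F(x_i \mid \mu)$ only increases as $\mu$ is pushed downward, so it cannot select a sensible parameter.

```python
# As mu -> -infinity, F(x_i | mu) -> 1 at every observation, so the
# sum of log-CDFs climbs toward 0 instead of peaking at a sensible mu.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=100)  # hypothetical sample

for mu in [2.0, 0.0, -10.0, -99.0]:
    cdf_score = norm.logcdf(x, loc=mu, scale=1.0).sum()
    print(mu, cdf_score)  # increases toward 0 as mu decreases
```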

We could instead balance the left cumulative probability against the right cumulative probability. Consider, for example, the condition that a median-unbiased estimate satisfies:

$$P(X < \theta) = P(X > \theta)$$

Here the best value of $\theta$ is the one for which $X$ is equally likely to be greater or less than its predicted value (assuming $\theta$ is a mean here). But that certainly doesn't correspond to our idea of ranking alternative parametrizations as more or less likely for a particular sample.
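To see that this balance condition pins down a single point rather than scoring a sample, here is a quick sketch under an assumed Normal($\mu$, 1) model; the $\theta$ solving the condition is just the model median:

```python
# Solve F(theta | mu) - 0.5 = 0; for a Normal the root is theta = mu,
# the model median, independent of any observed data.
from scipy.stats import norm
from scipy.optimize import brentq

mu = 2.0  # hypothetical model mean
theta = brentq(lambda t: norm.cdf(t, loc=mu, scale=1.0) - 0.5, -10, 10)
print(theta)  # 2.0: the median, with no way to rank parametrizations
```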

Perhaps, on the other hand, you wanted to be sure the model puts high probability in a small interval around the observed value $x$, that is, to maximize this probability per unit length:

$$P(x - d < X < x + d)/(2d) = \left(F(x+d) - F(x-d)\right)/(2d)$$

But how big should $d$ be? Well, if $d$ is taken to be arbitrarily small:

$$\lim_{d \rightarrow 0} \left(F(x+d) - F(x-d)\right)/(2d) = f(x)$$

And you get the density. It is this instantaneous probability per unit length that best characterizes the likelihood of a specific observation under a parametrization.
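A quick numeric illustration of this limit, for an assumed standard Normal at the hypothetical observation $x = 1$:

```python
# The symmetric difference quotient of the CDF converges to the pdf.
from scipy.stats import norm

x = 1.0
for d in [1.0, 0.1, 0.01, 0.001]:
    approx = (norm.cdf(x + d) - norm.cdf(x - d)) / (2 * d)
    print(d, approx, norm.pdf(x))  # approx -> f(x) ~ 0.24197
```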
