Maximum Likelihood Estimation – Asymptotic Distribution for MLE Estimator of p(1-p) in Binary Variables

maximum-likelihood, self-study

Let $X_i$ be a binary random variable where $P(X_i = 1) = p$.
I want to find the MLE estimator for $\theta = p(1-p)$.

The likelihood function should be
$$
L(\theta) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^{\sum_{i=1}^nx_i}(1-p)^{n- \sum_{i=1}^nx_i},
$$

How should I proceed?
It seems like I cannot just take the derivative with respect to $\theta = p(1-p)$.

Edit: You use the plug-in estimator, so you find $\hat{p}_{\text{MLE}}$ first and then set
$\hat{\theta} = \hat{p}(1-\hat{p})$.
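
For concreteness, a minimal sketch of the plug-in computation (the seed, sample size, and true value $p = 0.3$ are made-up choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)    # made-up Bernoulli(p = 0.3) sample

p_hat = x.mean()                       # MLE of p: the sample proportion
theta_hat = p_hat * (1 - p_hat)        # plug-in estimate of theta = p(1-p)
print(p_hat, theta_hat)
```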

Now suppose I want to find the asymptotic distribution of $\hat{\theta}$. How do I compute the information matrix?

Best Answer

There is no MLE of $\theta$.

$$L(\theta) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}$$

This expression of the likelihood function is incorrect: the left-hand side expresses the likelihood as a function of $\theta$, but the right-hand side is a function of $p$.

You cannot fix this either, because $\theta$ is a non-injective function of $p$ and cannot be inverted. Given $\theta$, there are two possible values of $p$,

$$p=\frac{1}{2}\pm\sqrt{\frac{1}{4}-\theta}$$

so you cannot compute the probability of the data given $\theta$ unless you introduce a second parameter.
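
A small sketch of why this matters (made-up data and an arbitrary candidate $\theta = 0.16$, whose two roots are $p = 0.8$ and $p = 0.2$): both roots satisfy $p(1-p) = \theta$, yet they assign different likelihoods to the same data, so a likelihood of "$\theta$ alone" is not well defined.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.8, size=20)      # made-up data, generated with p = 0.8

theta = 0.16                           # candidate value of theta = p(1-p)
roots = 0.5 + np.array([1.0, -1.0]) * np.sqrt(0.25 - theta)   # p = 0.8 and p = 0.2

for p in roots:
    loglik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    print(p, loglik)                   # same theta, different likelihoods
```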


> How do I compute the information matrix?

If you had a valid MLE, then you could start with the Fisher information of the underlying parameter and rescale it by the square of the derivative of the transformation (the delta method).
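
For this particular problem that scaling rule would read (a restatement of the first-order delta method, using the per-observation Bernoulli Fisher information $I_1(p) = \frac{1}{p(1-p)}$):

$$\operatorname{Var}(\hat\theta) \approx \left(\frac{\mathrm{d}\theta}{\mathrm{d}p}\right)^2 \cdot \frac{1}{n\,I_1(p)} = (1-2p)^2\,\frac{p(1-p)}{n}$$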

However, note that the transformed estimator can be biased, and the Fisher information alone is not an indication of the asymptotic variance. See this example: Why the variance of Maximum Likelihood Estimator (MLE) will be less than Cramér-Rao Lower Bound (CRLB)?


The special case of finding the variance of the distribution of the statistic $\hat\theta = \hat{p}(1-\hat{p}) = \hat{p}-\hat{p}^2$ can be done more directly by computing $$\begin{aligned} \operatorname{Var}(\hat\theta) &= E[(\hat{p}-\hat{p}^2)^2] - E[\hat{p}-\hat{p}^2]^2 \\ &= E[\hat{p}^4-2\hat{p}^3+\hat{p}^2] - E[\hat{p}-\hat{p}^2]^2\\ &= E[\hat{p}^4]-2E[\hat{p}^3]+E[\hat{p}^2] - (E[\hat{p}]-E[\hat{p}^2])^2 \end{aligned}$$

which can be expressed in terms of the raw moments of $\hat{p}$ (a binomially distributed count scaled by $1/n$). Up to terms that do not affect the leading $1/n$ behaviour of the variance, these are

$$\begin{aligned} E[\hat{p}] &= p \\ E[\hat{p}^2] &= p^2 + \frac{p(1-p)}{n} \\ E[\hat{p}^3] &\approx p^3 + 3 \frac{p^2(1-p)}{n} \\ E[\hat{p}^4] &\approx p^4 + 6 \frac{p^3(1-p)}{n} + 3 \frac{p^2(1-p)^2}{n^2} \end{aligned}$$

and, to this order, the variance is

$$\operatorname{Var}(\hat\theta) \approx \frac{2p^4-4p^3+2p^2}{n^2} + \frac{-4p^4+8p^3-5p^2+p}{n} = \frac{2p^2(1-p)^2}{n^2} + \frac{p(1-p)(1-2p)^2}{n}$$

where the second term dominates for large $n$ and is the same result as obtained using the Fisher information

$$p(1-p)/n \cdot \left(\frac{\text{d}\theta}{\text{d}p}\right)^2 = \frac{p(1-p)(1-2p)^2}{n}$$
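
A quick Monte Carlo sanity check of that leading term (a minimal sketch; the settings $n = 200$, $p = 0.3$, and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 200, 0.3, 100_000          # arbitrary settings with p away from 0.5

p_hat = rng.binomial(n, p, size=reps) / n
theta_hat = p_hat * (1 - p_hat)

print(theta_hat.var())                            # Monte Carlo variance of theta_hat
print(p * (1 - p) * (1 - 2 * p) ** 2 / n)         # leading 1/n term from the Fisher information
```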

In the case of $p=0.5$ this would lead to zero variance (or, equivalently, an infinite value in the information matrix for $\theta$). In that case you can still use the delta method with a second-order derivative, as demonstrated in this question: Implicit hypothesis testing: mean greater than variance and Delta Method
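
A minimal sketch of that degenerate case (assuming $p = 0.5$; the sample size and replication count are arbitrary). With $p = 0.5$ we have exactly $\hat\theta - \tfrac14 = -(\hat p - \tfrac12)^2$, so the second-order delta method suggests $n(\tfrac14 - \hat\theta)$ behaves like $\tfrac14 \chi^2_1$ rather than a normal distribution with vanishing variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 100_000                  # arbitrary settings with p = 0.5
p_hat = rng.binomial(n, 0.5, size=reps) / n
theta_hat = p_hat * (1 - p_hat)

# theta_hat - 1/4 = -(p_hat - 1/2)^2, so n * (1/4 - theta_hat) should look like (1/4) * chi2(1)
scaled = n * (0.25 - theta_hat)
print(scaled.mean())                    # should be close to 1/4
print(scaled.var())                     # should be close to 2 * (1/4)^2 = 0.125
```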