Solved – Expected value of maximum likelihood coin parameter estimate

expected value, maximum likelihood, probability, self-study, variance

Suppose I have a coin toss experiment in which I want to calculate the maximum likelihood estimate of the coin parameter $p$ when tossing the coin $n$ times. After setting the derivative of the binomial likelihood function $ L(p) = { n \choose x } p^x (1-p)^{n-x} $ equal to zero, I get the optimal value for $p$ to be $p^{*} = \frac{x}{n}$, with $x$ being the number of successes.
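For concreteness, a quick numerical check (a minimal sketch using SciPy, with made-up values for $n$ and $x$) confirms that the maximizer of the likelihood is indeed $x/n$:

```python
# Hypothetical example values: 37 successes in 100 tosses.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, x = 100, 37

# Negative binomial log-likelihood as a function of p
neg_log_lik = lambda p: -binom.logpmf(x, n, p)

res = minimize_scalar(neg_log_lik, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x)   # numerical maximizer, approximately 0.37
print(x / n)   # closed-form estimate p* = x/n
```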

My questions now are:

  • How would I calculate the expected value/variance of this maximum likelihood estimate for $p$?
  • Do I need to calculate the expected value/variance for $L(p^{*})$?
  • If yes, how would I do that?

Best Answer

First of all, this is a self-study question, so I'm not going to go too much into each and every little technical detail, nor am I going on a derivation frenzy. There are many ways to do this; I'll help you by using general properties of the maximum likelihood estimator.

Background information

In order to solve your problem I think you need to study maximum likelihood from the beginning. You are probably using some kind of textbook, and the answer should really be in there somewhere. I'll help you figure out what to look for.

The maximum likelihood estimator is basically what we call an M-estimator (think of the "M" as "maximize/minimize"). If the conditions required for using these methods are satisfied, we can show that the parameter estimates are consistent and asymptotically normally distributed, so we have:

$$ \sqrt{N}(\hat\theta-\theta_0)\overset{d}{\to}\text{Normal}(0,A_0^{-1}B_0A_0^{-1}), $$

where $A_0$ and $B_0$ are some matrices. When using maximum likelihood we can show that $A_0=B_0$, and thus we have the simpler expression
$$
\sqrt{N}(\hat\theta-\theta_0)\overset{d}{\to}\text{Normal}(0,A_0^{-1}).
$$
We have that $A_0\equiv -E(H(\theta_0))$, where $H$ denotes the Hessian. This is what you need to estimate in order to get your variance.
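To make the asymptotic statement concrete, here is a small simulation sketch (my addition, not part of the original answer; the true $p_0$, the number of tosses $n$, and the number of replications are made-up values). It checks that the MLE $\hat p = x/n$ is centred on the true parameter and that its spread matches what the theory predicts:

```python
# Illustrative simulation (assumed values p0 = 0.3, n = 200, 10,000 replications).
# Each replication is one coin-toss experiment; we compute the MLE x/n in each.
import numpy as np

rng = np.random.default_rng(0)
p0, n, reps = 0.3, 200, 10_000

x = rng.binomial(n, p0, size=reps)   # number of successes per experiment
p_hat = x / n                        # MLE in each replication

print(p_hat.mean())  # close to p0: the estimator is consistent
print(p_hat.var())   # compare with the asymptotic variance derived in the next section
```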

Your specific problem

So how do we do it? Here, let's call our parameter (what was $\theta$ above) the same thing you do: $p$. It is just a scalar, so the "score" is simply the first derivative and the "Hessian" is simply the second derivative. Dropping the binomial coefficient, which does not depend on $p$, our likelihood function can be written as
$$
l(p)=p^x (1-p)^{n-x},
$$
which is what we want to maximize. You used the first derivative of this (or of the log-likelihood) to find your $p^*$. Beyond setting the first derivative equal to zero, we can differentiate once more to find the second-order derivative $H(p)$.

First we take logs:
$$
ll(p)\equiv\log(l(p))=x\log(p)+(n-x)\log(1-p).
$$
Then our 'score' is
$$
ll'(p)=\frac{x}{p}-\frac{n-x}{1-p},
$$
and our 'Hessian' is
$$
H(p)=ll''(p)=-\frac{x}{p^2}-\frac{n-x}{(1-p)^2}.
$$

The general theory from above then tells you to find $(-E(H(p)))^{-1}$. So take the expectation of $H(p)$ (hint: use $E(x/n)=p$, i.e. $E(x)=np$), multiply by $-1$ and take the inverse. That gives you the variance of the estimator.
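If you want to check that last step mechanically rather than by hand, here is a minimal symbolic sketch (my addition, using SymPy; not part of the original answer). It builds the log-likelihood, takes the second derivative, substitutes $E(x)=np$, and inverts the negative of the result:

```python
# Symbolic check of the final step: variance = (-E(H(p)))^{-1}.
import sympy as sp

p, n, x = sp.symbols("p n x", positive=True)

ll = x * sp.log(p) + (n - x) * sp.log(1 - p)  # log-likelihood ll(p)
H = sp.diff(ll, p, 2)                         # 'Hessian' ll''(p)

EH = H.subs(x, n * p)                         # expectation: replace x by E(x) = n*p
variance = sp.simplify(-1 / EH)               # (-E(H(p)))^{-1}

print(variance)  # the asymptotic variance of the MLE p* = x/n
```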
