Let's say we have $X_1,\ldots, X_n$ iid Bernoulli($p$) and we want the MLE for $p$. I'm struggling with the second derivative of the log-likelihood function: why is it negative? My second question is: what is the MLE when the maximum is achieved on the boundary of the parameter space, i.e., when $\sum x_i = 0$ or $n$? Looking forward to any feedback and suggestions.
Solved – Maximum Likelihood Estimation for Bernoulli distribution
bernoulli-distribution maximum-likelihood
Related Solutions
So, I'll modify your problem slightly to avoid dealing with boundary issues. Instead of your constraint $\theta > 0$, I'll replace it with $\theta \geq 0$.
You want to maximize the likelihood subject to $\theta \geq 0$.
After taking the logarithm of your likelihood and ignoring constant terms, we get the problem:
$$ \min_{\theta} f(\theta) \text{ s.t. } \theta \geq 0$$
where
$$f(\theta) := \sum_{i=1}^n (X_i - \theta)^2.$$
You are correct that if we didn't have the constraint, we could simply differentiate the objective function and get $\theta^{unconstrained} := \bar{X}$.
However, due to the constraint, we can't just differentiate. So, let us consider the two cases separately:
If $\theta^{unconstrained}$ is positive, then it is also the solution for your constrained MLE problem (the additional constraint can only increase the value of the minimization problem above).
If $\theta^{unconstrainted} < 0$, then it doesn't satisfy your constraint and is not feasible. However, you can check for yourself (a bit of algebra) that $f(\theta) \geq f(0)$ for all $\theta \geq 0$ when $\theta^{constrainted} < 0$. Therefore, $\theta^0=0$ minimizes $f(\theta)$ over $\theta \geq 0$.
So for this problem, the MLE for $\theta$ is $\theta^{ML} = \max{(\bar{X}, 0)}$. And using the equivariance property, the MLE of $\sqrt{\theta}$ is $\sqrt{\theta^{ML}}$.
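If you want to sanity-check this numerically, here is a minimal sketch (my own addition, not part of the original answer). It assumes the squared-error objective above, i.e., a normal-mean model; the sample, seed, and grid are arbitrary choices. It compares a grid minimization of $f(\theta)$ over $\theta \geq 0$ with the closed form $\max(\bar{X}, 0)$:

```python
import numpy as np

# Sketch: compare a grid minimization of f(theta) = sum((X_i - theta)^2)
# over theta >= 0 with the closed-form constrained MLE max(mean(X), 0).
# The data-generating choices below are arbitrary assumptions.

rng = np.random.default_rng(0)
X = rng.normal(loc=-0.3, scale=1.0, size=50)  # sample whose mean may well be negative

def f(theta, X):
    """Objective from the log-likelihood (up to constants): sum of squared deviations."""
    return np.sum((X - theta) ** 2)

theta_grid = np.linspace(0.0, 5.0, 100001)    # feasible set theta >= 0 (truncated for the grid)
grid_argmin = theta_grid[np.argmin([f(t, X) for t in theta_grid])]

theta_ml = max(X.mean(), 0.0)                 # closed-form constrained MLE
print(f"grid minimizer: {grid_argmin:.4f}, closed form max(mean, 0): {theta_ml:.4f}")
print(f"MLE of sqrt(theta) by equivariance: {np.sqrt(theta_ml):.4f}")
```

With a sample mean below zero, both the grid search and the closed form land on $\theta = 0$, which is exactly the boundary case discussed above.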
Considered as a function of the parameter $\theta$ for fixed data $x$, the likelihood evaluates the probability that you would observe the data at a given value of $\theta$, under the assumption that the data are drawn from this particular model.
The likelihood function $L(\theta|x)$ represents the joint probability of observing all of these data under the model $f(x|\theta)$. When you find the value of $\theta$ for which this joint probability is at its global maximum, we reason that this is the parameter value that most likely gave rise to the data. For some simple models, we can derive the MLE in closed form using calculus.
We start by writing the likelihood function $L(\theta|x)=f(x|\theta)$. For the exponential case, $L(\theta|x)=\prod_{i=1}^n \theta\exp(-\theta x_i)$. However, the product operator makes evaluation messy and difficult. Most people work with the log-likelihood instead, because it transforms the product into a sum and because the logarithm is monotone, so it attains its maximum at the same place: $\ln(L(\theta|x))=\sum_{i=1}^n (\ln(\theta)-\theta x_i).$ Now comes the fun part: taking the derivative of the log-likelihood with respect to $\theta$. This is helpful because when the derivative is zero, we know we have found either a maximum or a minimum. $\frac{d \ln(L)}{d \theta}=\sum_{i=1}^n \left(\frac {1}{\theta}- x_i\right)$. Distributing the sum, we find $\frac{d \ln(L)}{d \theta}=\frac{n}{\theta}- \sum_{i=1}^n x_i$. Setting the derivative equal to zero and solving for $\theta$: $0=\frac {n}{\theta}-\sum_{i=1}^n x_i$, which can be rearranged to show $\frac{1}{\hat \theta}=\frac {1}{n}\sum_{i=1}^n x_i$.
To confirm that this is a maximum, we can check the second derivatives using standard results from calculus, or look at a plot of various values of $(\theta, \ln(L))$ in the neighborhood of $\hat \theta$. For this problem, I believe there is only one solution for $\hat \theta$, so we don't have to worry about proving that this is the global maximum.
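As a quick numerical illustration (a minimal sketch of my own, not part of the original answer; the sample size, seed, and optimization bounds are arbitrary assumptions), you can compare the closed-form estimate $\hat\theta = 1/\bar{x}$ with a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: check that the closed-form exponential-rate MLE, theta_hat = 1 / mean(x),
# agrees with a numerical maximizer of the log-likelihood. Data choices are arbitrary.

rng = np.random.default_rng(1)
true_theta = 2.0
x = rng.exponential(scale=1.0 / true_theta, size=200)  # numpy parameterizes by the mean 1/theta

def neg_log_lik(theta, x):
    """Negative exponential log-likelihood: -(n*log(theta) - theta*sum(x))."""
    return -(len(x) * np.log(theta) - theta * np.sum(x))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), args=(x,), method="bounded")
theta_hat = 1.0 / x.mean()
print(f"numerical maximizer: {res.x:.4f}, closed form 1/mean(x): {theta_hat:.4f}")
```

The two values should agree to within the optimizer's tolerance, which is a quick way to convince yourself the calculus above found the maximum.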
I hope this helps. The same procedures should help you with the Bernoulli problem, but I am less familiar with that process so I would not want to speak out of turn. This is only a cursory treatment of the reasoning process of MLE. I would highly recommend Gary King's book Unifying Political Methodology, which contains a very thorough, very accessible explanation of the MLE procedure.
Best Answer
It's often easier to work with the log-likelihood in these situations than with the likelihood. Note that because the logarithm is monotone, the minimum/maximum of the log-likelihood occurs at exactly the same parameter value as the min/max of the likelihood. $$ \begin{align*} L(p) &= \prod_{i=1}^n p^{x_i}(1-p)^{(1-x_i)}\\ \ell(p) &= \log{p}\sum_{i=1}^n x_i + \log{(1-p)}\sum_{i=1}^n (1-x_i)\\ \dfrac{\partial\ell(p)}{\partial p} &= \dfrac{\sum_{i=1}^n x_i}{p} - \dfrac{\sum_{i=1}^n (1-x_i)}{1-p} \overset{\text{set}}{=}0\\ \sum_{i=1}^n x_i - p\sum_{i=1}^n x_i &= p\sum_{i=1}^n (1-x_i)\\ p& = \dfrac{1}{n}\sum_{i=1}^n x_i\\ \dfrac{\partial^2 \ell(p)}{\partial p^2} &= \dfrac{-\sum_{i=1}^n x_i}{p^2} - \dfrac{\sum_{i=1}^n (1-x_i)}{(1-p)^2} \end{align*} $$
The penultimate line gives us the MLE: the value of $p$ that sets the first derivative of the log-likelihood (also called the score function) equal to zero, namely $\hat{p} = \frac{1}{n}\sum_{i=1}^n x_i$.
The last equation gives us the second derivative of the log-likelihood. Since $p\in (0,1)$ and $x_i \in \left\{0,1\right\}$, both terms are non-positive and they cannot both be zero, so the second derivative is strictly negative. This confirms that $\hat{p}$ is a maximum rather than a minimum.
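For completeness, here is a minimal numerical sketch (my own addition, assuming an interior solution $0 < \sum x_i < n$; the sample, seed, and grid are arbitrary choices). It compares $\hat{p} = \bar{x}$ with a grid maximization of the log-likelihood and evaluates the second derivative at $\hat{p}$:

```python
import numpy as np

# Sketch: compare the closed-form Bernoulli MLE p_hat = mean(x) with a grid
# maximization of the log-likelihood, and confirm the second derivative is
# negative at the interior solution. Data-generating choices are arbitrary.

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=100)   # assumes 0 < sum(x) < n here
s, n = x.sum(), len(x)

def log_lik(p, s, n):
    """Bernoulli log-likelihood: s*log(p) + (n - s)*log(1 - p)."""
    return s * np.log(p) + (n - s) * np.log(1 - p)

p_grid = np.linspace(1e-6, 1 - 1e-6, 100001)
grid_argmax = p_grid[np.argmax(log_lik(p_grid, s, n))]

p_hat = s / n
second_deriv = -s / p_hat**2 - (n - s) / (1 - p_hat)**2
print(f"grid maximizer: {grid_argmax:.4f}, closed form mean(x): {p_hat:.4f}")
print(f"second derivative at p_hat: {second_deriv:.2f} (negative, so a maximum)")
```

In the boundary cases you asked about ($\sum x_i = 0$ or $n$), the derivative argument above no longer applies and $\hat{p}$ sits at $0$ or $1$, so the grid comparison would be the more informative check there.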