Maximum Likelihood – Taylor Series Expansion of Maximum Likelihood Estimator, Newton-Raphson, Fisher Scoring, and MLE Distribution by Delta Method

distributions, maximum likelihood, statistical significance

Assume $\ell\left(\theta\right)$ is the log-likelihood of the parameter vector $\theta$ and $\widehat{\theta}$ is the maximum likelihood estimator of $\theta$. Then the second-order Taylor series of $\ell\left(\theta\right)$ about $\widehat{\theta}$ is
\begin{align*}
\ell\left(\theta\right) & \approxeq\ell\left(\widehat{\theta}\right)+\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}\left(\theta-\widehat{\theta}\right)+\frac{1}{2}\left(\theta-\widehat{\theta}\right)^{\prime}\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\left(\theta-\widehat{\theta}\right)\\
\frac{\partial\ell\left(\theta\right)}{\partial\theta} & \approxeq\mathbf{0}+\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}+\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\left(\theta-\widehat{\theta}\right)\quad\overset{\textrm{set}}{=}\quad\mathbf{0}\\
\\
\theta-\widehat{\theta} & =-\left[\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\right]^{-1}\left[\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}\right]\\
\widehat{\theta}-\theta & =\left[\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\right]^{-1}\left[\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}\right]\\
\widehat{\theta}-\theta & =\left[\mathbb{H}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]^{-1}\left[\mathbb{S}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]
\end{align*}

Rearranging gives

$$\theta=\widehat{\theta}-\left[\mathbb{H}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]^{-1}\left[\mathbb{S}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right],$$

which suggests the iterative updating schemes

\begin{align*}
\theta^{\left(m+1\right)} & =\theta^{\left(m\right)}-\left[\mathbb{H}\left(\theta^{\left(m\right)}\right)\right]^{-1}\mathbb{S}\left(\theta^{\left(m\right)}\right)\quad\quad\left(\textrm{Newton-Raphson}\right)\\
\theta^{\left(m+1\right)} & =\theta^{\left(m\right)}-\left[\mathbb{E}\left\{ \mathbb{H}\left(\theta^{\left(m\right)}\right)\right\} \right]^{-1}\mathbb{S}\left(\theta^{\left(m\right)}\right)\\
\theta^{\left(m+1\right)} & =\theta^{\left(m\right)}+\left[\mathbb{I}\left(\theta^{\left(m\right)}\right)\right]^{-1}\mathbb{S}\left(\theta^{\left(m\right)}\right)\quad\quad\left(\textrm{Fisher Scoring}\right)
\end{align*}

where $\mathbb{I}\left(\theta\right)=-\mathbb{E}\left\{ \mathbb{H}\left(\theta\right)\right\}$ is the Fisher information.
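
Purely as an illustration (not part of the original derivation), the two update rules translate directly into code. The sketch below is a minimal NumPy version assuming user-supplied `score`, `hessian`, and `fisher_info` callables; all names are hypothetical.

```python
import numpy as np

def newton_raphson_step(theta, score, hessian):
    """One Newton-Raphson update: theta - H(theta)^{-1} S(theta)."""
    H = np.atleast_2d(hessian(theta))      # observed Hessian of the log-likelihood
    S = np.atleast_1d(score(theta))        # gradient (score) of the log-likelihood
    return theta - np.linalg.solve(H, S)

def fisher_scoring_step(theta, score, fisher_info):
    """One Fisher scoring update: theta + I(theta)^{-1} S(theta), with I = -E[H]."""
    I = np.atleast_2d(fisher_info(theta))  # expected (Fisher) information
    S = np.atleast_1d(score(theta))
    return theta + np.linalg.solve(I, S)
```

Fisher scoring differs only in replacing the observed Hessian $\mathbb{H}$ by its expectation (equivalently, using $\mathbb{I}=-\mathbb{E}\{\mathbb{H}\}$), which is often numerically more stable when the observed Hessian is badly behaved far from the optimum.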

Questions

  1. I am not sure whether my derivation is correct; I have seen at least two different versions of it.
  2. What would be the mean, variance, and distribution of $\widehat{\theta}$ (perhaps via the Delta method)?

Best Answer

I will denote by $\hat \theta$ the maximum likelihood estimator, while $\theta^{\left(m+1\right)}$ and $\theta^{\left(m\right)}$ are any two vectors. $\theta_0$ will denote the true value of the parameter vector. I am suppressing the appearance of the data.

The (untruncated) 2nd-order Taylor expansion of the log-likelihood viewed as a function of $\theta^{\left(m+1\right)}$, $\ell\left(\theta^{\left(m+1\right)}\right)$, centered at $\theta^{\left(m\right)}$ is (in a bit more compact notation than the one used by the OP)

$$\begin{align} \ell\left(\theta^{\left(m+1\right)}\right) &= \ell\left(\theta^{\left(m\right)}\right)+\frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\ &\quad+\frac{1}{2}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)^{\prime}\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\ &\quad+R_2\left(\theta^{\left(m+1\right)}\right) \end{align}$$ The derivative of the log-likelihood is (using the properties of matrix differentiation)

$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right) = \frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta} +\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right) +\frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right) $$

Suppose we require that $$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\mathbf 0$$

Then we obtain $$\theta^{\left(m+1\right)}=\theta^{\left(m\right)}-\left[\mathbb{H}\left(\theta^{\left(m\right)}\right)\right]^{-1}\left[\mathbb{S}\left(\theta^{\left(m\right)}\right)\right]$$

This last formula shows how the value of the candidate $\theta$ vector is updated in each step of the algorithm. It also shows how the updating rule was obtained: $\theta^{\left(m+1\right)}$ must be chosen so that its total marginal effect on the log-likelihood equals its marginal effect on the Taylor remainder. In this way we "contain" how much the derivative of the log-likelihood strays away from the value zero.
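
As a concrete (and entirely hypothetical) illustration of the updating rule, the sketch below runs both Newton-Raphson and Fisher scoring for the location parameter of a simulated standard Cauchy sample, a model in which the observed Hessian and the expected information $\mathbb{I}(\theta)=n/2$ differ, so the two algorithms take different steps yet settle on the same $\hat \theta$. The simulated data, starting value, and iteration count are my own choices, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(500) + 3.0        # simulated sample, true location theta_0 = 3
n = x.size

def score(theta):
    # S(theta) = sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2)
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def hessian(theta):
    # H(theta) = sum_i 2 ((x_i - theta)^2 - 1) / (1 + (x_i - theta)^2)^2
    u = x - theta
    return np.sum(2 * (u**2 - 1) / (1 + u**2)**2)

fisher_info = n / 2.0                     # I(theta) = -E[H(theta)] = n/2 for this model

theta_nr = theta_fs = np.median(x)        # robust starting value
for _ in range(25):
    theta_nr = theta_nr - score(theta_nr) / hessian(theta_nr)   # Newton-Raphson step
    theta_fs = theta_fs + score(theta_fs) / fisher_info         # Fisher scoring step

print(theta_nr, theta_fs)                 # both iterations settle on the same MLE
```

Starting from the sample median is a common practical choice here, since the Cauchy log-likelihood is not globally concave and Newton-Raphson can misbehave from a poor starting value.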

If (and when) it so happens that $\theta^{\left(m\right)} = \hat \theta$ we will obtain

$$\theta^{\left(m+1\right)}=\hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\left[\mathbb{S}\left(\hat \theta\right)\right]= \hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\cdot \mathbf 0 = \hat \theta$$

since by construction $\hat \theta$ makes the gradient of the log-likelihood zero. This tells us that once we "hit" $\hat \theta$, we are not going anywhere else after that, which, in an intuitive way, validates our decision to essentially ignore the remainder in order to calculate $\theta^{\left(m+1\right)}$. If the conditions for quadratic convergence of the algorithm are met, we essentially have a contraction mapping, and the MLE is the (or a) fixed point of it. Note that if $\theta^{\left(m\right)} = \hat \theta$ then the remainder also becomes zero, and we have $$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right)=\mathbf 0$$

So our method is internally consistent.

DISTRIBUTION OF $\hat \theta$

To obtain the asymptotic distribution of the MLE we apply the Mean Value Theorem, according to which, if the log-likelihood is continuous and differentiable, then

$$\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right) = \frac{\partial\ell\left(\theta_0\right)}{\partial\theta} +\frac{\partial^{2}\ell\left(\bar \theta\right)}{\partial\theta\partial\theta^{\prime}}\left(\hat \theta-\theta_0\right) = \mathbf 0$$

where $\bar \theta$ is a mean value between $\hat \theta$ and $\theta_0$. Then

$$\left(\hat \theta-\theta_0\right) = -\left[\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\mathbb{S}\left( \theta_0\right)\right]$$

$$\Rightarrow \sqrt n\left(\hat \theta-\theta_0\right) = -\left[\frac 1n\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right]$$

Under the appropriate assumptions, the MLE is a consistent estimator. Then so is $\bar \theta$, since it is sandwiched between the MLE and the true value. Under the assumption that our data is stationary, and one more technical condition (a local dominance condition that guarantees that the expected value of the supremum of the Hessian in a neighborhood of the true value is finite) we have $$\frac 1n\mathbb{H}\left(\bar \theta\right) \rightarrow_p E\left[\mathbb{H}\left(\theta_0\right)\right]$$

Moreover, if interchange of integration and differentiation is valid (which it usually will be), then $$E\left[\mathbb{S}\left( \theta_0\right)\right]=\mathbf 0$$ This, together with the assumption that our data are i.i.d., permits us to use the Lindeberg–Lévy CLT and conclude that $$\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right] \rightarrow_d N(\mathbf 0, \Sigma),\qquad \Sigma = E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]$$

and then, by applying Slutsky's theorem, that $$\sqrt n\left(\hat \theta-\theta_0\right) \rightarrow_d N\left(\mathbf 0, \operatorname{Avar}\right)$$

with

$$\operatorname{Avar} = \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}\cdot \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)\cdot \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}$$
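
As a quick simulation check of the CLT step above (again using the hypothetical standard Cauchy location model, for which $\Sigma = 1/2$ per observation), the scaled score evaluated at $\theta_0$ should have mean close to $\mathbf 0$ and variance close to $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5000
z = np.empty(reps)
for r in range(reps):
    u = rng.standard_cauchy(n)                       # u_i = x_i - theta_0 under the true model
    z[r] = np.sum(2 * u / (1 + u**2)) / np.sqrt(n)   # scaled score at theta_0

print(z.mean(), z.var())                             # approximately 0 and Sigma = 0.5
```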

But the information matrix equality states that

$$-\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big) = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)$$

and so $$\operatorname{Avar} = -\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1} = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)^{-1}$$
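
For a numerical check of the information matrix equality in the same hypothetical Cauchy location model, quadrature gives $E\left[\mathbb{S}_i(\theta_0)^2\right] = -E\left[\mathbb{H}_i(\theta_0)\right] = 1/2$ per observation (taking $\theta_0 = 0$ without loss of generality):

```python
import numpy as np
from scipy.integrate import quad

pdf = lambda u: 1.0 / (np.pi * (1.0 + u**2))                    # standard Cauchy density
score_sq = lambda u: (2 * u / (1 + u**2))**2 * pdf(u)           # S_i(theta_0)^2 times the density
neg_hess = lambda u: -2 * (u**2 - 1) / (1 + u**2)**2 * pdf(u)   # -H_i(theta_0) times the density

print(quad(score_sq, -np.inf, np.inf)[0])   # E[S_i^2] ~ 0.5
print(quad(neg_hess, -np.inf, np.inf)[0])   # -E[H_i]  ~ 0.5
```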

Then for large samples the distribution of $\hat \theta$ is approximated by

$$\hat \theta \sim _{approx} N\left(\theta_0, \frac 1n\operatorname {\widehat Avar}\right)$$

where $\operatorname {\widehat Avar}$ is any consistent estimator of $\operatorname{Avar}$ (the sample analogues of the expected values involved provide such consistent estimators).
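
A minimal sketch of such plug-in estimation, once more on the hypothetical standard Cauchy location example (where the theoretical value is $\operatorname{Avar} = 2$): both the inverse of minus the average Hessian and the inverse of the average outer product of per-observation scores, evaluated at $\hat \theta$, are consistent for $\operatorname{Avar}$, and $\sqrt{\operatorname{\widehat Avar}/n}$ gives the approximate standard error of $\hat \theta$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n = 3.0, 2000
x = rng.standard_cauchy(n) + theta0

def scores(theta):                        # per-observation scores S_i(theta)
    u = x - theta
    return 2 * u / (1 + u**2)

def hessian(theta):                       # total Hessian H(theta)
    u = x - theta
    return np.sum(2 * (u**2 - 1) / (1 + u**2)**2)

theta_hat = np.median(x)                  # Newton-Raphson iterations to the MLE
for _ in range(25):
    theta_hat -= np.sum(scores(theta_hat)) / hessian(theta_hat)

avar_hessian = -1.0 / (hessian(theta_hat) / n)     # sample analogue of -(E[H])^{-1}
avar_opg = 1.0 / np.mean(scores(theta_hat)**2)     # sample analogue of (E[S S'])^{-1}

print(theta_hat)                                   # close to theta0 = 3
print(avar_hessian, avar_opg)                      # both close to the theoretical Avar = 2
print(np.sqrt(avar_hessian / n))                   # approximate standard error of theta_hat
```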