Solved – Akaike Information Criterion (AIC) derivation

aic, mathematical-statistics

I am trying to understand the derivation of the Akaike Information Criterion (AIC). This resource explains it quite well, although a few steps remain unclear to me.

First of all, it takes $\hat{\theta}$ to be the parameters obtained by Maximum Likelihood Estimation (MLE), and it says the discrepancy from the true model can be measured by the Kullback-Leibler distance:

$$\int p(y) \log p(y)\, dy - \int p(y) \log \hat{p}_j(y)\, dy$$

Minimising such a distance is equivalent to maximising the second term, referred to as $K$.
One naive estimate of $K$ is

$$\bar{K} = \frac{1}{N} \sum_{i=1}^N \log p(Y_i, \hat{\theta}) = \frac{\ell_j(\hat{\theta})}{N}$$
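
For concreteness, here is a small numerical sketch of this estimate (a toy example of my own, not from the linked resource): data are drawn from some "true" density, a Gaussian candidate model is fitted by MLE, and $\bar{K}$ is just the maximised log-likelihood divided by $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=1_000)          # sample from some "true" density p(y)

# MLE of a Gaussian candidate model p_j(y) = N(mu, sigma^2)
mu_hat, sigma_hat = y.mean(), y.std()          # ddof=0 gives the MLE of sigma

# K_bar = ell_j(theta_hat) / N: average log-density of the fitted model over the sample
log_lik = -0.5 * np.log(2 * np.pi * sigma_hat**2) - (y - mu_hat) ** 2 / (2 * sigma_hat**2)
K_bar = log_lik.mean()
print(K_bar)
```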

Suppose $\theta_0$ maximises $K$ and let

$$s(y,\theta) = \frac{\partial \log p (y, \theta)}{\partial \theta}$$

be the score and $H(y,\theta)$ the matrix of second derivatives.

  1. The author later in the proof uses the fact that the score has mean $0$: based on what?

Then it says: let $$Z_n = \sqrt{n} (\hat{\theta} - \theta_0)$$

and recall that $$Z_n\rightarrow \mathcal{N}(0, J^{-1}VJ^{-1})$$

where
$$J = -E[H(Y,\theta_0)]$$

and
$$V = \operatorname{Var}(s(Y, \theta_0)).$$

  2. Why $$Z_n = \sqrt{n} (\hat{\theta} - \theta_0)$$? Where does it come from?

Then let

$$S_n = \frac{1}{n} \sum_{i=1}^n s(Y_i, \theta_0)$$

It says that by the Central limit theorem
$$\sqrt{n}S_n \rightarrow \mathcal{N}(0,V)$$
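
To see this claim numerically, here is a quick simulation (my own toy model, not from the resource): $Y_i \sim \text{Exp}(\theta_0)$ with density $p(y,\theta) = \theta e^{-\theta y}$, so the score is $s(y,\theta) = 1/\theta - y$ and $V = \operatorname{Var}(Y) = 1/\theta_0^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: Y ~ Exp(theta_0) with theta_0 = 1, score s(y, theta) = 1/theta - y,
# so V = Var(s(Y, theta_0)) = Var(Y) = 1.
theta0, n, reps = 1.0, 400, 5_000

root_n_Sn = np.empty(reps)
for r in range(reps):
    y = rng.exponential(scale=1 / theta0, size=n)
    Sn = np.mean(1 / theta0 - y)              # S_n, the average score at theta_0
    root_n_Sn[r] = np.sqrt(n) * Sn

print(root_n_Sn.mean(), root_n_Sn.var())      # approximately 0 and V = 1
```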

  3. $V$ comes from the definition, but why mean $0$? Where does it come from?
  4. At some point it says:
    $$J_n = -\frac{1}{n}\sum_{i=1}^n H(Y_i, \theta_0) \xrightarrow{P} J$$
    What's the meaning of $\xrightarrow{P} J$?

EDIT

Additional question.
Defining
$$K_0 = \int p(y) \log p(y, \theta_0) dy $$

and
$$A_N = \frac{1}{N} \sum_{i=1}^N(\ell(Y_i,\theta_0)-K_0)$$
Why is $$E[A_N] = 0$$?
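
A small numerical check of this (assuming $\ell(Y_i,\theta_0)$ denotes $\log p(Y_i,\theta_0)$, and using a toy setup of my own: true model $N(0,1)$, candidate $N(\theta,1)$, $\theta_0 = 0$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: true model N(0, 1), candidate N(theta, 1), theta_0 = 0, so
# K_0 = E[log p(Y, 0)] = -0.5*log(2*pi) - 0.5.
N, reps = 200, 5_000
K0 = -0.5 * np.log(2 * np.pi) - 0.5

A = np.empty(reps)
for r in range(reps):
    y = rng.normal(size=N)
    loglik = -0.5 * np.log(2 * np.pi) - 0.5 * y**2   # log p(Y_i, theta_0)
    A[r] = np.mean(loglik - K0)                      # A_N for this replication

print(A.mean())                                      # close to 0
```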

Best Answer

For simplicity, consider a scalar parameter $\theta_0$ and the corresponding scalar estimate $\hat \theta$.

I will answer Q1 and Q3, which essentially ask why the mean of the score function is zero, $\Bbb{E}_{\theta}(s(\theta)) = 0$. This is a widely known result. To put it simply, notice that the score function $s(\theta)$ depends on the random observations $X$. We can take its expectation as follows:

\begin{align} \Bbb{E}_{\theta}(s) & = \int_x f(x;\theta) \frac{\partial \log f(x;\theta)}{\partial \theta} dx \\ &= \int_x \frac{\partial f(x;\theta)}{\partial \theta} dx \qquad \text{(since $f \,\frac{\partial \log f}{\partial \theta} = \frac{\partial f}{\partial \theta}$)} \\ &= \frac{\partial}{\partial \theta}\int_x f(x;\theta)\, dx \qquad \text{(exchanging integral and derivative)} \\ &= \frac{\partial}{\partial \theta} 1 = 0 \end{align}

Now, notice that $S_n$ is nothing but the average of score functions evaluated at independent observations, so its expectation is also zero.
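
As a quick sanity check of the zero-mean property (a toy model of my own, not the one in the question): for $f(x;\theta) = \theta e^{-\theta x}$ the score is $s(x,\theta) = 1/\theta - x$, and its sample average at the true $\theta$ is close to zero.

```python
import numpy as np

rng = np.random.default_rng(3)

# Score of an Exp(theta) model: s(x, theta) = 1/theta - x.
theta = 2.0
x = rng.exponential(scale=1 / theta, size=1_000_000)
print(np.mean(1 / theta - x))    # close to 0: the score has zero expectation
```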

For Q2, the motivation is to study the asymptotic behaviour of our estimator with respect to the true parameter. Let $\hat{\theta}$ be the maximizer of $L_{n}(\theta)=\frac{1}{n} \sum_{i=1}^{n} \log f\left(X_{i} | \theta\right)$. Now, by the mean value theorem, \begin{align} 0=L_{n}^{\prime}(\hat{\theta}) & =L_{n}^{\prime}\left(\theta_{0}\right)+L_{n}^{\prime \prime}\left(\hat{\theta}_{1}\right)\left(\hat{\theta}-\theta_{0}\right) \quad \text{(for some $\hat\theta_1$ between $\hat\theta$ and $\theta_0$)}\\ \implies & \left(\hat{\theta}-\theta_{0}\right) = -\frac{L_{n}^{\prime}\left(\theta_{0}\right)}{L_{n}^{\prime \prime}\left(\hat{\theta}_{1}\right)} \end{align}

Consider the numerator; since the score has mean zero, $\sqrt{n}\, L_n^{\prime}(\theta_0) = \sqrt{n}(S_n - \Bbb{E}(S_n))$, and by the CLT \begin{align} \sqrt{n}\left(\frac{1}{n} \sum_{i=1}^{n} l^{\prime}\left(X_{i} | \theta_{0}\right)-\mathbb{E}_{\theta_{0}} l^{\prime}\left(X_{1} | \theta_{0}\right)\right) & = \sqrt{n}(S_n - \Bbb{E}(S_n)) \\ & \rightarrow N\left(0, \operatorname{Var}_{\theta_{0}}\left(l^{\prime}\left(X_{1} | \theta_{0}\right)\right)\right) = N(0,V) \end{align}

Now, the denominator term $-L^{\prime\prime}_{n}(\hat\theta_1)$ converges in probability to $J = -\Bbb{E}[H(Y,\theta_0)]$ (the Fisher information) by the LLN, together with the consistency of $\hat\theta$. Therefore, by Slutsky's theorem, for the scalar-parameter case we get $$\sqrt{n}(\hat \theta - \theta_0) \rightarrow N\left(0,\frac{V}{J^2}\right)$$
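
A small simulation illustrating the sandwich form $V/J^2$ (my own misspecified toy example, not part of the derivation above): the candidate model is $N(\theta, 1)$ with known unit variance, so the MLE is the sample mean, the score is $s(y,\theta) = y - \theta$ and $H(y,\theta) = -1$, while the data actually come from a $t_5$ distribution with variance $5/3$. Then $J = 1$ and $V = \operatorname{Var}(Y) = 5/3$, so the limiting variance is $V/J^2 = 5/3$ rather than $1/J = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Misspecified toy example: data Y ~ t_5 (variance 5/3), candidate model N(theta, 1).
# Score: s(y, theta) = y - theta; Hessian: H(y, theta) = -1; theta_0 = E[Y] = 0.
theta0, n, reps = 0.0, 500, 5_000

Z = np.empty(reps)
for r in range(reps):
    y = rng.standard_t(df=5, size=n)
    theta_hat = y.mean()                     # MLE under the N(theta, 1) model
    Z[r] = np.sqrt(n) * (theta_hat - theta0)

V, J = 5 / 3, 1.0                            # V = Var(Y), J = -E[H] = 1
print(Z.var(), V / J**2)                     # both approximately 5/3 (while 1/J = 1)
```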
