Maximum Likelihood – Is the MLE Asymptotically Normal and Efficient for Incorrect Models?

asymptotics, maximum likelihood, model

Premise: this may be a stupid question. I only know the statements about MLE asymptotic properties, but I never studied the proofs. If I did, maybe I wouldn't be asking these questions, or maybe I would realize these questions don't make sense…so please go easy on me 🙂

I've often seen it stated that the maximum likelihood estimator of a model's parameters is asymptotically normal and efficient. The statement is usually written as

$\hat{\theta}\xrightarrow[]{d}\mathcal{N}(\theta_0,\mathbf{I}(\theta_0)^{-1})$ as $N\to\infty$

where $N$ is the number of samples, $\mathbf{I}$ is the Fisher information and $\theta_0$ is the true value of the parameter (vector). Now, since there is reference to a true model, does this mean that the result will not hold if the model is not true?

Example: suppose I model the power output $P$ of a wind turbine as a quadratic function of wind speed $V$ plus additive Gaussian noise

$P=\beta_0+\beta_1V+\beta_2V^2+\epsilon$

I know the model is wrong, for at least two reasons: 1) $P$ is really proportional to the third power of $V$, and 2) the error is not additive, because I neglected other predictors which are correlated with wind speed (I also know that $\beta_0$ should be 0, because at zero wind speed no power is generated, but that's not relevant here). Now, suppose I have an infinite database of power and wind speed data from my wind turbine: I can draw as many samples as I want, of whatever size. Suppose I draw 1000 samples, each of size $N=100$, and for each compute $\hat{\boldsymbol{\beta}}_{100}$, the MLE estimate of $\boldsymbol{\beta}=(\beta_0,\beta_1,\beta_2)$ (which under my model is just the OLS estimate). I thus have 1000 draws from the distribution of $\hat{\boldsymbol{\beta}}_{100}$. I can repeat the exercise with $N=500,1000,1500,\dots$ As $N\to\infty$, should the distribution of $\hat{\boldsymbol{\beta}}_{N}$ tend to a normal with the stated mean and variance? Or does the fact that the model is incorrect invalidate this result?
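
For concreteness, here is a minimal simulation sketch of this experiment. The data-generating process ($P=0.5V^3$ with noise whose spread grows with $V$, standing in for the neglected predictors) and all its constants are hypothetical, chosen only so that the fitted quadratic model is deliberately wrong:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n):
    """Hypothetical DGP: cubic power curve with V-dependent noise,
    deliberately not the quadratic-plus-Gaussian model we fit."""
    v = rng.uniform(3.0, 25.0, size=n)          # wind speed
    p = 0.5 * v**3 + rng.normal(0.0, 5.0 * v)   # power output
    return v, p

def fit_quadratic(v, p):
    """OLS fit of the (wrong) model P = b0 + b1*V + b2*V^2 + eps;
    under the assumed additive Gaussian error, OLS is the MLE."""
    X = np.column_stack([np.ones_like(v), v, v**2])
    beta, *_ = np.linalg.lstsq(X, p, rcond=None)
    return beta

for n in (100, 500, 1000, 1500):
    betas = np.array([fit_quadratic(*draw_sample(n)) for _ in range(1000)])
    print(n, betas.mean(axis=0).round(2), betas.std(axis=0).round(3))
```

If QMLE theory applies here, the histograms of each coefficient should look increasingly normal as $N$ grows, centered not on any "true" $\boldsymbol{\beta}$ (none exists) but on the pseudo-true quadratic coefficients, i.e. those minimizing the expected squared error.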

The reason I'm asking is that models are rarely (if ever) "true" in applications. If the asymptotic properties of MLE are lost when the model is not true, then it might make sense to use different estimation principles which, while less efficient in a setting where the model is correct, may perform better than MLE under misspecification.

EDIT: it was noted in the comments that the notion of a true model can be problematic. I had the following definition in mind: given a family of models $f_{\boldsymbol{\theta}}(x)$ indexed by the parameter vector $\boldsymbol{\theta}$, for each model in the family you can always write

$Y=f_{\boldsymbol{\theta}}(X)+\epsilon$

by simply defining $\epsilon$ as $Y-f_{\boldsymbol{\theta}}(X)$. However, in general the error won't be orthogonal to $X$ or have mean 0, and it won't necessarily have the distribution assumed in the derivation of the model. If there exists a value $\boldsymbol{\theta_0}$ such that $\epsilon$ has these two properties, as well as the assumed distribution, I would say the model is true. I think this is directly related to saying that $f_{\boldsymbol{\theta_0}}(X)=E[Y|X]$, because the error term in the decomposition

$Y=E[Y|X]+\epsilon$

has the two properties mentioned above.
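
Indeed, with $\epsilon$ defined as $Y-E[Y|X]$, the law of iterated expectations gives $E[\epsilon|X]=0$, hence for any (measurable) function $g$

$E[\epsilon\,g(X)]=E\big[g(X)\,E[\epsilon|X]\big]=0$

so taking $g\equiv 1$ gives mean 0, and $g(X)=X$ gives orthogonality to $X$.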

Best Answer

I don't believe there is a single answer to this question.

When we consider possible distributional misspecification while applying maximum likelihood estimation, we get what is called the "Quasi-Maximum Likelihood" estimator (QMLE). In certain cases the QMLE is both consistent and asymptotically normal, although it is then consistent for the "pseudo-true" parameter value, the one minimizing the Kullback–Leibler divergence between the assumed model and the true data-generating distribution.

What it loses for certain is asymptotic efficiency. This is because the asymptotic variance of $\sqrt n (\hat \theta - \theta)$ (it is this quantity, not $\hat \theta$ itself, that has a nondegenerate asymptotic distribution) is, in all cases,

$$\text{Avar}[\sqrt n (\hat \theta - \theta)] = \text{plim}\Big( [\hat H]^{-1}[\hat S \hat S^T][\hat H]^{-1}\Big) \tag{1}$$

where $H$ is the Hessian matrix of the log-likelihood, $S$ is its gradient (the score), and the hats indicate sample estimates.

Now, if we have correct specification, we get, first, that

$$\text{Avar}[\sqrt n (\hat \theta - \theta)] = (\mathbb E[H_0])^{-1}\mathbb E[S_0S_0^T](\mathbb E[H_0])^{-1} \tag{2}$$

where the "$0$" subscript denotes evaluation at the true parameters (and note that the middle term is the definition of Fisher Information), and second, that the "information matrix equality" holds and states that $-\mathbb E[H_0] = \mathbb E[S_0S_0^T]$, which means that the asymptotic variance will finally be

$$\text{Avar}[\sqrt n (\hat \theta - \theta)] = -(\mathbb E[H_0])^{-1} \tag{3}$$

which is the inverse of the Fisher information.

But if we have misspecification, expression $(1)$ does not lead to expression $(2)$, because the first and second derivatives in $(1)$ have been derived from the wrong likelihood. This in turn implies that the information matrix equality does not hold, that we do not end up at expression $(3)$, and that the (Q)MLE does not attain full asymptotic efficiency.
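
To make expressions $(1)$–$(3)$ concrete, here is a minimal numerical sketch for the Gaussian linear model (where the MLE of the coefficients is OLS), reusing the hypothetical misspecified DGP from the question. For this likelihood the per-observation score is $x_i e_i/\sigma^2$ and the Hessian is $-X'X/\sigma^2$, so $\sigma^2$ cancels inside the sandwich and the unscaled pieces suffice:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
v = rng.uniform(3.0, 25.0, size=n)
X = np.column_stack([np.ones(n), v, v**2])   # regressors of the fitted model
y = 0.5 * v**3 + rng.normal(0.0, 5.0 * v)    # misspecified: cubic mean, heteroskedastic noise

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                             # residuals

S = X * e[:, None]                           # per-observation scores, sigma^2 dropped
H_inv = np.linalg.inv(X.T @ X)               # (minus Hessian)^{-1}, sigma^2 dropped
sandwich = H_inv @ (S.T @ S) @ H_inv         # finite-sample analogue of (1)
classical = H_inv * e.var()                  # inverse-information analogue of (3)

print(np.sqrt(np.diag(sandwich)))            # robust (QMLE) standard errors
print(np.sqrt(np.diag(classical)))           # naive MLE standard errors
```

Here the robust standard errors (the familiar heteroskedasticity-robust "sandwich" estimates) should be visibly larger than the naive ones, and should match what, e.g., statsmodels reports with `OLS(y, X).fit(cov_type='HC0')`.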
