Solved – Central limit theorem for maximum likelihood estimators when modelling assumptions are violated

central limit theorem, maximum likelihood, misspecification, modeling, quasi-likelihood

Lehmann's Elements of Large-Sample Theory gives, in Theorem 7.5.2, a central limit theorem for multiparameter maximum likelihood estimators. (Many other sources provide similar theorems.) The theorem states that under certain technical conditions,

$$
\sqrt{n}(\theta^* - \theta_n) \rightarrow_L \mathcal{N}(0, I(\theta^*)^{-1})
$$

where $\theta^*$ is the true parameter vector, $\theta_n$ is the maximum likelihood estimate computed from $n$ samples, and $I(\theta^*)$ is the Fisher information matrix.

The technical conditions of these theorems never explicitly state that the model whose parameters we are trying to learn is, in any sense, a good model for the data. Yet the worked examples always assume it. For example, immediately after stating the theorem above, Lehmann uses it to prove that the estimates of the mean and variance of a normal distribution are themselves asymptotically normally distributed. But the example assumes that the data points really are normally distributed.
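To make the statement concrete, here is a quick simulation sketch (my own addition, not part of the original question, with numpy assumed as the only dependency). It checks the theorem in the correctly specified normal case: for $\mathcal{N}(\mu, \sigma^2)$ data, the rescaled estimation errors $\sqrt{n}(\theta^* - \theta_n)$ should have covariance close to $I(\theta^*)^{-1} = \mathrm{diag}(\sigma^2, 2\sigma^4)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2_true = 2.0, 4.0          # theta* = (mu, sigma^2)
n, n_reps = 500, 20000

# Draw n_reps datasets of size n and compute the normal MLEs:
# mu_hat = sample mean, sigma2_hat = sample variance with 1/n divisor.
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=(n_reps, n))
mu_hat = x.mean(axis=1)
sigma2_hat = x.var(axis=1)               # ddof=0 is the ML estimate

# Empirical covariance of sqrt(n) * (theta_hat - theta*)
z = np.sqrt(n) * np.column_stack([mu_hat - mu_true, sigma2_hat - sigma2_true])
print("empirical covariance:\n", np.cov(z, rowvar=False))

# Theoretical limit I(theta*)^{-1} = diag(sigma^2, 2 sigma^4)
print("I(theta*)^{-1}:\n", np.diag([sigma2_true, 2 * sigma2_true**2]))
```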

What if the data were actually exponentially distributed, but I made the (badly wrong) modelling assumption that they are normally distributed? Does the CLT for the MLE still hold? In general, is there a way to characterize the distribution of the parameter estimates that depends on how poor a modelling assumption we've made?

Best Answer

It turns out that the sample mean and sample variance, which are the sufficient statistics for the two-parameter normal $\theta = [\mu, \sigma^2]$, are also consistent estimators of the mean and variance of any other distribution (provided those moments exist) by the WLLN, and by the central limit theorem they have the same limiting normal distribution that ML theory predicts (when the regularity conditions are met). So for your example, a normal probability model imposed on exponentially distributed data, you are okay: the normal estimates are just the method-of-moments estimators.
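As a sanity check on this claim, here is a small simulation sketch (again my own illustration, not from the original answer, with numpy and scipy assumed): exponential data are fit with the normal model's MLEs, i.e. the sample mean and the $1/n$ sample variance, and those estimates still concentrate on the exponential's true mean and variance with an approximately normal sampling distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam = 1.5                                 # true exponential rate
m_true, v_true = 1 / lam, 1 / lam**2      # true mean and variance
n, n_reps = 500, 20000

x = rng.exponential(scale=1 / lam, size=(n_reps, n))
mu_hat = x.mean(axis=1)                   # "normal model" MLE of the mean
s2_hat = x.var(axis=1)                    # "normal model" MLE of the variance

# Both estimators concentrate on the exponential's true mean and variance ...
print("mean of mu_hat:", mu_hat.mean(), " true mean:", m_true)
print("mean of s2_hat:", s2_hat.mean(), " true variance:", v_true)

# ... and the sampling distribution of the mean estimate shows only a small
# skewness at this n, consistent with approximate normality.
print("skewness of sqrt(n)*(mu_hat - m_true):",
      stats.skew(np.sqrt(n) * (mu_hat - m_true)))
```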

I don't think as much can be said in general. If the underlying data were a shifted/scaled Cauchy, $\mathcal{T}_1(\delta)$, the sample mean is not even a consistent estimator of the shift parameter. If the underlying data were $\mathrm{Beta}(\alpha, \beta)$ and we imposed the one-parameter model $\mathrm{Beta}(\alpha^*, 1)$, then the estimate of $\alpha^*$ would converge to some estimand, with a limiting distribution around it, but that estimand is not $\alpha$ in general; it is a function of both $\alpha$ and $\beta$. This is a justification for quasilikelihood. If the underlying data were double exponential (Laplace), the normal model would still estimate the mean without bias, but you would pay a price in efficiency, because the sample mean is inefficient relative to the median (the actual MLE under the Laplace model).
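The first and third of these are easy to see numerically. The following sketch (an illustration I am adding, with numpy assumed) shows that the sample mean of Cauchy data does not settle down as $n$ grows, while for Laplace data the sample mean has roughly twice the sampling variance of the sample median.

```python
import numpy as np

rng = np.random.default_rng(2)

# 1) Cauchy data: the sample mean does not settle down as n grows,
#    while the sample median homes in on the shift parameter.
for n in (10**2, 10**4, 10**6):
    x = rng.standard_cauchy(n) + 3.0      # shift parameter delta = 3
    print(f"Cauchy, n={n:>7}: sample mean = {x.mean():8.2f},"
          f" sample median = {np.median(x):6.3f}")

# 2) Laplace data: the mean is unbiased for the location but less efficient
#    than the median (the actual MLE under the Laplace model).
n, n_reps = 500, 20000
x = rng.laplace(loc=0.0, scale=1.0, size=(n_reps, n))
print("var of sample mean:  ", x.mean(axis=1).var())       # about 2/n
print("var of sample median:", np.median(x, axis=1).var()) # about 1/n
```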

I think these three "counterexamples" show why quasilikelihood has a basis: many sufficient statistics are also method-of-moments statistics, they have central limit theorems for reasons that do not depend on the working model, and they estimate useful quantities. However, when you actually know the probability model, ML theory gives you more efficient estimators and more powerful tests nearly every time.
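To illustrate that last point under one concrete set of assumptions (a hypothetical example I am adding, not part of the original answer): for exponential data, the model-based plug-in estimator of $\mathrm{Var}(X) = 1/\lambda^2$, namely $\bar{x}^2$ from the exponential MLE $\hat{\lambda} = 1/\bar{x}$, has roughly half the sampling variance of the generic sample-variance estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, n_reps = 1.5, 500, 20000
x = rng.exponential(scale=1 / lam, size=(n_reps, n))

# Two estimators of Var(X) = 1/lam^2 for exponential data:
s2_moment = x.var(axis=1)        # generic method-of-moments / normal-model estimate
s2_model = x.mean(axis=1) ** 2   # plug-in from the exponential MLE: (1/lam_hat)^2 = xbar^2

print("true variance:", 1 / lam**2)
print("moment estimator:  mean %.4f, sampling variance %.2e"
      % (s2_moment.mean(), s2_moment.var()))
print("model-based (MLE): mean %.4f, sampling variance %.2e"
      % (s2_model.mean(), s2_model.var()))
# Both are consistent, but the estimator that uses the exponential form has
# roughly half the sampling variance here (about 4/(lam^4 n) vs 8/(lam^4 n)).
```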