Hypothesis Testing – Why Wilks’ 1938 Proof Doesn’t Work for Misspecified Models

asymptotics, hypothesis testing, likelihood-ratio, misspecification, model selection

In the famous 1938 paper ("The large-sample distribution of the likelihood ratio for testing composite hypotheses", Annals of Mathematical Statistics, 9:60-62), Samuel Wilks derived the asymptotic distribution of $2 \times LLR$ (log likelihood ratio) for nested hypotheses, under the assumption that the larger hypothesis is correctly specified. The limiting distribution is $\chi^2$ (chi-squared) with $h-m$ degrees of freedom, where $h$ is the number of parameters in the larger hypothesis and $m$ is the number of free parameters in the nested hypothesis. However, it is supposedly well-known that this result does not hold when the hypotheses are misspecified (i.e., when the larger hypothesis is not the true distribution for the sampled data).

Can anyone explain why? It seems to me that Wilks' proof should still work with minor modifications. It relies on the asymptotic normality of the maximum likelihood estimate (MLE), which still holds with misspecified models. The only difference is the covariance matrix of the limiting multivariate normal: for correctly specified models, we can approximate the covariance matrix by the inverse Fisher information matrix $J^{-1}$, whereas under misspecification we have to use the sandwich estimate of the covariance matrix, $J^{-1} K J^{-1}$. The latter reduces to the inverse Fisher information matrix when the model is correctly specified (since $J = K$). AFAICT, Wilks' proof doesn't care where the estimate of the covariance matrix comes from, as long as we have an invertible asymptotic covariance matrix for the multivariate normal of the MLEs ($c^{-1}$ in the Wilks paper).
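For concreteness, here is a toy numerical sketch of the two covariance estimates (my own illustrative setup, not from Wilks' paper): the working model is $N(\mu, 1)$ but the data are Student-t with 5 degrees of freedom, so the model is misspecified and $J \neq K$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=5, size=100_000)       # true data: heavier-tailed than the working model

mu_hat = x.mean()                            # MLE of mu under the working model N(mu, 1)

scores = x - mu_hat                          # per-observation score of the working log-likelihood
J = 1.0                                      # minus the second derivative of the working log-likelihood
K = np.mean(scores**2)                       # average squared score (estimates E_true[score^2])

fisher_based = 1.0 / J                       # what a correctly specified analysis would use
sandwich = K / J**2                          # J^{-1} K J^{-1}

print(fisher_based, sandwich)                # about 1.0 vs about 5/3: they disagree
```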

Best Answer

R.V. Foutz and R.C. Srivastava have examined the issue in detail. Their 1977 paper "The performance of the likelihood ratio test when the model is incorrect" contains a statement of the distributional result in the case of misspecification alongside a very brief sketch of the proof, while their 1978 paper "The asymptotic distribution of the likelihood ratio when the model is incorrect" contains the proof, although the latter is typed in old-fashioned typewriter style (both papers use the same notation, so you can combine them in reading). Also, for some steps of the proof they refer to a 1957 paper by K.P. Roy, "A note on the asymptotic distribution of likelihood ratio", which does not appear to be available on-line, even gated.

In the case of distributional misspecification, if the MLE is still consistent and asymptotically normal (which is not always the case), the LR statistic asymptotically follows a linear combination of independent chi-squares, each with one degree of freedom:

$$-2\ln \lambda \xrightarrow{d} \sum_{i=1}^{r}c_i\chi^2_i$$

where $r=h-m$. One can see the "similarity": instead of one chi-square with $h-m$ degrees of freedom, we have $h-m$ chi-squares each with one degree of freedom. But the "analogy" stops there, because a linear combination of chi-squares does not have a closed-form density. Each scaled chi-square is a gamma, but each with a different $c_i$ that translates into a different scale parameter for the gamma, and the sum of such gammas has no closed form, although its values can be computed (numerically or by simulation, as sketched below).
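A small Monte Carlo sketch of that computation (the weights $c_i$ below are arbitrary illustrative numbers, not derived from any real model): it compares the tail of $\sum_i c_i \chi^2_1$ with the $\chi^2_r$ critical value that Wilks' theorem would prescribe.

```python
import numpy as np
from scipy import stats

c = np.array([2.5, 1.2, 0.4])                          # hypothetical weights c_1 >= ... >= c_r
r = len(c)

rng = np.random.default_rng(1)
draws = rng.chisquare(df=1, size=(1_000_000, r)) @ c   # samples of sum_i c_i * chi2_1

crit = stats.chi2.ppf(0.95, df=r)                      # nominal 5% critical value from chi2_r
print(crit, np.mean(draws > crit))                     # the actual tail mass is not 0.05
```

Unless all the $c_i$ happen to equal one, the actual tail mass at the nominal $\chi^2_r$ cutoff differs from 5%, which is exactly why the misspecified LR test has the wrong size.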

For the $c_i$ constants, we have $c_1 \geq c_2\geq \dots \geq c_r \geq 0$, and they are the eigenvalues of a matrix... which matrix? Well, using the authors' notation, set $\Lambda$ to be the Hessian of the log-likelihood and $C$ to be the outer product of the gradient of the log-likelihood (both in expectation). Then $V = \Lambda^{-1} C (\Lambda')^{-1}$ is the asymptotic variance-covariance matrix of the MLE.

Then set $M$ to be the upper-left $r \times r$ block of $V$.

Also write $\Lambda$ in block form

$$\Lambda =\left [\begin {matrix} \Lambda_{r\times r} & \Lambda_2'\\ \Lambda_2 & \Lambda_3\\ \end{matrix}\right]$$

and set $W = -\Lambda_{r\times r}+\Lambda_2'\Lambda_3^{-1}\Lambda_2$ ($W$ is the negative of the Schur complement of $\Lambda_3$ in $\Lambda$).

Then the $c_i$'s are the eigenvalues of the matrix $MW$ evaluated at the true values of the parameters.
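In code, the construction looks like this (a sketch only: the matrices $\Lambda$ and $C$ below are random placeholders standing in for the expected Hessian and the expected outer product of the score, which in practice depend on the model, the true distribution, and the pseudo-true parameter values):

```python
import numpy as np

h, r = 5, 2                                            # h parameters in all, r = h - m restrictions
rng = np.random.default_rng(2)

A = rng.standard_normal((h, h)); Lam = -(A @ A.T)      # placeholder for the expected Hessian (negative definite)
B = rng.standard_normal((h, h)); C = B @ B.T           # placeholder for the expected outer product of the score

V = np.linalg.inv(Lam) @ C @ np.linalg.inv(Lam).T      # asymptotic covariance of the MLE
M = V[:r, :r]                                          # upper-left r x r block of V

# W = -Lambda_{r x r} + Lambda_2' Lambda_3^{-1} Lambda_2, using the block layout above
W = -Lam[:r, :r] + Lam[:r, r:] @ np.linalg.inv(Lam[r:, r:]) @ Lam[r:, :r]

c = np.linalg.eigvals(M @ W)                           # the weights c_i
print(np.sort(c.real)[::-1])                           # nonnegative and sorted, as stated
```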

ADDENDUM
Responding to the valid remark of the OP in the comments (sometimes, indeed, a question becomes a springboard for sharing a more general result, and is itself neglected in the process), here is how Wilks' proof proceeds: Wilks starts with the joint normal distribution of the MLE and derives the functional expression of the likelihood ratio. Up to and including his eq. $[9]$, the proof goes through even if we allow for distributional misspecification: as the OP notes, the terms of the variance-covariance matrix will be different under misspecification, but all Wilks does is take derivatives and identify asymptotically negligible terms. And so he arrives at eq. $[9]$, where we see that, if the specification is correct, the likelihood-ratio statistic is just the sum of $h-m$ squared standard normal random variables, whose sum is distributed as one chi-square with $h-m$ degrees of freedom (in generic notation):

$$-2\ln \lambda = \sum_{i=1}^{h-m}\left(\sqrt n\frac{\hat \theta_i - \theta_i}{\sigma_i}\right)^2 \xrightarrow{d} \chi^2_{h-m}$$

But if we have misspecification, then the terms used to scale the centered and magnified MLE $\sqrt n(\hat \theta -\theta)$ are no longer the terms that make the variance of each element equal to unity, and so they no longer turn each term into a standard normal r.v. and the sum into a chi-square.
And they cannot be, because these terms involve the expected values of the second derivatives of the log-likelihood: the expectation can only be taken with respect to the true distribution (the MLE is a function of the data, and the data follow the true distribution), while the second derivatives of the log-likelihood are calculated under the wrong density assumption.

So under misspecification we have something like $$-2\ln \lambda = \sum_{i=1}^{h-m}\left(\sqrt n\frac{\hat \theta_i - \theta_i}{a_i}\right)^2$$ and the best we can do is to manipulate it into

$$-2\ln \lambda = \sum_{i=1}^{h-m}\frac {\sigma_i^2}{a_i^2}\left(\sqrt n\frac{\hat \theta_i - \theta_i}{\sigma_i}\right)^2 = \sum_{i=1}^{h-m}\frac {\sigma_i^2}{a_i^2}\chi^2_1$$

which is a sum of scaled chi-square r.v.'s, no longer distributed as one chi-square r.v. with $h-m$ degrees of freedom. The reference provided by the OP is indeed a very clear exposition of this more general case that includes Wilks' result as a special case.
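As a toy check of this last display (my own illustrative setup, not from the question or the cited papers): take the working model $N(\mu, 1)$, test $H_0:\mu=0$ against a free $\mu$, but let the data really be $N(0, \sigma^2)$ with $\sigma^2=4$. Then $-2\ln\lambda = n\bar x^2$, which is distributed as $\sigma^2\chi^2_1$ rather than $\chi^2_1$, and the nominal chi-square critical value gives the wrong size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, sigma, reps = 200, 2.0, 20_000

xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)   # data from the true N(0, sigma^2)
lr = n * xbar**2                                             # -2 ln(lambda) for the working model

crit = stats.chi2.ppf(0.95, df=1)
print(np.mean(lr > crit))                                    # far above the nominal 0.05
print(np.mean(lr / sigma**2 > crit))                         # ~0.05 once rescaled by sigma^2
```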
