Hypothesis Testing – Choosing Between Likelihood Ratio, Score, and Wald Tests

hypothesis-testing, likelihood-ratio, r, regression

From what I've been reading, among other places on the site of the UCLA statistics consulting group, likelihood ratio tests and Wald tests are quite similar: both test whether two GLMs differ significantly in how well they fit a dataset (excuse me if my wording is a bit off). In essence, I can compare two models and test whether the second shows a significantly better fit than the first, or whether there is no difference between them.

So the LR and Wald tests should give p-values in the same ballpark for the same regression models; at the very least, they should lead to the same conclusion.

Now I ran both tests on the same model in R and got widely differing results.
Here is the R output for one model:

library(lmtest)  # provides lrtest() and waldtest()
# note: the first model has no family argument, so it defaults to gaussian
lrtest(glm(data$y ~ 1), glm(data$y ~ data$site_name, family="poisson"))
Likelihood ratio test

Model 1: data$y ~ 1
Model 2: data$y ~ data$site_name
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   2 -89.808                         
2   9 -31.625  7 116.37  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1   

 lrtest(glm(data$y ~ 1, family="poisson"), 
  glm(data$y ~ data$site_name, family="poisson"))
Likelihood ratio test

Model 1: data$y ~ 1
Model 2: data$y ~ data$site_name
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -54.959                         
2   9 -31.625  8 46.667  1.774e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

# with only one model given, waldtest() compares it against the
# intercept-only model
waldtest(glm(data$y ~ data$site_name, family="poisson"))
Wald test

Model 1: data$y ~ data$site_name
Model 2: data$y ~ 1
  Res.Df Df      F Pr(>F)
1     45                 
2     53 -8 0.7398 0.6562   

waldtest(glm(data$y ~ 1, family="poisson"),
 glm(data$y ~ data$site_name, family="poisson"))
Wald test

Model 1: data$y ~ 1
Model 2: data$y ~ data$site_name
  Res.Df Df      F Pr(>F)
1     53                 
2     45  8 0.7398 0.6562

About the data: data$y contains count data and data$site_name is a factor with 9 levels. There are 54 values in data$y, with 6 values per level of data$site_name.

Here are frequency distributions:

table(data$y)

 0  2  4  5  7 
50  1  1  1  1 
table(data$y, data$site_name)
   
    Andulay Antulang Basak Dauin Poblacion District 1 Guinsuan Kookoo's Nest Lutoban Pier Lutoban South Malatapay Pier
  0       6        6     6                          4        6             6            6             5              5
  2       0        0     0                          0        0             0            0             1              0
  4       0        0     0                          1        0             0            0             0              0
  5       0        0     0                          0        0             0            0             0              1
  7       0        0     0                          1        0             0            0             0              0
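
As an aside, these two tables pin the data down completely, so the output above can be reproduced. Here is one way to reconstruct the data frame in R (the ordering of the values within each site is arbitrary):

sites <- c("Andulay", "Antulang", "Basak", "Dauin Poblacion District 1",
           "Guinsuan", "Kookoo's Nest", "Lutoban Pier", "Lutoban South",
           "Malatapay Pier")
data <- data.frame(site_name = factor(rep(sites, each = 6)), y = 0L)
# place the four non-zero counts according to the cross-table above
data$y[data$site_name == "Dauin Poblacion District 1"] <- c(4L, 7L, 0L, 0L, 0L, 0L)
data$y[data$site_name == "Lutoban South"]  <- c(2L, 0L, 0L, 0L, 0L, 0L)
data$y[data$site_name == "Malatapay Pier"] <- c(5L, 0L, 0L, 0L, 0L, 0L)

Refitting the models above on this reconstruction should reproduce the reported log-likelihoods (for instance, the intercept-only Poisson fit gives a log-likelihood of about -54.96).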

Now, these data don't fit a Poisson distribution very well because of the enormous excess of zero counts. But with another model, where the data$y > 0 part fits the Poisson distribution quite well and I use a zero-inflated Poisson model, I still get highly different Wald test and lrtest results: the Wald test gives a p-value of 0.03, while the LR test gives a p-value of 0.0003. That is still a factor of 100 difference, even though the conclusion might be the same.
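
For reference, here is a minimal sketch of the kind of zero-inflated Poisson fit described above, using zeroinfl() from the pscl package. The formula is a placeholder for whatever the second model actually was; both lrtest() and waldtest() from lmtest accept zeroinfl fits:

library(pscl)    # zeroinfl()
library(lmtest)  # lrtest(), waldtest()

# hypothetical zero-inflated Poisson fits; by default zeroinfl() uses
# the same regressors for the count part and the zero-inflation part
zip0 <- zeroinfl(y ~ 1, data = data, dist = "poisson")
zip1 <- zeroinfl(y ~ site_name, data = data, dist = "poisson")

lrtest(zip0, zip1)     # likelihood ratio test
waldtest(zip0, zip1)   # Wald test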

So what am I misunderstanding here about the likelihood ratio test vs. the Wald test?

Best Answer

It's important to note that although researchers use the likelihood ratio test and the Wald test to accomplish the same empirical goal(s), the two are testing different hypotheses. The likelihood ratio test evaluates whether the data are more likely to have come from the more complex model than from the simpler nested one; put another way, does the addition of a particular effect allow the model to account for significantly more of the information in the data? The Wald test, conversely, evaluates whether the estimated effect could plausibly be zero, using only the estimate and its standard error from the fitted full model. It's a nuanced difference, to be sure, but an important conceptual difference nonetheless.
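
To make the distinction concrete, here is a minimal sketch in R on a simulated toy dataset (all object names here are illustrative, not the OP's data) of how each statistic is computed for a single Poisson regression coefficient:

# toy data: a Poisson response with one predictor
set.seed(1)
x <- rnorm(50)
y <- rpois(50, exp(0.2 + 0.5 * x))
full <- glm(y ~ x, family = "poisson")
null <- glm(y ~ 1, family = "poisson")

# LR test: twice the gain in maximized log-likelihood, compared
# against a chi-squared distribution (1 df for one added term)
lr_stat <- as.numeric(2 * (logLik(full) - logLik(null)))
pchisq(lr_stat, df = 1, lower.tail = FALSE)

# Wald test: squared estimate over its estimated variance; note that
# it uses only the curvature of the likelihood at the *full* fit
b  <- coef(full)["x"]
se <- sqrt(vcov(full)["x", "x"])
pchisq(as.numeric((b / se)^2), df = 1, lower.tail = FALSE)

The two p-values should agree closely here because the sample is well behaved; they can diverge sharply when the log-likelihood is far from quadratic, as with heavily zero-inflated counts.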

Agresti (2007) contrasts likelihood ratio testing, Wald testing, and a third method called the "score test" (which he elaborates on only briefly). From his book (p. 13):

When the sample size is small to moderate, the Wald test is the least reliable of the three tests. We should not trust it for such a small n as in this example (n = 10). Likelihood-ratio inference and score-test based inference are better in terms of actual error probabilities coming close to matching nominal levels. A marked divergence in the values of the three statistics indicates that the distribution of the ML estimator may be far from normality. In that case, small-sample methods are more appropriate than large-sample methods.

Looking at your data and output, it seems that you do indeed have a relatively small sample, and you may therefore want to place greater stock in the likelihood ratio test results than in the Wald test results.
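
For completeness, all three of the tests Agresti mentions are easy to run on nested glm fits in R; a sketch, reusing the hypothetical null and full fits from the toy example above (anova.glm exposes the score test as test = "Rao"):

anova(null, full, test = "LRT")               # likelihood ratio test
anova(null, full, test = "Rao")               # score (Rao) test
lmtest::waldtest(null, full, test = "Chisq")  # Wald test

When the three disagree markedly, as in the quoted passage, that is itself a warning that the large-sample approximations are suspect.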

References

Agresti, A. (2007). An introduction to categorical data analysis (2nd edition). Hoboken, NJ: John Wiley & Sons.