In Foundations of Linear and Generalized Linear Models, Agresti makes a comment on page 131 about likelihood ratio, Wald, and Score testing of regression parameters.
For the best-known GLM, the normal linear model, the three types of
inference provide identical results.
I tried this out in R to see what would happen. My own likelihood ratio test gave a different p-value than the Wald-based default printout from "summary()", so something about my interpretation of Agresti's comment must be incorrect.
set.seed(2020)
N <- 100
x <- rbinom(N, 1, 0.5)
err <- rnorm(N)
y <- 0.5*x + err
G0 <- glm(y ~ 1, family = "gaussian")  # intercept-only model
G1 <- glm(y ~ x, family = "gaussian")  # model with x as a predictor
test_stat <- summary(G0)$deviance - summary(G1)$deviance
df <- dim(summary(G1)$coefficients)[1] - dim(summary(G0)$coefficients)[1]
p.value <- 1 - pchisq(test_stat, df)
p.value                         # my likelihood ratio test p-value
summary(G1)$coefficients[2, 4]  # Wald p-value from summary()
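(As a cross-check on my deviance-difference version, the LR statistic can also be computed directly from the fitted log-likelihoods. This is a sketch using the same simulated data; note that for a Gaussian GLM the deviance is the residual sum of squares, so the log-likelihood-based statistic is N*log(RSS0/RSS1) rather than RSS0 - RSS1, and the two differ when the error variance is estimated.)

```r
set.seed(2020)
N <- 100
x <- rbinom(N, 1, 0.5)
y <- 0.5*x + rnorm(N)
G0 <- glm(y ~ 1, family = "gaussian")  # intercept-only model
G1 <- glm(y ~ x, family = "gaussian")  # model with x as a predictor

# LR statistic with the error variance profiled out (ML estimate):
lr_stat <- as.numeric(2 * (logLik(G1) - logLik(G0)))
# equivalently N * log(deviance(G0) / deviance(G1)) for a Gaussian GLM
pchisq(lr_stat, df = 1, lower.tail = FALSE)  # LR test p-value
```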
However, when I ran a simulation with many repetitions to check long-run performance, the results were about the same.
set.seed(2020)
N <- 100      # sample size
R <- 1000     # number of simulations
alpha <- 0.05
lrt_r <- wld_r <- rep(0, R)
for (i in 1:R) {
  x <- rbinom(N, 1, 0.5)
  err <- rnorm(N)
  y <- 0.5*x + err
  G0 <- glm(y ~ 1, family = "gaussian")  # intercept-only model
  G1 <- glm(y ~ x, family = "gaussian")  # model with x as a predictor
  test_stat <- summary(G0)$deviance - summary(G1)$deviance
  df <- dim(summary(G1)$coefficients)[1] - dim(summary(G0)$coefficients)[1]
  lr <- 1 - pchisq(test_stat, df)        # likelihood ratio test p-value
  wd <- summary(G1)$coefficients[2, 4]   # Wald test p-value
  # check if the p-values warrant rejection at the level of alpha
  if (lr <= alpha) {lrt_r[i] <- 1}
  if (wd <= alpha) {wld_r[i] <- 1}
}
# Check the power of each test
sum(lrt_r)/R*100  # 70.4%
sum(wld_r)/R*100  # 69.9%
This is close enough to suggest to me that the difference is due to a finite number of repetitions and/or something about that particular 2020 seed (though seeds 1 and 7 also give likelihood ratio testing slightly higher power, which I find suspicious).
Is that what's going on in Agresti's quote, that the three methods may not give identical results on any particular data set but will have the same long-run performance on many samples drawn from the same population?
(I didn't address score testing here, and I am content to prioritize Wald versus likelihood ratio testing.)
Reference
Agresti, Alan. Foundations of Linear and Generalized Linear Models. John Wiley & Sons, 2015.
Best Answer
Exact equivalence only holds if the error variance is known; see Exact equivalence of LR and Wald in linear regression under known error variance. Otherwise, the Wald, likelihood ratio, and Lagrange multiplier (score) statistics are related via $W\geq LR\geq LM$ in a normal likelihood framework, and equivalence obtains only asymptotically, as illustrated by the slightly revised version of your code below.
Notice that the above-mentioned ranking assumes that the error variance estimates are based on the ML estimate $1/n\sum_ie_i^2$ rather than the unbiased estimate $1/(n-k)\sum_ie_i^2$. The t-statistic retrieved from lm uses the latter, so it is not exactly correct that the squared t-statistic equals the Wald statistic; hence, as in the numerical example below where we have summary(G1)$coefficients[2,3]^2 < test_stat, the ranking need not emerge. We would obtain the likelihood-based Wald statistic from summary(G1)$coefficients[2,3]^2*N/(N-2), for which the ranking would again be satisfied.
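(The revised code referred to above was not captured in this post. The following is a sketch of what such an illustration could look like, under the assumption that all three statistics use the ML variance estimate; the simulation setup copies the question's code, and the final line is this editor's consistency check, not part of the original answer.)

```r
set.seed(2020)
N <- 100
x <- rbinom(N, 1, 0.5)
y <- 0.5*x + rnorm(N)
G0 <- glm(y ~ 1, family = "gaussian")  # restricted (intercept-only) model
G1 <- glm(y ~ x, family = "gaussian")  # unrestricted model

RSS0 <- deviance(G0)  # for a Gaussian GLM the deviance is the RSS
RSS1 <- deviance(G1)

# Classical trinity with the error variance estimated by ML (divide by N):
W  <- N * (RSS0 - RSS1) / RSS1  # Wald
LR <- N * log(RSS0 / RSS1)      # likelihood ratio
LM <- N * (RSS0 - RSS1) / RSS0  # Lagrange multiplier (score)

c(W = W, LR = LR, LM = LM)      # ranking W >= LR >= LM holds in finite samples

# The likelihood-based Wald statistic equals the squared t-statistic rescaled
# from the unbiased variance estimate RSS1/(N-2) to the ML estimate RSS1/N:
all.equal(W, summary(G1)$coefficients[2, 3]^2 * N / (N - 2))
```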