Solved – Why does the linear test statistic of GLM follow F-distribution

generalized linear model

As a MATLAB user, I have been using coefTest to perform linear hypothesis tests. For example, in $y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3$, if I want to test whether $\beta_1=\beta_2$, I can simply use the linear contrast $$C=\begin{bmatrix}0&1&-1&0\end{bmatrix}.$$
The test statistic then follows an $F$-distribution, from which I can compute my $p$-value.
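(For readers without MATLAB, the same contrast test can be written out directly. This is a minimal NumPy/SciPy sketch, not coefTest itself, using synthetic data in which $\beta_1=\beta_2$ holds by construction; it computes $F = (C\hat\beta)^\top\big[C(X^\top X)^{-1}C^\top\big]^{-1}(C\hat\beta)\,/\,(q\,\hat\sigma^2)$ and the corresponding $p$-value.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40
# Design matrix [1, x1, x2, x3] and a response with beta1 == beta2 (null true)
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([1.0, 0.5, 0.5, -1.0]) + rng.standard_normal(n)

C = np.array([[0.0, 1.0, -1.0, 0.0]])  # contrast for H0: beta1 - beta2 = 0
q, p = C.shape                          # q constraints, p coefficients

XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ X.T @ y                            # OLS estimate
s2 = np.sum((y - X @ bhat) ** 2) / (n - p)          # residual variance estimate
d = C @ bhat
F = float(d @ np.linalg.inv(C @ XtX_inv @ C.T) @ d) / (q * s2)
pval = float(stats.f.sf(F, q, n - p))               # upper tail of F(q, n-p)
print(F, pval)
```

With Gaussian errors, as here, the statistic is exactly $F(q,\,n-p)$ under the null; the question below is whether the same holds for other GLMs.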

  • Does this hold for all generalized linear models? In particular, I am concerned about the general linear model (Gaussian case) and the logistic regression (binomial case).

  • If so, why does the test statistic, despite so many different instantiations of GLM, always follow an $F$-distribution?

It seems that many sources just take this for granted, probably because it is considered too basic. Yet I need to understand why, so that I can be confident enough to use it. I would sincerely appreciate it if someone could point me to an authoritative book.

Best Answer

Why does the linear test statistic of GLM follow F-distribution?

It doesn't.

Then, the test statistic will follow an $F$-distribution [...] does this hold for all generalized linear models?

There's no result that establishes it in the general case, and indeed we can show (e.g. by simulation in particular instances) that it's not the case in general.

It holds for the Gaussian case, of course, but the derivation relies on the normality of the data. You can see it's not the case for logistic regression, since the data (and hence "F"-statistics based on the data) are discrete.
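The Gaussian claim itself is easy to verify by simulation. The following is a minimal NumPy/SciPy sketch (synthetic data, fixed design, null hypothesis true by construction): the contrast statistic's rejection rate against the $F(q,\,n-p)$ critical value should sit near the nominal 5%, because in this case the statistic has exactly that distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n, p = 50, 4                              # observations, coefficients
C = np.array([[0.0, 1.0, -1.0, 0.0]])     # contrast for H0: beta1 = beta2
q = C.shape[0]

# True coefficients satisfy the null: beta1 == beta2
beta = np.array([1.0, 0.5, 0.5, -1.0])

X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
XtX_inv = np.linalg.inv(X.T @ X)
middle = np.linalg.inv(C @ XtX_inv @ C.T)  # [C (X'X)^-1 C']^-1

def f_stat(y):
    bhat = XtX_inv @ X.T @ y
    resid = y - X @ bhat
    s2 = resid @ resid / (n - p)           # residual variance estimate
    d = C @ bhat
    return (d @ middle @ d) / (q * s2)

sims = np.array([f_stat(X @ beta + rng.standard_normal(n))
                 for _ in range(20000)])

# Empirical rejection rate at the 5% F critical value
reject = float(np.mean(sims > stats.f.ppf(0.95, q, n - p)))
print(round(reject, 3))  # close to 0.05
```

Swapping the Gaussian response for a binomial one (and the OLS fit for an IRLS fit) in the same harness is how one would show, per the answer above, that the exact F result fails in the discrete case.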

There is an asymptotic chi-square result. This, combined with Slutsky's theorem, should give us that the F-statistic will asymptotically be distributed as a scaled chi-square: the numerator quantity is asymptotically chi-square, and the estimated scale in the denominator converges in probability to a constant.
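The Slutsky point can also be seen numerically on the reference-distribution side: as the denominator degrees of freedom $m$ grow, the quantiles of $F(q, m)$ approach those of $\chi^2_q/q$. A short SciPy sketch ($q=3$ is an arbitrary choice for illustration):

```python
from scipy import stats

q = 3                                      # number of constraints (arbitrary here)
limit = stats.chi2.ppf(0.95, q) / q        # 95th percentile of the chi2_q / q limit
for m in (10, 100, 10000):                 # denominator degrees of freedom
    print(m, round(float(stats.f.ppf(0.95, q, m)), 3))
print("limit", round(float(limit), 3))
```

The F quantiles decrease toward the scaled chi-square quantile as $m$ grows, which is why the two reference distributions become interchangeable in large samples.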

However, in sufficiently large samples (where how large "large" is will depend on a number of things), we might anticipate that the $F$ distribution would still be approximately correct, since the $F$ distribution being used to compute $p$-values and the actual distribution of the test statistic both converge to the same scaled chi-square distribution asymptotically.

We see the same issue with the common use of $t$-tests for parameter significance in GLMs (which many packages do), even though the statistic is only $t$-distributed in the Gaussian case; for the others we have only an asymptotic normal result (though a similar argument can be made for why the $t$ shouldn't do badly in sufficiently large samples).

I don't have a good book suggestion. Some books give a handwavy argument for using the $F$ (some akin to mine above), others seem to ignore the need to justify it at all.