Regression – General Linear Hypothesis Test Statistic: Equivalence of Two Expressions Explained

anova, linear model, regression

Assume a general linear model $y = X \beta + \epsilon$ with observations in an $n$-vector $y$ and an $(n \times p)$ design matrix $X$ of rank $p$ for the $p$ parameters in a $p$-vector $\beta$. A general linear hypothesis (GLH) about $q$ of these parameters ($q < p$) can be written as $\psi = C \beta$, where $C$ is a $(q \times p)$ matrix. An example of a GLH is the one-way ANOVA hypothesis, where $C \beta = 0$ under the null.

The GLH test uses a restricted model with design matrix $X_{r}$ in which the $q$ tested parameters are set to 0 and the corresponding $q$ columns of $X$ are removed. The unrestricted model with design matrix $X_{u}$ imposes no restrictions and thus contains $q$ more free parameters: its parameters are a superset of those from the restricted model, and the columns of $X_{u}$ are a superset of those of $X_{r}$.
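For concreteness, here is a minimal toy example of this setup (my own construction: three groups coded as an intercept plus two dummy columns, so the GLH sets the last $q = 2$ parameters to zero and $X_{r}$ drops the corresponding columns):

```python
# Toy setup (assumption: 3 groups, reference coding, 4 observations per group).
import numpy as np

m = 4                                   # observations per group
g = np.repeat([0, 1, 2], m)             # group labels
n, p, q = g.size, 3, 2                  # n observations, p parameters, q tested

X_u = np.column_stack([np.ones(n),              # intercept
                       (g == 1).astype(float),  # dummy for group 1
                       (g == 2).astype(float)]) # dummy for group 2
X_r = X_u[:, :1]                        # restricted design: tested columns removed

C = np.array([[0.0, 1.0, 0.0],          # C @ beta picks out the q tested
              [0.0, 0.0, 1.0]])         # parameters; H0: C @ beta = 0
```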

$P_{u} = X_{u}(X_{u}'X_{u})^{-1} X_{u}'$ is the orthogonal projection onto the subspace $V_{u}$ spanned by the columns of $X_{u}$, and analogously $P_{r}$ onto $V_{r}$; then $V_{r} \subset V_{u}$. The parameter estimates of a model are $\hat{\beta} = X^{+} y = (X'X)^{-1} X' y$, the predictions are $\hat{y} = P y$, the residuals are $e = (I-P)y$, the sum of squared residuals SSE is $||e||^{2} = e'e = y'(I-P)y$, and the estimate for $\psi$ is $\hat{\psi} = C \hat{\beta}$. The difference $SSE_{r} - SSE_{u}$ is $y'(P_{u}-P_{r})y$. Now the univariate $F$ test statistic for a GLH that is familiar (and understandable) to me is:
$$
F = \frac{(SSE_{r} - SSE_{u}) / q}{\hat{\sigma}^{2}} = \frac{y' (P_{u} - P_{r}) y / q}{y' (I - P_{u}) y / (n - p)}
$$

There's an equivalent form that I don't yet understand:
$$
F = \frac{(C \hat{\beta})' (C(X'X)^{-1}C')^{-1} (C \hat{\beta}) / q}{\hat{\sigma}^{2}}
$$
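Before attacking the algebra, here is a quick self-contained numerical check (my own toy three-group example again; all names are mine) that the two expressions really do agree:

```python
# Numerically compare the projection form and the quadratic-form expression of F.
import numpy as np

def proj(X):
    """Orthogonal projection onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(0)
m = 4
g = np.repeat([0, 1, 2], m)
n, p, q = g.size, 3, 2
X_u = np.column_stack([np.ones(n), (g == 1).astype(float), (g == 2).astype(float)])
X_r = X_u[:, :1]
C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

y = rng.normal(size=n) + 0.5 * (g == 2)          # arbitrary toy response

P_u, P_r = proj(X_u), proj(X_r)
SSE_u = y @ (np.eye(n) - P_u) @ y
SSE_r = y @ (np.eye(n) - P_r) @ y
F_projection = ((SSE_r - SSE_u) / q) / (SSE_u / (n - p))

beta_hat = np.linalg.solve(X_u.T @ X_u, X_u.T @ y)
psi_hat = C @ beta_hat
middle = np.linalg.inv(C @ np.linalg.inv(X_u.T @ X_u) @ C.T)
F_quadratic = (psi_hat @ middle @ psi_hat / q) / (SSE_u / (n - p))

print(F_projection, F_quadratic)                 # identical up to rounding
assert np.isclose(F_projection, F_quadratic)
```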

As a start
$$
\begin{array}{rcl}
(C \hat{\beta})' (C(X'X)^{-1}C')^{-1} (C \hat{\beta}) &=& (C (X'X)^{-1} X' y)' (C(X'X)^{-1}C')^{-1} (C (X'X)^{-1} X' y) \\
~ &=& y' X (X'X)^{-1} C' (C(X'X)^{-1}C')^{-1} C (X'X)^{-1} X' y
\end{array}
$$

  • How do I see that $P_{u} - P_{r} = X (X'X)^{-1} C' (C(X'X)^{-1}C')^{-1} C (X'X)^{-1} X'$? (A numerical check of this identity appears after this list.)
  • What is the explanation for / motivation behind the numerator of the second test statistic? I can see that $C(X'X)^{-1}C'$ is $V(C \hat{\beta}) / \sigma^{2} = (\sigma^{2} C(X'X)^{-1}C') / \sigma^{2}$, but I can't put these pieces together.
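For what it's worth, the identity in the first bullet does check out numerically in the toy example (same construction as above, names are my own):

```python
# Check numerically that P_u - P_r equals the "sandwich" matrix from the
# quadratic form, using the toy 3-group design from above.
import numpy as np

def proj(X):
    """Orthogonal projection onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

m = 4
g = np.repeat([0, 1, 2], m)
n = g.size
X_u = np.column_stack([np.ones(n), (g == 1).astype(float), (g == 2).astype(float)])
X_r = X_u[:, :1]
C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

XtX_inv = np.linalg.inv(X_u.T @ X_u)
sandwich = X_u @ XtX_inv @ C.T @ np.linalg.inv(C @ XtX_inv @ C.T) @ C @ XtX_inv @ X_u.T

print(np.allclose(proj(X_u) - proj(X_r), sandwich))   # True
```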

Best Answer

For your second question, you have $\mathbf{y}\sim N(\mathbf{X}\boldsymbol{\beta},\sigma^2 \mathbf{I})$ and suppose you're testing $\mathbf{C}\boldsymbol{\beta}=\mathbf{0}$. So, under this null hypothesis we have that (the following is all shown through matrix algebra and properties of the normal distribution; I'm happy to walk through any of these details)

$ \mathbf{C}\hat{\boldsymbol{\beta}}\sim N(\mathbf{0}, \sigma^2 \mathbf{C(X'X)^{-1}C'}). $

And so,

$ \textrm{Cov}(\mathbf{C}\hat{\boldsymbol{\beta}})=\sigma^2 \mathbf{C(X'X)^{-1}C'}, $

which leads to noting that

$ F_1 = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C(X'X)^{-1}C'}]^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}}{\sigma^2}\sim \chi^2 \left(q\right). $

You get the above result because the numerator of $F_1$ is a quadratic form and by invoking a certain theorem. This theorem states that if $\mathbf{x}\sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{x'Ax}\sim \chi^2 (r,\lambda)$, a (possibly noncentral) chi-square with $r=\textrm{rank}(\mathbf{A})$ degrees of freedom and noncentrality parameter $\lambda=\frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$, iff $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent. [The proof of this theorem is a bit long and tedious, but it's doable. Hint: use the moment generating function of $\mathbf{x'Ax}$.]

So, since $\mathbf{C}\hat{\boldsymbol{\beta}}$ is normally distributed and the numerator of $F_1$ is a quadratic form involving $\mathbf{C}\hat{\boldsymbol{\beta}}$, we can use the above theorem. The idempotent part is immediate here: with $\mathbf{A} = [\mathbf{C(X'X)^{-1}C'}]^{-1}/\sigma^2$ and $\boldsymbol{\Sigma} = \sigma^2\,\mathbf{C(X'X)^{-1}C'}$ we get $\mathbf{A}\boldsymbol{\Sigma} = \mathbf{I}_q$, which is idempotent with rank $q$, and under the null $\boldsymbol{\mu} = \mathbf{0}$, so the noncentrality parameter vanishes.
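If it helps, the $\chi^2(q)$ claim is also easy to check by simulation. Here is a small self-contained sketch (a toy design of my own: three groups with an intercept and two dummies, true coefficients chosen so that $\mathbf{C}\boldsymbol{\beta}=\mathbf{0}$ holds) comparing the empirical mean and variance of $F_1$ with the $\chi^2(q)$ values $q$ and $2q$:

```python
# Simulate F_1 under H0 and compare its moments with those of chi-square(q).
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 4, 1.0
g = np.repeat([0, 1, 2], m)
n, q = g.size, 2
X = np.column_stack([np.ones(n), (g == 1).astype(float), (g == 2).astype(float)])
C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
beta_null = np.array([1.0, 0.0, 0.0])          # satisfies C @ beta = 0

XtX_inv = np.linalg.inv(X.T @ X)
middle = np.linalg.inv(C @ XtX_inv @ C.T)

stats = []
for _ in range(20000):
    y = X @ beta_null + sigma * rng.normal(size=n)
    psi_hat = C @ XtX_inv @ X.T @ y            # = C @ beta_hat
    stats.append(psi_hat @ middle @ psi_hat / sigma**2)

stats = np.array(stats)
print(stats.mean(), stats.var())               # close to q = 2 and 2q = 4
```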

Then,

$ F_2 = \frac{\mathbf{y}'[\mathbf{I} - \mathbf{X(X'X)^{-1}X'}]\mathbf{y}}{\sigma^2}\sim \chi^2(n-p), $

since the residual projection $\mathbf{I} - \mathbf{X(X'X)^{-1}X'}$ has rank $n-p$ when $\mathbf{X}$ has full column rank $p$ (as in your setup).

Through some tedious details, you can show that $F_1$ and $F_2$ are independent (intuitively, $\mathbf{C}\hat{\boldsymbol{\beta}}$ is a function of the fitted values $P_{u}\mathbf{y}$, while $F_2$ is a function of the residuals $(\mathbf{I}-P_{u})\mathbf{y}$, and these two pieces are independent under normality). From there you should be able to justify your second $F$ statistic.
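To make that last step explicit (this is just the standard construction of an $F$ variable as a ratio of independent chi-squares, each divided by its degrees of freedom, so the unknown $\sigma^2$ cancels):

$ F = \frac{F_1/q}{F_2/(n-p)} = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C(X'X)^{-1}C'}]^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}/q}{\mathbf{y}'[\mathbf{I} - \mathbf{X(X'X)^{-1}X'}]\mathbf{y}/(n-p)} \sim F(q,\, n-p) $

under the null, which is exactly your second expression.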
