Regression Analysis – Understanding the Power of the Regression F Test

Tags: f-distribution, hypothesis-testing, non-central, regression, statistical-power

The classical F-test for subsets of variables in multilinear regression has the form
$$
F = \frac{(\mbox{SSE}(R) - \mbox{SSE}(B))/(df_R - df_B)}{\mbox{SSE}(B)/df_B},
$$
where $\mbox{SSE}(R)$ is the sum of squared errors under the 'reduced' model $R$, which nests inside the 'big' model $B$, $\mbox{SSE}(B)$ is the sum of squared errors under the big model, and $df_R$ and $df_B$ are the residual degrees of freedom of the two models. Under the null hypothesis that the extra variables in the 'big' model have no linear explanatory power, the statistic is distributed as an F with $df_R - df_B$ and $df_B$ degrees of freedom.
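
As a minimal sketch, the statistic can be computed by hand in R and checked against `anova()`. The simulated data, the names (`x1`, `x2`, `x3`, `fit_R`, `fit_B`) and the coefficient values are illustrative assumptions, not part of the original question, and $df$ is taken to be the residual degrees of freedom.

```r
# Simulated data: every value here is an illustrative assumption
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit_R <- lm(y ~ x1)            # reduced model
fit_B <- lm(y ~ x1 + x2 + x3)  # big model, nests the reduced one

# Sums of squared errors and residual degrees of freedom
sse_R <- sum(resid(fit_R)^2); df_R <- df.residual(fit_R)
sse_B <- sum(resid(fit_B)^2); df_B <- df.residual(fit_B)

# F statistic and its p-value under the null (central F)
F_stat <- ((sse_R - sse_B) / (df_R - df_B)) / (sse_B / df_B)
p_val  <- pf(F_stat, df_R - df_B, df_B, lower.tail = FALSE)

# anova() on the two nested fits reports the same test
tab <- anova(fit_R, fit_B)
```

Here `tab$F[2]` and `tab$"Pr(>F)"[2]` should agree with `F_stat` and `p_val`.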

What is the distribution, however, under the alternative? I assume it is a non-central F (I hope not doubly non-central), but I cannot find any reference on what exactly the non-centrality parameter is. I am going to guess it depends on the true regression coefficients $\beta$, and probably on the design matrix $X$, but beyond that I am not so sure.

Best Answer

The noncentrality parameter $\delta^{2}$ is given below, where $P_{r}$ is the projection matrix onto the column space of the restricted model's design matrix $X_{r}$, $\beta$ is the vector of true parameters, $X$ is the design matrix of the unrestricted (true) model, and $|| x ||$ denotes the Euclidean norm:

$$ \delta^{2} = \frac{|| X \beta - P_{r} X \beta ||^{2}}{\sigma^{2}} $$

You can read the formula like this: $E(y | X) = X \beta$ is the vector of expected values conditional on the design matrix $X$. If you treat $X \beta$ as an empirical data vector $y$, then its projection onto the restricted model subspace is $P_{r} X \beta$, which gives you the prediction $\hat{y}$ from the restricted model for that "data". Consequently, $X \beta - P_{r} X \beta$ is analogous to $y - \hat{y}$ and gives you the error of that prediction. Hence $|| X \beta - P_{r} X \beta ||^{2}$ gives the sum of squares of that error. If the restricted model is true, then $X \beta$ already is within the subspace defined by $X_{r}$, and $P_{r} X \beta = X \beta$, such that the noncentrality parameter is $0$.
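
The formula can be sketched numerically in R. The design matrix, the coefficient vector `beta`, `sigma`, and the helper name `ncp_f` are all hypothetical choices for illustration. The sketch relies on R's `pf()` taking `ncp` as the sum of squared normal means divided by the variance, which matches $\delta^{2}$ as defined here, and it adds a small simulation as a sanity check on the resulting power.

```r
# Noncentrality parameter: squared norm of (X beta - P_r X beta) over sigma^2
ncp_f <- function(X, X_r, beta, sigma) {
  mu  <- X %*% beta                             # expected values E(y | X)
  P_r <- X_r %*% solve(crossprod(X_r), t(X_r))  # projection onto col space of X_r
  sum((mu - P_r %*% mu)^2) / sigma^2
}

set.seed(2)
n     <- 40
X     <- cbind(1, rnorm(n), rnorm(n))  # assumed unrestricted design
X_r   <- X[, 1:2]                      # restricted model drops the last column
beta  <- c(1, 0.5, 0.4)                # assumed true coefficients
sigma <- 1

delta2 <- ncp_f(X, X_r, beta, sigma)
df1    <- ncol(X) - ncol(X_r)          # numerator df
df2    <- n - ncol(X)                  # denominator df (residual df, big model)

# Power of the 5% level test from the noncentral F distribution
f_crit <- qf(0.95, df1, df2)
power  <- pf(f_crit, df1, df2, ncp = delta2, lower.tail = FALSE)

# Monte Carlo check: simulate y, compute F, count rejections
sse <- function(M, y) sum(lm.fit(M, y)$residuals^2)
mu  <- drop(X %*% beta)
reject <- replicate(2000, {
  y <- mu + rnorm(n, sd = sigma)
  f <- ((sse(X_r, y) - sse(X, y)) / df1) / (sse(X, y) / df2)
  f > f_crit
})
power_sim <- mean(reject)
```

If the last entry of `beta` is set to 0, the restricted model is true and `ncp_f()` returns (numerically) zero, so the power collapses to the test's level.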

You should find this in Mardia, Kent & Bibby. (1980). Multivariate Analysis.

Related Solutions

Solved – Sample size formula for an F-test

I am wondering if there is a sample size formula like Lehr's formula that applies to an F-test?

The webpage "Power Tools for Epidemiologists" explains:

Difference Between Two Means (Lehr):

Say, for example, you want to demonstrate a 10 point difference in IQ between two groups, one of which is exposed to a potential toxin, the other of which is not. Using a mean population IQ of 100, and a standard deviation of 20:

$$n_{group}=\frac{16}{\left((100-90)/20\right)^2}$$

$$n_{group}=\frac{16}{(.5)^2}=64$$
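
A minimal R sketch of this calculation follows. The function name `n_lehr` is hypothetical. The constant 16 is the usual rounding of $2(z_{0.975}+z_{0.80})^2$, which corresponds to a two-sided 5% test with 80% power. The comparison with `power.t.test()` is only a rough cross-check, since that function uses the t rather than the normal distribution.

```r
# Lehr's rule of thumb: n per group is about 16 / (standardized difference)^2
n_lehr <- function(mu0, mu1, sd) {
  d <- (mu0 - mu1) / sd   # standardized difference between the two means
  16 / d^2
}

n_iq <- n_lehr(100, 90, 20)   # the IQ example: 16 / 0.5^2

# Where the 16 comes from: two-sided alpha = 0.05, power = 0.80
const <- 2 * (qnorm(0.975) + qnorm(0.80))^2

# Cross-check against the exact two-sample t-test calculation in base R
n_exact <- power.t.test(delta = 10, sd = 20, power = 0.80)$n
```

In this example `n_iq` is 64, `const` is a little under 16, and `n_exact` comes out close to 64.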

Percentage Change in Means

Clinical researchers may be more comfortable thinking in terms of percentage changes rather than differences in means and variability. For example, someone might be interested in a 20% difference between two groups in data with about 30% variability. Professor van Belle presents a neat approach to these kinds of numbers that uses the coefficient of variation (c.v.) and translates percentage change into a ratio of means.

The variance on the log scale (see chapter 5 in van Belle) is approximately equal to the squared coefficient of variation on the original scale, so Lehr's formula can be translated into a version that uses the c.v.:

$$n_{group}=\frac{16(c.v.)^2}{(\ln(\mu_0)-\ln(\mu_1))^2}$$
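
One way to see where this version comes from, assuming lognormal data with the same c.v. in both groups (an assumption not stated in the source): write $\sigma_{\log}^2$ for the variance of the log-transformed data. Then

$$
\sigma_{\log}^2 = \ln\left(1 + (c.v.)^2\right) \approx (c.v.)^2 \quad \text{for small } c.v.,
$$

and, because both groups share the same $\sigma_{\log}^2$, the difference in log-scale means equals $\ln(\mu_0)-\ln(\mu_1)$. Applying Lehr's formula on the log scale gives

$$
n_{group} = \frac{16\,\sigma_{\log}^2}{(\ln(\mu_0)-\ln(\mu_1))^2} \approx \frac{16(c.v.)^2}{(\ln(\mu_0)-\ln(\mu_1))^2}.
$$

For $c.v. = 0.3$ the approximation replaces $\ln(1.09) \approx 0.086$ with $0.09$, so it slightly overstates the required sample size.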

We can then express the percentage change (p.c.) in terms of the ratio of means (r.m.), where

$$p.c.=\frac{\mu_0-\mu_1}{\mu_0}=1-\frac{\mu_1}{\mu_0}, \qquad r.m.=\frac{\mu_1}{\mu_0}=1-p.c.$$

to formulate a rule of thumb:

$$n_{group}=\frac{16 (c.v.)^2}{(\ln(r.m.))^2}$$

In the example above, a 20% change translates to a ratio of means of 1−.20=.80. (A 5% change would result in a ratio of means of 1−.05=.95; a 35% change 1−.35=.65, and so on.) So, the sample size for a study seeking to demonstrate a 20% change in means with data that varies about 30% around the means would be

$$n_{group}=\frac{16(.3)^2}{(\ln(.8))^2}\approx 29$$

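An R function based on this rule would be:

    # Approximate sample size per group for a coefficient of variation (cv)
    # and a percentage change (pc), both given as proportions
    nPC <- function(cv, pc) {
        16 * cv^2 / log(1 - pc)^2
    }
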
Say you were interested in a 15% change from one group to another, but were uncertain about how the data varied. You could look at a range of values for the coefficient of variation:

    a <- c(.05, .10, .15, .20, .30, .40, .50, .75, 1)
    nPC(a, .15)

You could use this to graphically display your results:

    plot(a, nPC(a, .15), ylab = "Number in Each Group",
         xlab = "By Varying Coefficient of Variation",
         main = "Sample Size Estimate for a 15% Difference")

See also: iSixSigma "How to Determine Sample Size" and RaoSoft "Online Sample Size Calculator".

Solved – Appropriate residual degrees of freedom after dropping terms from a model

Do you disagree with @FrankHarrel's answer that parsimony comes with some ugly scientific trade-offs, anyways?

I love the link provided in @MikeWiezbicki's comment to Doug Bates' rationale. If someone disagrees with your analysis, they can do it their way, and this is a fun way to start a scientific discussion about your base assumptions. A p-value does not make your conclusion an "absolute truth".

If the decision of whether to include a parameter in your model comes down to splitting hairs over what are, for scientifically meaningful samples, relatively small discrepancies in the df (and you are not dealing with $n<p$ problems that justify more nuanced inference anyway), then the parameter is so close to your cutoff that you should be transparent about it either way. Include it, or analyze the model with and without it, but discuss your decision openly in the final analysis.
