Permutation Test – How to Use Permutation Test for F-Statistics in OLS?

hypothesis-testing permutation-test

Can anyone explain to me what the point of performing a permutation test is?

For example, in an OLS analysis we fit $X_i\beta$ to the responses $Y_i$. For the permutation test, we let $\tau$ be a permutation of $\{1,\dots,n\}$ and fit $X_i\beta$ to the permuted responses $Y_{\tau(i)}$. We do this repeatedly, for permutations $\tau_1,\dots,\tau_N$, with say $N = 4000$. Then we see where the original observed F-statistic $F_{\text{obs}}$ falls among the 4000 permutation F-statistic values.
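In code, the procedure looks something like this (a minimal NumPy sketch; the helper names `f_statistic` and `permutation_f_test` are mine, not standard functions):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_statistic(X, y):
    """Overall F-statistic for OLS of y on X (X includes an intercept column)."""
    n, p = X.shape                          # p counts the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)       # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return ((tss - rss) / (p - 1)) / (rss / (n - p))

def permutation_f_test(X, y, n_perm=4000):
    f_obs = f_statistic(X, y)
    f_perm = np.array([f_statistic(X, rng.permutation(y))
                       for _ in range(n_perm)])
    # +1 in numerator and denominator so the estimate is never exactly zero
    # (see the Smyth & Phipson reference at the end of the answer)
    p_value = (1 + np.sum(f_perm >= f_obs)) / (1 + n_perm)
    return f_obs, p_value

# toy data: intercept plus two predictors, one of which actually matters
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
print(permutation_f_test(X, y))
```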

What is the point of doing that? And what is the theoretical reasoning behind a permutation test?

Best Answer

If none of the $x$-variables relate to the mean response, then the $y$'s are a set of observations from distributions with the same expectation, $E(y|X)=\mu_Y$.

The idea of a permutation test is that if we further assume the distributions are the same (for a permutation test to work, the variables from which the permuted observations are drawn must be exchangeable), then the labels "$i$" and "$j$" don't carry any information (i.e. associating $y_i$ with the vector $X_i$ tells you no more than associating it with $X_j$): the labels on the $y$'s are arbitrary.

So if they're arbitrary, permuting the $y$'s should not change the distribution of the test statistic.

On the other hand, if at least some of the predictors in the model are useful, the labels are meaningful (i.e. $E(y|X)\neq\mu_Y$) -- you couldn't scramble the order of $y$ without losing the connection between $y_i$ and its associated $X_i$.

We compute the test statistic for every possible scrambling of the $y$'s; most will be no more than very weakly related to the $X$'s (since the ordering is random), and a few will be more strongly related (by chance we pick an ordering that's related to the $X$'s).

Now, if the labels really don't matter, the statistic computed from your original sample order shouldn't be atypical among the permutation values, except by rare chance.

So let's consider the F-statistic. If your sample F-statistic falls into the extreme upper tail of the permutation distribution, then either (a) the labels don't matter but an extremely rare event occurred by chance (the errors were correlated with the $x$'s by chance, resulting in a large F), or (b) the hypothesis that the labels don't matter is false.

We choose the proportion of times we're prepared to reject when the labels really don't matter (our significance level), and call the most extreme fraction $\alpha$ of our permutation statistics the rejection region. If your sample $F$ falls in there, we reject.
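In code, that decision rule is just a quantile comparison. A minimal sketch (here `f_perm` is faked from an F distribution purely for illustration; in practice it would come from a permutation loop like the one in the question, and `f_obs` is a hypothetical observed value):

```python
import numpy as np

rng = np.random.default_rng(42)
f_perm = rng.f(2, 47, size=4000)   # stand-in for the permutation F values
f_obs = 5.8                        # hypothetical observed F-statistic

alpha = 0.05
threshold = np.quantile(f_perm, 1 - alpha)   # boundary of the extreme alpha fraction
print("reject" if f_obs > threshold else "fail to reject")
```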

The advantage over the use of F-tables is that you're no longer reliant on the error distribution being normal. All you need is that, under the null, the variates from which the observations are drawn are exchangeable -- in effect, that they have the same distribution.

Of course, in large samples we can't hope to evaluate the entire permutation distribution (there are $n!$ permutations). In those cases, we sample from it (typically with replacement).

While the p-value you'd obtain from random sampling of the permutation distribution is a random variable, you can compute a standard error or margin of error for it, and so decide in advance how many permutations to sample.

e.g. roughly speaking, if for p-values near 0.05 I want my interval to include the true p-value about 95% of the time and to have half-width about $0.002$, I need the standard error $\sqrt{0.05\times0.95/n}$ to be less than about $0.001$. From that I can back out an approximate $n$: somewhere around 50,000. So, for example, if I got an estimated p-value of 0.045, my confidence interval for $p$ in that region would not include 0.05. However, if I had simulated only 1000 times (or even 4000 times) and got an estimated $p$ of 0.045, I couldn't be reasonably confident that the true $p$ was not above 0.05.
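As a sanity check on that arithmetic (using 1.96, the 95% normal quantile, in place of the rougher factor of 2 above):

```python
import math

p = 0.05            # p-values near the region of interest
z = 1.96            # 95% normal quantile
half_width = 0.002  # desired half-width of the interval for p

# want z * sqrt(p * (1 - p) / n) <= half_width
n = math.ceil(p * (1 - p) * (z / half_width) ** 2)
print(n)  # 45619 -- on the order of 50,000 resamples, as stated
```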

(With such sampling I tend to use $10^5$ -- or more -- re-samples for this reason, unless the calculations are very slow.)

Permutation/randomization tests have one very nice advantage -- you can tailor a statistic to match exactly what you want to be able to pick up. You could start with a statistic with good properties near some model and then robustify your statistic against some form of anticipated deviation from that if you wish. It's all fine as long as under the null hypothesis the observations remain exchangeable (and sometimes it may be difficult to construct a statistic such that they are -- this is related to the problem of constructing a suitable rank statistic, but here we're not necessarily ranking).

[Rank-based nonparametric tests like the Kruskal-Wallis are essentially permutation tests performed on ranks. They have the advantage that, because the ranks are known before you start, the null distribution of the test statistic doesn't change when your sample does. That was a big advantage when people didn't all have computers on their desks, but it's less necessary now. Of course, in many cases rank statistics have other nice advantages, such as robustness to heavy tails.]
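To see that connection concretely, here's a small sketch (assuming scipy is available) comparing the Kruskal-Wallis test's tabulated p-value with one obtained by permuting the group labels and recomputing the same rank statistic -- the two should agree closely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(m, 1, 10) for m in (0.0, 0.5, 1.0)])
labels = np.repeat([0, 1, 2], 10)

h_obs, p_tabulated = stats.kruskal(*(y[labels == g] for g in range(3)))

# permutation distribution of the same statistic under scrambled labels
n_perm = 4000
h_perm = np.empty(n_perm)
for i in range(n_perm):
    lab = rng.permutation(labels)
    h_perm[i] = stats.kruskal(*(y[lab == g] for g in range(3))).statistic
p_perm = (1 + np.sum(h_perm >= h_obs)) / (1 + n_perm)

print(p_tabulated, p_perm)  # should be close
```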

Fisher once made an argument that statistics derived via the more usual "random sampling" approach (like the usual t-test) are really only valid insofar as they approximate a permutation test (I'll see if I can track down a proper reference for this claim). However, a permutation test doesn't require a random-sampling argument to be valid -- it can be based on random assignment to treatment. On the other hand, if you want to generalize the conclusions to a wider population of interest, you may need to invoke something like a random-sample-of-the-population argument at that point.

Some precautions related to the calculation of p-values -- see

Gordon K. Smyth & Belinda Phipson (2010). Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn. Stat Appl Genet Mol Biol, 9(1), Article 39. doi: 10.2202/1544-6115.1585.

http://www.statsci.org/webguide/smyth/pubs/permp.pdf (with corrections)