The literature distinguishes between two types of permutation tests: (1) the randomization test, a permutation test in which exchangeability is guaranteed by the random assignment of experimental units to conditions; and (2) the permutation test proper, the exact same test applied to a situation where assumptions other than random assignment are needed to justify exchangeability.
Some references regarding the naming conventions (i.e., randomization vs. permutation): Kempthorne & Doerfler, Biometrika, 1969; Edgington & Onghena, Randomization Tests, 4th ed., 2007.
As for assumptions, the randomization test (i.e., Fisher's randomization test for experimental data) only requires what Donald Rubin refers to as the stable unit treatment value assumption (SUTVA). See Rubin's 1980 comment on Basu's paper in JASA. SUTVA is also one of the fundamental assumptions (along with strong ignorability) for causal inference under the Neyman-Rubin potential outcomes model (cf. Paul Holland's 1986 JASA paper). Essentially, SUTVA says that there is no interference between units and that the treatment conditions are the same for all recipients. More formally, SUTVA requires that each unit's potential outcomes depend only on the treatment that unit receives -- not on the assignments of other units -- and that there are no hidden versions of the treatment.
Consider the two-sample problem with participants randomly assigned to a control group or a treatment group. SUTVA would be violated if, for example, two study participants were acquainted and the assignment status of one of them exerted some influence on the outcome of the other. This is what is meant by no interference between units.
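For concreteness, here is a minimal sketch of a randomization test for this two-sample design (the data, effect size, and function name are all hypothetical, and this is one of many ways to set it up):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomization_test(y, group, n_perm=10_000, rng=rng):
    """Two-sample randomization test on the difference in means.

    y     : outcomes for all participants
    group : boolean array, True = treatment, False = control
    Returns a two-sided Monte Carlo p-value.
    """
    observed = y[group].mean() - y[~group].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(group)              # re-randomize the labels
        stat = y[perm].mean() - y[~perm].mean()
        if abs(stat) >= abs(observed):
            count += 1
    # add-one correction so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)

# illustrative (simulated) data: a modest treatment effect
y = np.concatenate([rng.normal(0, 1, 20), rng.normal(0.5, 1, 20)])
group = np.array([False] * 20 + [True] * 20)
p = randomization_test(y, group)
```

The justification for re-permuting the labels is exactly the physical act of randomization: under the null, every re-assignment was equally likely to have been the one actually realized.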
The above discussion applies to the randomization test, wherein participants were randomly assigned to groups. In the context of a permutation test, SUTVA is also necessary, but its justification cannot rest on randomization, because there was none.
In the absence of random assignment, the validity of permutation tests may rely on distributional assumptions, such as identical shape of distribution or symmetric distributions (depending on the test), to satisfy exchangeability (see Box and Andersen, JRSSB, 1955).
In an interesting paper, Hayes (Psychological Methods, 1996) shows through simulation how Type I error rates can become inflated when permutation tests are applied to non-randomized data.
If none of the $x$-variables relate to the mean response, then the $y$'s are a set of observations from distributions with the same expectation $E(y|X)=\mu_Y$.
The idea of a permutation test is that if we further assume the distributions are the same (for a permutation test to work, the variables from which the permuted observations are drawn must be exchangeable), then the labels "$i$" and "$j$" carry no information (i.e. associating $y_i$ with the vector $X_i$ tells you no more than associating it with $X_j$) -- the labels on the $y$'s are arbitrary.
So if the labels are arbitrary, permuting the $y$'s should not change the distribution of the test statistic.
On the other hand, if at least some of the predictors in the model are useful, the labels are meaningful (i.e. $E(y|X)\neq\mu_Y$) -- you couldn't scramble the order of $y$ without losing the connection between $y_i$ and its associated $X_i$.
We compute the test statistic -- whatever it is -- for all possible scrambled $y$'s; most permutations won't be more than very weakly related to the $X$'s (since the ordering is random), and a few will be more strongly related (by chance we picked an order that's related to the $X$'s).
Now, if the labels really don't matter, your sample order won't be expected to be especially atypical except by rare chance.
So let's consider the $F$-statistic. If your sample $F$-statistic falls into the extreme upper tail of the permutation distribution of $F$, then either (a) the labels don't matter but an extremely rare event occurred by chance (the errors were correlated with the $x$'s by chance, resulting in a large $F$), or (b) the hypothesis that the labels don't matter is false.
We choose the proportion of times we're prepared to reject when in the situation where the labels don't matter (our significance level), and call the most extreme fraction $\alpha$ of our permutation-statistics our rejection region. If your sample $F$ is in there, we reject.
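The whole recipe above (permute the $y$'s, recompute $F$, compare with the observed $F$) can be sketched in a few lines. Everything below -- the data, sample size, and function names -- is a hypothetical illustration, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_statistic(y, X):
    """Overall F statistic for the linear model y = X b + e
    (X includes an intercept column)."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / (k - 1)) / (rss / (n - k))

def permutation_f_test(y, X, n_perm=5000, rng=rng):
    """Permute y against the rows of X; count how often the
    permuted F matches or exceeds the observed F."""
    f_obs = f_statistic(y, X)
    exceed = sum(f_statistic(rng.permutation(y), X) >= f_obs
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)   # add-one correction

# simulated example: one genuinely useful predictor
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 0.8 * x + rng.normal(size=n)
p = permutation_f_test(y, X)
```

Note that only $y$ is permuted; the rows of $X$ stay fixed, which is exactly the sense in which the "labels" are being scrambled.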
The advantage over the use of F-tables is you're no longer reliant on the error distribution being normal. All you need is that under the null the variates from which the observations are drawn are exchangeable -- in effect, that they have the same distribution.
Of course in large samples, we can't hope to evaluate the entire permutation distribution. In those cases, we sample from it (typically with replacement).
While the p-value you'd obtain from random sampling of the permutation distribution is itself a random variable, you can compute a standard error or margin of error for it, and so decide in advance how large a sample you want to take.
e.g. roughly speaking, if for p-values near 0.05 I want my interval to include the true p-value about 95% of the time and to be of size $\pm$0.002, I need $\sqrt{0.05\times0.95/n}$ to be less than about 0.001. From that I can back out an approximate $n$ ... somewhere around 50000, and so, for example, if I got an estimated p value of 0.045 my confidence interval for $p$ in that region would not include 0.05. However, if I had only simulated 1000 times (or even 4000 times) and got an estimated $p$ of 0.045, I couldn't be reasonably confident that the true $p$ would not be above 0.05.
(With such sampling I tend to use $10^5$ -- or more -- re-samples for this reason, unless the calculations are very slow.)
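The back-of-envelope calculation above is just the binomial standard error turned around to solve for $n$; a small helper (the function name is mine, not standard) makes the arithmetic explicit:

```python
import math

def mc_sample_size(p, half_width, z=1.96):
    """Number of permutation resamples needed so that a ~95% interval
    for a Monte Carlo p-value near p has about the given half-width,
    using the binomial SE sqrt(p(1-p)/n)."""
    se_target = half_width / z
    return math.ceil(p * (1 - p) / se_target ** 2)

# near p = 0.05 with a +/- 0.002 interval: on the order of 45-50 thousand,
# matching the rough "somewhere around 50000" figure above
n = mc_sample_size(0.05, 0.002)
```

Using the rougher $z = 2$ from the text gives $0.0475 / 0.001^2 = 47{,}500$, the same order of magnitude.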
Permutation/randomization tests have one very nice advantage -- you can tailor a statistic to match exactly what you want to be able to pick up. You could start with a statistic with good properties near some model and then robustify your statistic against some form of anticipated deviation from that if you wish. It's all fine as long as under the null hypothesis the observations remain exchangeable (and sometimes it may be difficult to construct a statistic such that they are -- this is related to the problem of constructing a suitable rank statistic, but here we're not necessarily ranking).
[Rank based nonparametric tests like the Kruskal-Wallis are essentially permutation tests performed on ranks. They have the advantage that because ranks are known before you start, the distribution of the test statistic under the null doesn't change when your sample does. That's a big advantage when people don't all have computers on their desks, but less necessary now. Of course in many cases rank statistics have other nice advantages, such as robustness to heavy tails]
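To make the "permutation test performed on ranks" point concrete, here is a sketch of a two-sample rank-sum test done by explicit permutation of the pooled ranks (no tie handling; the data and function name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def rank_sum_permutation(x, y, n_perm=10_000, rng=rng):
    """Two-sample rank-sum test as a permutation test on pooled ranks.
    Because the pooled ranks are always 1..N regardless of the data,
    the null distribution depends only on the sample sizes."""
    pooled = np.concatenate([x, y])
    ranks = pooled.argsort().argsort() + 1.0   # ranks 1..N (ignoring ties)
    n, N = len(x), len(pooled)
    expected = n * (N + 1) / 2                 # null mean of the rank sum
    observed = abs(ranks[:n].sum() - expected)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(ranks)
        if abs(perm[:n].sum() - expected) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# simulated example with a location shift
p = rank_sum_permutation(rng.normal(0, 1, 15), rng.normal(1, 1, 15))
```

Since `ranks` is always the same set of values for given sample sizes, the permutation distribution here could in principle be tabulated once and for all -- which is exactly the historical advantage described above.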
Fisher once made an argument that statistics derived by the more usual "random sampling" approach (like the usual t-test) are really only valid insofar as they approximate a permutation test (I'll see if I can track down a proper reference for this claim). However, a permutation test doesn't require a random-sampling argument to be valid (it can instead rest on random assignment to treatment) -- but on the other hand, if you want to generalize the conclusions to the wider population of interest, you may need to invoke something like a random-sample-of-the-population argument at that point.
Some precautions related to the calculation of p-values; see:

Phipson, B., & Smyth, G. K. (2010). Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Stat Appl Genet Mol Biol, 9(1), Article 39. doi: 10.2202/1544-6115.1585. Epub 2010 Oct 31.
http://www.statsci.org/webguide/smyth/pubs/permp.pdf (with corrections)
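The central recommendation of that paper, as I read it, is to count the observed arrangement itself among the permutations, so a Monte Carlo p-value can never come out as zero:

```python
def permutation_p_value(n_exceed, n_perm):
    """p = (b + 1) / (m + 1): b exceedances among m random permutations,
    counting the observed arrangement itself as one permutation."""
    return (n_exceed + 1) / (n_perm + 1)

# even with zero exceedances, the smallest reportable p is 1/(m + 1)
p_min = permutation_p_value(0, 999)
```

Reporting $p = 0$ from a finite number of random permutations would claim more evidence than the simulation can possibly provide.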
The typical permutation test, as you state, has a null hypothesis saying that the observations are exchangeable. If you have nested models $H_A: Y_i = \beta X_i + \epsilon_i$ and $H_0: Y_i = \epsilon_i$, with $\epsilon_i$ iid and zero-mean, then the exchangeability is a consequence of the form of the smaller model.
If you have $H_A: Y_i = \beta X_i + \gamma Z_i + \epsilon_i$ and $H_0: Y_i = \beta X_i + \epsilon_i$, then the observations $Y_i$ are not exchangeable even under the small model. Instead, you could permute the null-model residuals $Y_i - \hat \beta_0 X_i $, where $\hat \beta_0$ is an estimate of $\beta$ under the null hypothesis $\gamma=0$. If $\hat \beta_0$ is the usual least-squares estimate $(X^TX)^{-1}X^TY$, then under the null model, you're permuting $(I - X(X^TX)^{-1}X^T)\epsilon$, whose entries are not really exchangeable. For example, if $X$ is $[1, 0, 0, 0]$, the first residual is deterministically zero and the rest aren't. I recommend the bootstrap strategy from glen_b's answer, linked in comment above.
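To see the non-exchangeability concretely, here is a tiny numpy sketch of the $X = [1, 0, 0, 0]$ example above (the seed and values are arbitrary):

```python
import numpy as np

# With X = [1, 0, 0, 0]^T, the hat matrix H = X(X'X)^{-1}X' pins the
# first entry of (I - H) eps to exactly zero, so the null-model
# residuals cannot be exchangeable: one of them is deterministic.
X = np.array([[1.0], [0.0], [0.0], [0.0]])
H = X @ np.linalg.inv(X.T @ X) @ X.T
eps = np.random.default_rng(3).normal(size=4)
resid = (np.eye(4) - H) @ eps
```

The first residual is identically zero whatever $\epsilon$ is, while the other three are just $\epsilon_2, \epsilon_3, \epsilon_4$ -- clearly not four draws with a permutation-invariant joint distribution.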
Edit: what to do with nonlinear models
As pointed out in comments, you can fit a linear model to the residuals of another linear model and achieve the same predictions as you would via simultaneous estimation. This makes it obvious how to proceed (if you have nested linear models and you don't care about the lack of exchangeability of the null-model residuals): fit the null model, permute the residuals, then fit the remaining variables.
You can also do this with a nonlinear model: to implement it, you only need each model to generate predictions; the models don't need to be linear. But again, don't do this uncritically, because those residuals are not actually exchangeable, even under the null model.
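For what it's worth, the permute-the-null-residuals procedure described above (fit the null model, permute its residuals, refit the larger model) can be sketched as follows -- with the caveat, repeated from above, that the null residuals are only approximately exchangeable. All names and data here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

def nested_permutation_test(y, X_null, X_full, n_perm=2000, rng=rng):
    """Fit the null model, permute its residuals around its fitted
    values, refit both models, and compare the RSS improvement
    the extra variables provide."""
    beta0 = np.linalg.lstsq(X_null, y, rcond=None)[0]
    fitted0 = X_null @ beta0
    resid0 = y - fitted0
    observed = rss(X_null, y) - rss(X_full, y)
    count = 0
    for _ in range(n_perm):
        y_star = fitted0 + rng.permutation(resid0)   # permute null residuals
        if rss(X_null, y_star) - rss(X_full, y_star) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# simulated example: does z add anything beyond x? (here z is truly irrelevant)
n = 40
x, z = rng.normal(size=n), rng.normal(size=n)
X_null = np.column_stack([np.ones(n), x])
X_full = np.column_stack([np.ones(n), x, z])
y = 1.0 * x + rng.normal(size=n)
p = nested_permutation_test(y, X_null, X_full)
```

For a nonlinear model you would replace the least-squares fits with any fit-and-predict procedure; the skeleton of the loop stays the same.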