First, as always, be careful with any kind of automated variable selection. Start with subject-matter expertise if you can, and be sure to do EDA (exploratory data analysis).
Second, is your goal primarily good prediction performance rather than subject-matter understanding? You seem to have thousands of observations and only tens of variables -- so you may not really need to drop any variables at all. Even if some are spurious, leaving them in the model should barely affect predictive performance. You may benefit more from trying variable transformations, adding interaction terms, or using something other than linear regression.
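For instance, here is a minimal sketch (on simulated data, not yours) of checking whether pairwise interaction terms buy anything, by comparing cross-validated $R^2$ with and without them:

```python
# Minimal sketch (simulated data, not the OP's): does adding pairwise
# interaction terms improve cross-validated R^2 over a plain linear model?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))            # thousands of rows, tens of columns
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(size=2000)

plain = LinearRegression()
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)

print("plain        R^2:", cross_val_score(plain, X, y, cv=5).mean())
print("interactions R^2:", cross_val_score(with_interactions, X, y, cv=5).mean())
```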
However, to answer your question directly... if you really do need to have a sparse, interpretable model, and if a linear model really is suitable, and if the dataset is large, with a high true signal-to-noise ratio... then your procedure might not be that bad, by ideal-case statistical theory. As sample sizes get very large, AIC tends to select models that are a little too big (too many variables). K-fold cross-validation tends to pick models which are still too big, but not as big as AIC's. So there's some justification for using AIC as a "cheap" first pass to whittle down your model, then using CV as an "expensive" second pass to cut it down further. A theoretician could have some fun studying your procedure further. But for statistical practice, where the ideal case never holds, this is really quite a stretch!
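If you want to see the two-stage idea spelled out, here is a rough sketch on simulated data: a forward-selection pass driven by AIC, followed by a cross-validation pass over the nested candidates it produced. Treat this as an illustration of the procedure under discussion, not a recommendation.

```python
# Illustration only: the two-stage procedure under discussion, on simulated data.
# Stage 1 ("cheap"): forward selection driven by AIC.
# Stage 2 ("expensive"): compare the nested stage-1 candidates by K-fold CV.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 15))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=3000)   # only two real signals

def aic(cols):
    return sm.OLS(y, sm.add_constant(X[:, cols])).fit().aic

selected, remaining, path = [], list(range(X.shape[1])), []
while remaining:
    best = min(remaining, key=lambda j: aic(selected + [j]))
    if selected and aic(selected + [best]) >= aic(selected):
        break                                          # AIC no longer improves
    selected.append(best)
    remaining.remove(best)
    path.append(list(selected))                        # nested candidate models

cv_scores = [cross_val_score(LinearRegression(), X[:, cols], y, cv=10).mean()
             for cols in path]
print("AIC keeps :", sorted(path[-1]))
print("CV keeps  :", sorted(path[int(np.argmax(cv_scores))]))
```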
Finally, definitely don't do a hypothesis test. The p-values above are all wrong, because they do not account for the fact that you chose this model by a stepwise procedure. There is some quite recent work on legitimate p-values after stepwise selection (so-called post-selection or selective inference), but not yet, as far as I know, on combining it with AIC.
The typical permutation test, as you state, has a null hypothesis saying that the observations are exchangeable. If you have nested models $H_A: Y_i = \beta X_i + \epsilon_i$ and $H_0: Y_i = \epsilon_i$, with $\epsilon_i$ iid and zero-mean, then the exchangeability of the $Y_i$ under $H_0$ follows directly from the form of the smaller model.
If you have $H_A: Y_i = \beta X_i + \gamma Z_i + \epsilon_i$ and $H_0: Y_i = \beta X_i + \epsilon_i$, then the observations $Y_i$ are not exchangeable even under the small model. Instead, you could permute the null-model residuals $Y_i - \hat \beta_0 X_i$, where $\hat \beta_0$ is an estimate of $\beta$ under the null hypothesis $\gamma=0$. If $\hat \beta_0$ is the usual least-squares estimate $(X^TX)^{-1}X^TY$, then under the null model you're permuting $(I - X(X^TX)^{-1}X^T)\epsilon$, whose entries are not really exchangeable. For example, if $X = [1, 0, 0, 0]^T$, the first residual is deterministically zero while the rest are not. I recommend the bootstrap strategy from glen_b's answer, linked in the comments above.
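To make that last point concrete, here is a small numeric check of the $X = [1, 0, 0, 0]^T$ example: the null-model residuals $(I - H)\epsilon$ have unequal variances, so they cannot be exchangeable.

```python
# Numeric check of the example above: with X = [1, 0, 0, 0]^T the first
# null-model residual is forced to zero while the others keep full variance,
# so the residuals are not exchangeable even under the null model.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0], [0.0], [0.0], [0.0]])
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
eps = rng.normal(size=(4, 10_000))          # iid noise, many replications
resid = (np.eye(4) - H) @ eps               # null-model residuals
print(resid.var(axis=1))                    # approximately [0, 1, 1, 1]
```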
Edit: what to do with nonlinear models
As pointed out in the comments, you can fit a linear model to the residuals of another linear model and achieve the same predictions as you would via simultaneous estimation. This makes it obvious how to proceed (if you have nested linear models and you don't care about the lack of exchangeability of the null-model residuals): fit the null model, permute its residuals, then fit the remaining variables to them.
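Here is a quick numeric check of that equivalence, in its Frisch-Waugh-Lovell form; note that for the fitted values to match exactly, the extra predictor also has to be residualized against the null-model design, which the sketch below does:

```python
# Quick check of the partialling-out equivalence (Frisch-Waugh-Lovell form):
# fitting the extra predictor to the null-model residuals reproduces the
# fitted values of the jointly estimated model, provided the extra predictor
# is itself residualized against the null-model design.
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # null-model design
X2 = rng.normal(size=(n, 1))                              # extra predictor
y = X1 @ np.array([1.0, 2.0]) + 3.0 * X2[:, 0] + rng.normal(size=n)

def fit(A, b):
    """Least-squares fitted values of b regressed on the columns of A."""
    return A @ np.linalg.lstsq(A, b, rcond=None)[0]

yhat_joint = fit(np.column_stack([X1, X2]), y)   # simultaneous estimation

yhat0 = fit(X1, y)                               # null model
r0 = y - yhat0                                   # null-model residuals
X2_perp = X2 - fit(X1, X2)                       # residualize X2 on X1
yhat_twostep = yhat0 + fit(X2_perp, r0)          # two-step route

print(np.allclose(yhat_joint, yhat_twostep))     # True
```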
You can also do this with a nonlinear model:
1. Fit the smaller model to data $y$, producing predictions $\hat y_0$ and residuals $r_0 = y - \hat y_0$.
2. For $i$ in 1 to 10,000: fit the full model to $\pi_i(r_0) + \hat y_0$, where $\pi_i$ is a random permutation, and record the measure of fit.
3. Fit the full model to the actual data, record the measure of fit, and compare it to the null distribution generated in step 2.
To implement this, you only need each model to generate predictions. You don't need them to be linear. But again, don't do this, because those residuals are not actually exchangeable, even under the null model.
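For completeness, here is a minimal sketch of the recipe above (with the same caveat). The model-fitting functions are placeholders you would supply; they only need to return predictions, so nothing here assumes linearity.

```python
# Minimal sketch of the recipe above. `fit_predict_null` and `fit_predict_full`
# are placeholders for whatever models you are comparing; they only need to
# return predictions, so nothing here assumes linearity.
import numpy as np

def residual_permutation_test(X, y, fit_predict_null, fit_predict_full,
                              n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)

    def r2(y_obs, y_pred):                        # measure of fit
        return 1 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

    yhat0 = fit_predict_null(X, y)                # step 1: fit the smaller model
    r0 = y - yhat0                                # null-model residuals

    null_stats = np.empty(n_perm)                 # step 2: null distribution
    for i in range(n_perm):
        y_perm = yhat0 + rng.permutation(r0)      # permute residuals, rebuild y
        null_stats[i] = r2(y_perm, fit_predict_full(X, y_perm))

    observed = r2(y, fit_predict_full(X, y))      # step 3: observed fit
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_perm)
    return observed, p_value

# Hypothetical usage with least squares playing both roles (first two columns
# as the null model, all columns as the full model):
# fit_null = lambda X, y: X[:, :2] @ np.linalg.lstsq(X[:, :2], y, rcond=None)[0]
# fit_full = lambda X, y: X @ np.linalg.lstsq(X, y, rcond=None)[0]
# observed, p = residual_permutation_test(X, y, fit_null, fit_full, n_perm=1000)
```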
Best Answer
For the benefit of other readers I will briefly explain what the permutation test is in this context.
In this specific example there is a binary dependent variable $y$, a large number of independent variables $X$ with $n\ll p$, and a specific algorithm (PLS-DA) to predict $y$ from $X$ with one hyperparameter (number $k$ of PLS components). To find the optimal value of $k$ one uses cross-validation. To estimate the performance of this model-building procedure one uses nested cross-validation. Imagine that nested cross-validation is done and the performance is, say, $43\%$ error rate or $R_\mathrm{CV}^2=0.2$. Question: are these numbers significantly different from chance? Is 43% significantly less than 50%? Is 0.2 significantly larger than 0.0? Answer: Shuffle the labels and repeat the entire procedure; do this many times to obtain the null distribution of performance; compare your actual performance with this null distribution to get the $p$-value.
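For readers who want to see the shuffle-and-repeat recipe in code, here is a rough scikit-learn sketch on simulated noise-only data. I am using `PLSRegression` on a 0/1 label as a stand-in for PLS-DA, and `GridSearchCV` over the number of components plays the role of the inner loop, so each permutation goes through the full nested-CV procedure.

```python
# Rough sketch of the label-shuffling test (PLSRegression on a 0/1 label as a
# stand-in for PLS-DA; simulated noise-only data). GridSearchCV over the number
# of components is the inner loop, so every permutation repeats the whole
# nested-CV procedure.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, permutation_test_score

rng = np.random.default_rng(0)
n, p = 40, 500                                   # n << p
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n).astype(float)     # labels carry no signal here

pls_cv = GridSearchCV(PLSRegression(), {"n_components": [1, 2, 3, 5]}, cv=5)

score, perm_scores, p_value = permutation_test_score(
    pls_cv, X, y, cv=5, n_permutations=200, scoring="r2", n_jobs=-1
)
print(f"R^2_CV = {score:.2f}, p = {p_value:.3f}")
```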
What does an insignificant $p$-value mean here?
Saying that it means that "the model is overfitting" is a bit of an understatement. The model includes regularization, and cross-validation is performed to choose the optimal value of the regularization parameter (here, $k$), so the model is about as non-overfitting as it could possibly be. So what is going on?
As you cannot be sure of #3 (that is the outcome of the permutation test), you have to consider #2 as a real possibility. In other words: the model is absolutely useless, or at least you failed to show otherwise.
Using the absolutely useless model for variable selection
There is clearly no sense in doing that. You might as well select variables at random, or cherry-pick the variables most correlated with $y$.
I should add that PLS, PLS-DA, and related approaches are popular in some applied fields but practically unknown in mainstream statistics and machine learning. The most standard approach to binary classification with variable selection would be logistic regression with elastic net (ridge+lasso) regularization; it might be worth considering.
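A rough sketch of what that could look like in scikit-learn (illustrative data with $n \ll p$; the grids for `Cs` and `l1_ratios` are placeholders, not recommendations):

```python
# Rough sketch of the elastic-net alternative (simulated n << p data; the
# grids for Cs and l1_ratios are placeholders, not recommendations).
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))                   # n << p, as in the question
y = rng.integers(0, 2, size=40)

model = make_pipeline(
    StandardScaler(),                            # the penalty is not scale-invariant
    LogisticRegressionCV(
        penalty="elasticnet", solver="saga",
        Cs=10, l1_ratios=[0.2, 0.5, 0.8],
        cv=5, max_iter=5000,
    ),
)
model.fit(X, y)

coef = model[-1].coef_.ravel()
print("variables kept by the L1 part:", int(np.sum(coef != 0)))
```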
Disclaimer: I have no practical experience with PLS or PLS-DA models.