The `lm` function calculates the coefficients, but it does not calculate F-statistics or p-values; you need to run another function (`summary`, `anova`, etc.) on the results to see p-values. What the p-values mean depends on which function calculated them and how it was called. Based on your question you seem to have run at least some of these functions, but it is not clear which ones or how they were run.
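For concreteness, here is a minimal R sketch showing where each set of p-values comes from (the built-in `mtcars` data and the formula are purely stand-ins for your own model):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in data and formula
fit            # prints coefficients only -- no F-statistics or p-values
summary(fit)   # per-coefficient t-tests plus an overall F-test
anova(fit)     # sequential F-tests, one row per term
```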
You first need to decide what question or questions you are trying to answer. Based on those questions you can then decide which functions to run on your regression and which tests to examine (often additional tests are included that should simply be ignored).
Also, the F-statistic (or t-statistic) is a step toward a p-value: we do not compare p-values to F-statistics, we compute p-values from F-statistics.
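To illustrate that direction of travel (continuing the hypothetical `fit` from the sketch above), the overall p-value can be recovered from the F-statistic that `summary` stores:

```r
fstat <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
# the p-value is the upper-tail probability of the F distribution
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```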
When you run `summary` on a linear regression object it computes an overall F-test (reported near the bottom) that compares the full model to an intercept-only model (the overall mean). This answers the question of whether any subset (including the whole set) of the potential predictors is significantly related to the response. The `summary` function also does an adjusted t-test for each predictor, testing whether that predictor adds significantly to the prediction above and beyond the effect of all the other predictors in the model.
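If it helps to see those two kinds of tests side by side, they can be pulled directly out of the summary object (again using the hypothetical `fit` from above):

```r
s <- summary(fit)
s$coefficients  # per-predictor estimates, std. errors, t values, p-values
s$fstatistic    # overall F-test: model vs. intercept-only (grand mean)
```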
The `anova` function, when given a single regression object, does a set of sequential tests. Reading from top to bottom: the first test is whether the first predictor is significant by itself; the second is whether the second predictor is significant above and beyond the first (but ignoring the others); the third is whether the third predictor adds significantly above and beyond the first two; and so on. These tests are really only meaningful if you entered the predictors in a specific order chosen to match the tests of interest. This functionality is mainly left over from the days when a single analysis took hours or even days; now we can fit a new model with a couple of keystrokes in a few seconds, so the tests of interest do not need to be planned out as far ahead of time.
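A quick way to see the order dependence (with illustrative formulas, not your model) is to fit the same predictors in two different orders and compare the `anova` tables:

```r
anova(lm(mpg ~ wt + hp, data = mtcars))  # tests wt alone, then hp after wt
anova(lm(mpg ~ hp + wt, data = mtcars))  # tests hp alone, then wt after hp
# The sequential p-values generally differ between the two tables.
```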
If the `anova` function is given two or more (nested) regression models, it does a full- and reduced-model F-test, where the null hypothesis is that the simpler model fits just as well as the fuller model and the alternative is that the full model adds information beyond what is in the simpler model (with more than two models it compares 1 vs. 2, then 2 vs. 3, etc.). This simultaneously tests all the terms that are in the full model but not in the reduced one.
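A sketch of that full vs. reduced comparison, again with stand-in formulas:

```r
reduced <- lm(mpg ~ wt,             data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
anova(reduced, full)  # one F-test for hp and qsec jointly
```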
I have not figured out what question stepwise regression answers, only that it does not answer any of the questions that I am interested in. The consensus is moving away from automated stepwise regression.
You calculate PRESS on a model trained on $n$ values to get an idea of its out-of-sample performance, by leaving out one observation at a time. So while you do fit $n$ models to compute the statistic, you ultimately use the original model trained on all $n$ values.
Since you are only leaving out a single observation at a time (as in LOOCV), adding the 'last' observation back has minimal influence on the final model. Because of this you can safely use PRESS to compare models, even though the actual models you are comparing were not the ones used to calculate it.
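A literal leave-one-out computation of PRESS might look like the sketch below (`press_loo` is a hypothetical helper; the formula and data are stand-ins):

```r
press_loo <- function(formula, data) {
  n <- nrow(data)
  pred_err <- numeric(n)
  for (i in seq_len(n)) {
    fit_i <- lm(formula, data = data[-i, ])        # refit without row i
    y_i   <- model.response(model.frame(formula, data[i, ]))
    pred_err[i] <- y_i - predict(fit_i, newdata = data[i, ])
  }
  sum(pred_err^2)
}
press_loo(mpg ~ wt + hp, mtcars)
```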
If you have a larger sample size, you could consider a form of nested cross-validation: compare models in the inner cross-validation loop and evaluate the performance of the 'winning' model in the outer loop.
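A rough sketch of that scheme (the fold counts, candidate formulas, and the `nested_cv` helper are all illustrative assumptions, not from the original answer):

```r
nested_cv <- function(data, formulas, k_outer = 5, k_inner = 5) {
  n <- nrow(data)
  outer_folds <- sample(rep(seq_len(k_outer), length.out = n))
  outer_err <- numeric(k_outer)
  for (o in seq_len(k_outer)) {
    train <- data[outer_folds != o, ]
    test  <- data[outer_folds == o, ]
    # inner CV: mean squared prediction error for each candidate formula
    inner_folds <- sample(rep(seq_len(k_inner), length.out = nrow(train)))
    inner_mse <- sapply(formulas, function(f) {
      mean(sapply(seq_len(k_inner), function(j) {
        fit  <- lm(f, data = train[inner_folds != j, ])
        held <- train[inner_folds == j, ]
        mean((model.response(model.frame(f, held)) - predict(fit, held))^2)
      }))
    })
    # refit the inner winner on the full training fold, score on the test fold
    best <- formulas[[which.min(inner_mse)]]
    fit  <- lm(best, data = train)
    outer_err[o] <- mean((model.response(model.frame(best, test)) -
                            predict(fit, test))^2)
  }
  mean(outer_err)  # honest error estimate for the whole selection procedure
}
nested_cv(mtcars, list(mpg ~ wt, mpg ~ wt + hp))
```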
Best Answer
Note that the predicted residual sum of squares, PRESS, is obtained by jack-knifing the sample: there's no sense in calculating it for separate training & test sets. Calculate it for a model fitted to the whole sample (& compare it to the residual sum of squares, RSS, to assess the amount of over-fitting). For ordinary least-squares regression there's an analytic solution:
$$\sum_i \left(\frac{e_i}{1-h_{ii}}\right)^2$$
where $e_i$ is the $i$th residual & $h_{ii}$ its leverage—from the diagonal of the hat matrix
$$H=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}$$
(where $X$ is the design matrix).
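In R this analytic form is one line, since `residuals` & `hatvalues` give the $e_i$ & $h_{ii}$ directly (a sketch with a stand-in model):

```r
fit   <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
rss   <- sum(residuals(fit)^2)
c(PRESS = press, RSS = rss)  # PRESS much larger than RSS suggests over-fitting
```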
In general, cross-validation & bootstrap validation are preferable to splitting a sample into training & test sets: you don't lose precision in the estimates, as you do when fitting on a smaller training set, & the performance measure will be less variable than one computed on a single test set. How preferable they are depends on sample size.