Firstly, a method that pre-screens candidate predictors by their univariate correlations with the outcome and then puts the survivors into a final model will tend to do badly for a number of reasons: it ignores model uncertainty (a single selected model); it uses statistical significance/strength of correlation as the selection criterion (if the goal is prediction, you should rather assess how much a variable actually helps prediction - these are not necessarily the same thing); it "falsely" identifies predictors in univariate correlations (i.e. another predictor is actually better, but because the one you look at correlates a bit with it, it also looks like it correlates pretty well with the outcome); and it misses predictors whose effects only show up/become clear once other ones are adjusted for.
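A minimal simulation (purely illustrative; the variable names and effect sizes are made up) of the last two failure modes: x2 looks like a decent univariate hit only because it is correlated with the real predictor x1, while x3 and x4 look nearly useless on their own and only become clearly relevant in the joint model.

```python
# Illustrative simulation: x2 has no effect of its own but correlates with x1;
# x3 and x4 are nearly collinear and their effects cancel out marginally.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)    # correlated with x1, no own effect
x4 = rng.normal(size=n)
x3 = 0.95 * x4 + 0.31 * rng.normal(size=n)  # strongly correlated with x4
y = x1 + x3 - x4 + rng.normal(size=n)       # x3's marginal correlation is ~0

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})

# Univariate screening: x2 looks good, x3 and x4 look negligible
print(X.corrwith(pd.Series(y)))

# Joint model: x2 drops out, x3 and x4 are clearly relevant
print(sm.OLS(y, sm.add_constant(X)).fit().summary())
```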
Additionally, not wrapping this whole procedure into some form of bootstrapping/cross-validation/whatever to get a realistic assessment of your model uncertainty is likely to mislead you.
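As a hedged sketch of what such wrapping might look like (scikit-learn and a synthetic dataset are assumptions here, not anything from the question): the selection step has to sit inside the resampling loop so that each fold repeats it; selecting once on the full data and cross-validating afterwards gives optimistically biased estimates.

```python
# Sketch: a (univariate, purely for illustration) filter inside a Pipeline,
# so the selection is re-done within every cross-validation fold.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=5)),  # refit per fold
    ("model", LinearRegression()),
])
scores = cross_val_score(pipe, X, y, cv=10, scoring="r2")
print(scores.mean(), scores.std())  # spread reflects selection + model uncertainty
```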
Furthermore, assuming that continuous predictors have linear effects can often be improved upon by methods that do not make such an assumption (e.g. RF).
Using RF as a pre-selection step for a linear model is not such a good idea. Variable importance is really hard to interpret and it is really hard (or meaningless?) to set a cut-off on it. You do not know whether a variable's importance reflects the variable itself or its interactions, plus the linear model you fit afterwards loses the non-linear transformations of the variables that the forest exploited.
It depends in part on what you want to do. If you want good predictions, maybe you should not care too much about whether your method is a traditional statistical model or not.
Of course, there are plenty of things like the elastic net, LASSO, Bayesian models with the horseshoe prior etc. that fit better into a traditional modeling framework and could also accommodate e.g. splines for continuous covariates.
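A rough sketch of that combination (assuming scikit-learn >= 1.0 for SplineTransformer; the data are synthetic): an elastic net fit on spline-expanded continuous covariates, so shrinkage-based selection and non-linear effects coexist within an otherwise traditional model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer, StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

model = make_pipeline(
    SplineTransformer(degree=3, n_knots=5),  # basis expansion per covariate
    StandardScaler(),                        # penalties assume comparable scales
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=10),
)
model.fit(X, y)
print(model[-1].alpha_, model[-1].l1_ratio_)  # penalty chosen by cross-validation
```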
A random forest does not use a separate training and testing set. Instead, standard accuracy estimation in random forests takes advantage of an important feature: bagging, or bootstrap aggregation.
To construct a random forest, a large number of data subsets are generated by sampling with replacement from the full dataset. A separate decision tree is fit to each bootstrap data subset, and the trees jointly form the random forest. Each data point from the full dataset is present in approximately 2/3 of the bootstrap data subsets and absent from the remaining 1/3. You can therefore use the roughly 1/3 of trees whose bootstrap samples do not contain a given point to predict its value; these predictions are called out-of-bag (OOB) estimates. This avoids the overfitting problem (and arguably makes cross-validation redundant for this purpose), since the points were not used to build the trees that predict them. By repeating this for every point in the full dataset and comparing the OOB predictions against the true values, you can calculate the accuracy of the random forest.
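A minimal illustration with scikit-learn's implementation (an assumption on my part; the same OOB idea is what R's randomForest reports): setting oob_score=True gives an accuracy computed only from trees whose bootstrap samples did not include the point being predicted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)
print(rf.oob_score_)  # OOB accuracy, no separate test set needed
```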
The mean decrease in accuracy metric (generally recommended) for a variable is calculated by permuting the values of that variable (in Breiman's implementation, within each tree's out-of-bag cases) and measuring how much the accuracy of the random forest drops.
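The same idea can be sketched with scikit-learn's permutation_importance (note this permutes on whatever data you pass in rather than on the out-of-bag samples as in Breiman's original, so a held-out set is the safer choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
print(result.importances_mean)  # mean decrease in accuracy per feature
```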
The mean decrease in Gini metric is explained this way by Breiman & Cutler (which I took from this helpful answer):
Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
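The analogous quantity in scikit-learn (an assumption; the quote above refers to Breiman & Cutler's own implementation) is the impurity-based feature_importances_, which sums the Gini decreases over all splits on each variable and normalizes them to sum to one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print(rf.feature_importances_)  # mean decrease in Gini impurity, normalized
```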
Best Answer
Using the CV values from randomForest is a valid approach to feature selection, if you've selected sensible settings for your OOB sampling.
However, with only 70 observations you're not going to get great results. For one, it's really hard to generate enough "folds" to get a meaningful validation number while still having enough data to build the forest.
Random forests work best when there are a lot more than 70 rows to sample from - I don't know how many trees you're using (and at what depth), but you'll quickly run out of unique 'draws' at each decision point and you'll see suboptimal performance.
The real challenge of your situation is how to do variable selection without many input rows. A traditional approach to this kind of situation would be a penalized linear regression, like ridge or lasso. These tend to perform best for variable selection in cases with very few observations.
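A hedged sketch of that suggestion (scikit-learn's LassoCV on synthetic data sized like your 70 rows): the non-zero coefficients are the "selected" variables, though with this little data the selected set will be unstable across resamples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=70, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X, y)
print(np.flatnonzero(lasso[-1].coef_))  # indices of variables kept by the lasso
```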
All this being said, given the very small amount of data you are working with, it's going to be really hard to get a good model even if you use the most sophisticated techniques.