This sounds somewhat like gradient tree boosting. The idea of boosting is to find the best linear combination of a class of models. If we fit a tree to the data, we are trying to find the tree that best explains the outcome variable. If we instead use boosting, we are trying to find the best linear combination of trees.
However, boosting is a little more efficient: instead of keeping a collection of independent random trees, we build each new tree to work on the examples that the current ensemble does not yet predict well.
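To make this concrete, here is a minimal sketch of gradient boosting with squared loss: each round fits a new tree (here, a hand-rolled one-feature regression stump) to the current residuals. The data, the learning rate `nu = 0.5`, and the stump learner are illustrative choices, not part of the answer above.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump on one feature: pick the threshold
    whose two leaf means minimize the squared error of the targets r."""
    best = None
    for t in np.sort(x)[:-1]:  # keep at least one point on each side
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, ml, mr = best
    return lambda z: np.where(z <= t, ml, mr)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.3 * rng.normal(size=200)

# Gradient boosting with squared loss: each new stump is fit to the
# current residuals, i.e. to the part of y the ensemble still gets wrong.
pred, nu = np.zeros_like(y), 0.5  # nu is the learning rate (shrinkage)
for _ in range(50):
    h = fit_stump(x, y - pred)
    pred += nu * h(x)

mse_boosted = np.mean((y - pred) ** 2)
print(mse_boosted)
```

After 50 rounds the ensemble's training error is far below what any single stump can reach, which is exactly the "focus on what we cannot predict well yet" behavior described above.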
For more on this, I'd suggest reading chapter 10 of Elements of Statistical Learning:
http://statweb.stanford.edu/~tibs/ElemStatLearn/
While this isn't a complete answer to your question, I hope it helps.
1) The reason you're confused is that the term "stepwise" is used inconsistently. Sometimes it means pretty specific procedures in which $p$-values of regression coefficients, calculated in the ordinary way, are used to determine what covariates are added to or removed from a model, and this process is repeated several times. It may refer to (a) a particular variation of this procedure in which variables can be added or removed at any step (I think this is what SPSS calls "stepwise"), or it may refer to (b) this variation along with other variations such as only adding variables or only removing variables. More broadly, "stepwise" can be used to refer to (c) any procedure in which features are added to or removed from a model according to some value that's computed each time a feature (or set of features) is added or removed.
These different strategies have all been criticized for various reasons. I would say that most of the criticism is about (b); the key part of that criticism is that $p$-values are poorly equipped for feature selection (the significance tests here are really testing something quite different from "should I include this variable in the model?"), and most serious statisticians recommend against it in all circumstances. (c) is more controversial.
2) Because statistics education is really bad. To give just one example: so far as I can tell from my own education, it's apparently considered a key part of statistics education for psychology majors to tell students to use Bessel's correction to get unbiased estimates of the population SD. It's true that Bessel's correction makes the estimate of the variance unbiased, but it's easy to prove that the estimate of the SD is still biased. Worse yet, Bessel's correction can increase the MSE of these estimates.
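A quick Monte Carlo check of this claim, using only the standard library (the normal population and the sample size `n = 5` are arbitrary choices for illustration). Python's `statistics.variance` and `statistics.stdev` both apply Bessel's correction, i.e. they divide by $n-1$:

```python
import random
import statistics

random.seed(0)
n, reps, sigma = 5, 100_000, 1.0
var_hats, sd_hats = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    var_hats.append(statistics.variance(sample))  # divides by n - 1
    sd_hats.append(statistics.stdev(sample))      # sqrt of the above

mean_var = sum(var_hats) / reps
mean_sd = sum(sd_hats) / reps
# The variance estimate averages close to sigma^2 = 1 (unbiased),
# while the SD estimate averages noticeably below sigma = 1 (biased low).
print(mean_var, mean_sd)
```

The downward bias of the SD estimate follows from Jensen's inequality: the square root is concave, so $E[\sqrt{\hat\sigma^2}] < \sqrt{E[\hat\sigma^2]} = \sigma$.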
3) Variable selection is practically a field unto itself. Cross-validation and train–test splits are ways to evaluate a model, possibly after feature selection; they don't themselves provide suggestions for which features to use. The lasso is often a good choice. So is best subsets.
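To illustrate why the lasso is a reasonable choice for variable selection, here is a small hand-rolled sketch using cyclic coordinate descent with soft-thresholding. The simulated data and the penalty `lam = 0.2` are assumptions for the example; in practice you would use an established implementation and pick the penalty by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.5 * rng.normal(size=n)  # only features 0 and 2 matter

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via cyclic coordinate descent with soft-thresholding,
    minimizing (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r_j / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

beta = lasso_cd(X, y, lam=0.2)
print(np.round(beta, 3))
```

The soft-thresholding step sets small coefficients exactly to zero, so the irrelevant features drop out of the model entirely; this is the sense in which the lasso does feature selection rather than mere shrinkage.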
4) In my mind, there's still no sense in using (b), especially when you could do something else in (c) instead, like using AIC. I have no objections to AIC-based stepwise selection, but be aware that it's going to be sensitive to the sample (in particular, as samples grow arbitrarily large, AIC, like the lasso, always chooses the most complex model), so don't present the model selection itself as if it were a generalizable conclusion.
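As a sketch of what "using AIC" means for ordinary least squares, here is a minimal comparison of two nested models using the common Gaussian form $n\log(\mathrm{RSS}/n) + 2k$ (up to an additive constant; the simulated data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + 0.5 * rng.normal(size=n)

def aic_ols(A, y):
    """Gaussian AIC for OLS, up to an additive constant:
    n * log(RSS / n) + 2k, with k the number of fitted coefficients."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * A.shape[1]

ones = np.ones((n, 1))
aic_null = aic_ols(ones, y)                           # intercept only
aic_full = aic_ols(np.hstack([ones, x[:, None]]), y)  # intercept + x
print(aic_null, aic_full)  # the model including x has much lower AIC
```

A stepwise procedure of type (c) would simply add or remove one variable at a time, keeping the move whenever it lowers this quantity.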
> If we are looking to see which variables seem to explain the response and in what way
Ultimately, if you want to look at the effects of all the variables, you need to include all the variables, and if your sample is too small for that, you need a bigger sample. Remember, null hypotheses are never true in real life. There aren't going to be a bunch of variables that are associated with an outcome and a bunch of other variables that aren't. Every variable will be associated with the outcome—the questions are to what degree, in what direction, in what interactions with other variables, etc.
Yes and no. By selecting only a subset of features and creating synthetic variables, you can speed up the trees' convergence, but not necessarily improve it: a synthetic variable, being a combination of one or more variables and one or more split rules, is nothing more than a node in a decision tree, so the tree will find it on its own if it finds it relevant.
But the idea is good: a widely used technique is to look at the results of a simple decision tree and use the conditions of its first nodes to create synthetic features that are then fed back into a logistic regression or a stepwise regression.
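A minimal sketch of that technique, where two threshold conditions stand in for conditions read off a fitted tree's first nodes. The data, the thresholds, and the hand-rolled gradient-descent logistic regression are all illustrative assumptions, not the original author's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
# The true rule is an AND of two threshold conditions, which a single
# linear boundary in the raw features cannot represent.
y = ((X[:, 0] > 0.5) & (X[:, 1] < 0.0)).astype(float)

# Hypothetical conditions read off the first nodes of a fitted tree
synth = np.column_stack([X[:, 0] > 0.5, X[:, 1] < 0.0]).astype(float)
X_aug = np.hstack([X, synth])  # original features + synthetic indicators

def logloss_after_gd(X, y, steps=2000, lr=0.5):
    """Fit a logistic regression by gradient descent; return final log-loss."""
    Xb = np.hstack([np.ones((len(y), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    p = np.clip(1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30))), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

loss_raw = logloss_after_gd(X, y)
loss_aug = logloss_after_gd(X_aug, y)
print(loss_raw, loss_aug)  # augmenting with the tree conditions lowers the loss
```

The synthetic indicators hand the logistic regression exactly the non-linear structure the tree discovered, which it could not express on the raw features alone.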
The objective of each split is to find the division, or more precisely the variable and the split rule, that produces the strongest decrease in the heterogeneity of the left and right child nodes, $\kappa_l$ and $\kappa_r$. When $Y$ is a qualitative variable, several heterogeneity functions can be defined for a node: criteria based on the notion of entropy or on the Gini concentration (there is also the CHAID criterion, based on the $\chi^2$ statistical test). In practice, it turns out that the choice of criterion has little influence, and Gini is usually chosen by default.
For a node $\kappa$, let $p_\kappa^l$ be the proportion of elements of class $l$ in the node $\kappa$, and $m$ the number of classes.
\begin{align*} \textit{Entropy}: \quad &S_\kappa = - \sum_{l=1}^{m} p_\kappa^l \log(p_\kappa^l) \\ \textit{Gini}: \quad &G_\kappa = \sum_{l=1}^{m} p_\kappa^l (1 - p_\kappa^l) = 1 - \sum_{l=1}^{m} (p_{\kappa}^{l})^2 \end{align*}
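Both criteria are easy to compute directly from the class proportions of a node; a minimal sketch (using the natural logarithm, matching the formula above):

```python
import math

def entropy(p):
    """Shannon entropy of class proportions p (natural log, 0*log(0) = 0)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def gini(p):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(q * q for q in p)

print(entropy([1.0]), gini([1.0]))            # pure node: both are 0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # maximal for two classes
```

Both functions are zero for a pure node and largest when the classes are evenly mixed, which is why either can serve as the heterogeneity measure the split tries to reduce.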
As you mentioned earlier, we cannot directly use the Akaike information criterion or the Bayesian information criterion here. Nevertheless, a backward stepwise selection is easy to apply.
In the case of random forests, one variable-selection method is based on the variables' importance scores (a variable's ability to predict $Y$). We then employ a top-down (backward) strategy: step by step, we remove the least important variable according to the importance criterion, computing the prediction error at each stage of the algorithm. The subset finally chosen is the one that minimizes the prediction error.
The algorithm can be summed up as follows:
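As a dependency-free sketch of this backward strategy: the answer above uses a random forest and its importance score, but here ordinary least squares with the smallest absolute coefficient stands in as the importance measure, and a holdout split stands in for the prediction-error estimate (data and model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_train = 400, 6, 300
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.5 * rng.normal(size=n)  # features 1, 3, 4, 5 are noise

def fit_ols(features):
    A = np.hstack([np.ones((n_train, 1)), X[:n_train, features]])
    beta, *_ = np.linalg.lstsq(A, y[:n_train], rcond=None)
    return beta

def holdout_error(features):
    """Prediction error of the current subset on the held-out rows."""
    beta = fit_ols(features)
    B = np.hstack([np.ones((n - n_train, 1)), X[n_train:, features]])
    return np.mean((y[n_train:] - B @ beta) ** 2)

features, history = list(range(p)), []
while features:
    history.append((list(features), holdout_error(features)))
    beta = fit_ols(features)
    # Remove the least important variable; |coefficient| stands in here
    # for the random-forest importance score.
    features.pop(int(np.argmin(np.abs(beta[1:]))))

best, best_err = min(history, key=lambda t: t[1])
print(best, best_err)  # the subset minimizing the prediction error
```

The loop records the prediction error at each stage and finally returns the subset with the lowest recorded error, exactly the stopping rule described above.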