I think you are already following a "best practice" approach to feature selection. Using a regularised regression approach like LASSO, and complementing those insights with a distribution-free model like Random Forest, to identify the most important features is probably the best way to go.
Some minor suggestions: I would propose using Elastic Net so as to include a small amount of $L_2$ regularisation. This should make the feature selection a bit more stable in the presence of correlated features. Similarly, taking a slightly more sophisticated approach that wraps Random Forests in a full feature-selection framework like Boruta (see Nilsson et al. for background, CRAN link), instead of relying on simple Random Forest variable importance alone, will probably be beneficial.
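To make the Elastic Net suggestion concrete, here is a minimal sketch using scikit-learn in Python (an assumption on my part; in R, `glmnet` with `alpha` between 0 and 1 plays the same role). The data are synthetic and the `l1_ratio` value is illustrative, not a recommendation:

```python
# Minimal sketch: Elastic Net for feature selection (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Toy data: 100 samples, 20 features, only 5 of which are informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Standardise first so the penalty treats all features comparably.
X_std = StandardScaler().fit_transform(X)

# l1_ratio < 1 mixes in some L2 regularisation, which stabilises the
# selection when features are correlated.
model = ElasticNetCV(l1_ratio=0.9, cv=5, random_state=0).fit(X_std, y)

selected = np.flatnonzero(model.coef_)   # features with non-zero weights
print("Selected feature indices:", selected)
```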
Having said the above, we should use such feature selection approaches only if we cannot work with our original full dataset and/or we have problems collecting the features in question in the future (e.g. they are too costly to measure). Otherwise, using a modelling approach that can actively regularise the resulting model (e.g. gradient boosting machines, where we can regularise the fit by properly picking the learning rate, tree depth, minimum number of children per leaf node, etc.) is the best way to go; see the sketch below. That way we know we are not reusing our data, and that we are not losing valuable information that might have been missed during a separate feature selection step.
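As a hedged illustration of "regularise the model instead of pre-selecting features", here is a sketch using scikit-learn's `GradientBoostingRegressor` (an assumption; LightGBM and XGBoost expose analogous knobs). The hyperparameter values are placeholders you would tune:

```python
# Sketch: keep all features, control overfitting via the GBM's own knobs.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# A small learning rate, shallow trees, a minimum leaf size and
# subsampling all act as regularisers on the fit.
gbm = GradientBoostingRegressor(learning_rate=0.05, max_depth=3,
                                min_samples_leaf=20, n_estimators=500,
                                subsample=0.8, random_state=0)

scores = cross_val_score(gbm, X, y, cv=5, scoring="r2")
print("CV R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```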
An issue not touched upon above is performing data reduction with a dimensionality reduction technique like PCA, ICA, NNMF, etc. These techniques do not "select features" per se but rather "combine features" to create meta-features of varying informational value. They can be very useful if we need a small set of "information-rich" features. Nevertheless, these meta-features are not guaranteed to contain more, less, or indeed any information relevant to our modelling task, so they are not a silver bullet for feature selection. They usually offer a convenient, condensed representation of our original data when we cannot work with it in raw form.
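A short sketch of the distinction, again assuming scikit-learn: the principal components below are linear mixtures of all the original columns, not a subset of them, which is exactly the "combine, don't select" point:

```python
# Sketch: PCA "combines" features into meta-features (scikit-learn assumed).
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_diabetes(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain ~90% of the total variance.
pca = PCA(n_components=0.90).fit(X_std)
print("Components kept:", pca.n_components_)
print("Explained variance ratios:", pca.explained_variance_ratio_.round(2))

# Each component mixes every original feature; none is "selected" outright.
print("Loadings of the first component:", pca.components_[0].round(2))
```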
Your question asks about feature transformation in machine learning & statistics. This is actually two separate questions, in line with the two main uses of statistical techniques: To Explain, and To Predict.
Feature transformation in explanatory statistics
You're using a relatively simple statistical technique, like linear regression or PCA, to separate out and explain the various effects you're seeing in the data. Any transformation you perform on the data is something you need to keep track of and include in your final equation.
An example: imagine you have independent variables $x_1,x_2,x_3$ and dependent variable $y$. Your first step will probably be to run a linear regression to find an equation $y = ax_1 + bx_2 + cx_3 + d$. You do this and find that the $R^2$ is really poor. You plot the raw data and find that there is a lot of heteroscedasticity, and that the data are strongly skewed, with some points occurring orders of magnitude higher than others. This observation leads you to try doing the regression on the logs of the variables $x_1,x_2,x_3,y$:
$\log(y) = a\log(x_1) + b\log(x_2) + c\log(x_3) + d.$
This approach gives you a much higher $R^2$. But now you've actually changed the equation. If you exponentiate everything, you'll find this is equivalent to $y = x_1^a x_2^b x_3^c e^d$. The variables are now multiplied together rather than summed. You've found a different and better equation to fit the data with. By transforming the data, you change the model, and therefore change the explanation.
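Here is a small numeric sketch of that log-log fit, assuming Python with numpy and scikit-learn (the variable names and true exponents below are invented for illustration):

```python
# Sketch: fitting log(y) = a*log(x1) + b*log(x2) + c*log(x3) + d,
# which back-transforms to y = x1^a * x2^b * x3^c * e^d.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.lognormal(size=(3, n))          # skewed, multi-scale data
y = (x1**2.0) * (x2**0.5) * (x3**-1.0) * np.exp(rng.normal(0.0, 0.1, n))

# Regress on the logs of everything.
X_log = np.log(np.column_stack([x1, x2, x3]))
reg = LinearRegression().fit(X_log, np.log(y))
print("a, b, c:", reg.coef_.round(2), " d:", round(reg.intercept_, 2))
# Expected to recover roughly a=2, b=0.5, c=-1, d=0.
```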
Feature transformation in predictive statistics
You're using some form of machine learning process to spit out answers from a set of inputs. Your model is a black box and no one cares what's inside it as long as it is spitting out accurate answers quickly. Feature transformation can be considered part of this black box - the raw info is getting chopped up and compressed anyway so it doesn't matter in the slightest that it goes through a transformation before being fed into the machine.
Feature transformation is important for the training of machine learning tools, usually for mundane, practical reasons. For example, with neural network training (especially if you're using sigmoid or tanh activation functions), you want all your input variables to be transformed so that they are somewhere close to zero (usually between -2 and +2). There is no fundamental statistical reason for this, it's just to make the training process go faster.
Neural networks, like many complicated statistical techniques, use some form of gradient descent to optimise their parameters and actually 'learn'. Gradient descent requires the local topography of the parameter space to be sloped so that it can follow the direction of the slope. Inputting a very large positive or negative number into a sigmoid function results in a number that is very close to either 0 or 1, where the local topography of the parameter space is almost flat. This will cause the algorithm either to learn extremely slowly, or decide it's already at the optimum and declare that it's done learning. Both of these are bad.
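The saturation effect is easy to see numerically. Below is a small Python sketch (numpy assumed) showing how the sigmoid's gradient, $\sigma'(x)=\sigma(x)(1-\sigma(x))$, collapses for large inputs, which is the practical reason for scaling inputs toward zero:

```python
# Sketch: sigmoid gradients vanish for large |x|, so unscaled inputs
# put the optimiser on nearly flat terrain.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 10.0, 50.0]:
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigmoid={s:.6f}  gradient={s * (1 - s):.2e}")

# Standardising inputs (zero mean, unit variance) keeps pre-activations
# in the steep middle region where gradients remain informative.
```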
Feature transformations in machine learning training are used to massage the data into a format more conducive to rapid learning.
Best Answer
Like most aspects of statistics, variable selection is a balancing act.
Manually trimming the list of potential predictor variables can protect against overfitting, as most commonly used variable selection algorithms are context-free - that is, they only look at relationships within the dataset and cannot factor in the wider meaning of the variables. This means that an automated algorithm may pick up on relationships among a large number of predictor variables that are illusory and won't generalise outside the dataset.
This makes manual elimination of "bad" variables a good initial step in some cases, though it depends on the analyst's subject-matter judgement being sound. So yes, while there are ways of automatically selecting variables (for example Stepwise Selection or LASSO regression, sketched below), they should only be used where appropriate. In the example case, the analyst used their knowledge of the subject matter to eliminate unimportant variables.
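For completeness, a hedged sketch of automatic stepwise selection using scikit-learn's `SequentialFeatureSelector` (assumed here; classical stepwise regression in R would use `step()` with AIC instead):

```python
# Sketch: forward stepwise selection, greedily adding the feature that
# most improves cross-validated fit (scikit-learn assumed).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=4,
                       noise=5.0, random_state=0)

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction="forward", cv=5).fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))
```

As the answer notes, such context-free procedures only see relationships inside the dataset, so their output still deserves a sanity check against domain knowledge.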