Almost any approach that does some form of model selection and then carries out further analyses as if no model selection had happened typically has poor properties. Unless there are compelling theoretical arguments, backed up by evidence from e.g. extensive simulation studies at realistic sample sizes and feature-to-sample-size ratios, showing that this is an exception, such an approach is likely to have unsatisfactory properties. I am not aware of any such positive evidence for this approach, but perhaps someone else is. Given that there are reasonable alternatives that achieve all the desired goals (e.g. the elastic net), it is hard to justify using such a suspect ad hoc approach instead.
To begin with, just to put the issue aside: clearly, if the features are not normalized to zero mean and unit variance, it's easy to build cases where a coefficient means very little. In general, if you take a feature and multiply it by $\alpha$, a regressor will simply divide its coefficient by $\alpha$, for example.
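Here is a minimal sketch of that rescaling effect (illustrative code, not from the original answer; the data and $\alpha = 100$ are made up):

```python
# Rescaling a feature by alpha divides its fitted coefficient by alpha,
# so raw coefficient magnitudes mean little unless features are standardized.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

alpha = 100.0
X_scaled = X.copy()
X_scaled[:, 0] *= alpha  # multiply the first feature by alpha

print(LinearRegression().fit(X, y).coef_)         # roughly [3.0, 1.0]
print(LinearRegression().fit(X_scaled, y).coef_)  # roughly [0.03, 1.0]
```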
Even when the variables are all normalized, large coefficients can mean very little. Say that $x$ is some hidden feature somewhat correlated with $y$, and $z$ and $w$ are observed features that are slightly noisy versions of $x$. The regression problem is then ill-conditioned, and you could get large-magnitude coefficients for $z$ and $w$ (perhaps with opposite signs). Regularization is usually used precisely to avoid this.
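A quick sketch of this, again with made-up data (the noise scales are chosen just to make the instability visible):

```python
# Two near-duplicate features can receive large, opposite-signed coefficients
# in an unregularized fit; a ridge penalty keeps them small and stable.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
x = rng.normal(size=500)                      # hidden feature
y = x + rng.normal(scale=0.5, size=500)
z = x + rng.normal(scale=0.001, size=500)     # noisy copy of x
w = x + rng.normal(scale=0.001, size=500)     # another noisy copy of x
X = np.column_stack([z, w])

print(LinearRegression().fit(X, y).coef_)  # often large, with opposite signs
print(Ridge(alpha=1.0).fit(X, y).coef_)    # roughly equal and close to 0.5
```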
Perhaps sklearn.feature_selection.f_regression is similar to what you're looking for. It reports, for each individual feature, both the F-statistic and the p-value. Alternatively, for any regression scheme, a "black box" approach could be to build the model using all features except $x$, and assess its performance (using cross validation). You could then rank the features based on how much performance suffers when each one is removed.
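A sketch of both ideas with synthetic data (f_regression is the real sklearn function; the drop-one-feature loop is just an illustrative version of the "black box" ranking):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)

# Per-feature univariate F-statistics and p-values.
f_scores, p_values = f_regression(X, y)
print(f_scores, p_values)

# "Black box" alternative: drop one feature at a time and see how much
# cross-validated performance drops without it.
full_score = cross_val_score(LinearRegression(), X, y, cv=5).mean()
for i in range(X.shape[1]):
    X_minus_i = np.delete(X, i, axis=1)
    score = cross_val_score(LinearRegression(), X_minus_i, y, cv=5).mean()
    print(f"feature {i}: score drop without it = {full_score - score:.3f}")
```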
Feature importance is a bit tricky to define. In the above two schemes, if $x_i$ is the $i$th "most important" feature, it does not necessarily mean that, given $x_1, \ldots, x_{i - 1}$, it is indeed the next most informative one (its information may already be contained in the preceding features).
Best Answer
I interpret OP's one-sentence question to mean that OP wishes to understand the desirability of the following analysis pipeline:

1. Fit a random forest and measure variable importance.
2. Retain only the features the random forest ranks as most important.
3. Fit a linear (or logistic) regression model using just those features.
I don't think this pipeline will accomplish what you'd like. Variables that are important in random forest don't necessarily have any sort of linearly additive relationship with the outcome. This remark shouldn't be surprising: it's what makes random forest so effective at discovering nonlinear relationships.
Here's an example. I created a classification problem with 10 noise features, two "signal" features, and a circular decision boundary.
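One way to generate this kind of data (a sketch in Python; the original construction isn't shown here, but the setup is as described: a circular boundary on two signal features plus ten noise features):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X_signal = rng.uniform(-2, 2, size=(n, 2))                        # x1, x2: signal features
y = (X_signal[:, 0] ** 2 + X_signal[:, 1] ** 2 < 1).astype(int)   # inside the unit circle
X_noise = rng.normal(size=(n, 10))                                 # 10 irrelevant noise features
X = np.hstack([X_signal, X_noise])
```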
And when we apply the RF model, we are not surprised to find that these features are easily picked out as important by the model. (NB: this model isn't tuned at all.)
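For instance, with the sketch data above (an untuned sklearn forest standing in for whatever implementation you use):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0).fit(X, y)
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance = {imp:.3f}")
# Features 0 and 1 (the signal features) should have by far the largest
# importances; the noise features should be near zero.
```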
But when we down-select to just these two useful features, the resulting linear model is awful.
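A sketch of that step, continuing with the same illustrative data; statsmodels is used here only because its GLM results expose the residual and null deviance discussed next (the original answer's output format isn't reproduced):

```python
import statsmodels.api as sm

X_two = sm.add_constant(X[:, :2])   # keep only the two "important" features
glm = sm.GLM(y, X_two, family=sm.families.Binomial()).fit()
print(glm.summary())
print("null deviance:    ", glm.null_deviance)
print("residual deviance:", glm.deviance)
# The residual deviance is barely smaller than the null deviance, and the
# coefficients on x1 and x2 come out close to zero.
```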
The important part of the summary is the comparison of the residual deviance and the null deviance. We can see that the model does basically nothing to "move" the deviance. Moreover, the coefficient estimates are essentially zero.
What accounts for the wild difference between the two models? Well, clearly the decision boundary we're trying to learn is not a linear function of the two "signal" features. Obviously if you knew the functional form of the decision boundary prior to estimating the regression, you could apply some transformation to encode the data in a way that regression could then discover... (But I've never known the form of the boundary ahead of time in any real-world problem.) Since we're only working with two signal features in this case, and with a synthetic data set that has no noise in the class labels, the boundary between classes is very obvious in our plot. But it's less obvious when working with real data in a realistic number of dimensions.
Moreover, in general, random forest can fit different models to different subsets of the data. In a more complicated example, it won't be obvious what's going on from a single plot at all, and building a linear model of similar predictive power will be even harder.
Because we're only concerned with two dimensions, we can make a prediction surface. As expected, the random forest model learns that the neighborhood around the origin is important.
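A sketch of such a surface for the forest fitted above, holding the ten noise features at zero (an arbitrary illustrative choice so the 12-feature model can be evaluated on a 2-D grid):

```python
import matplotlib.pyplot as plt

grid = np.linspace(-2, 2, 200)
g1, g2 = np.meshgrid(grid, grid)
grid_X = np.column_stack([g1.ravel(), g2.ravel(), np.zeros((g1.size, 10))])
proba = rf.predict_proba(grid_X)[:, 1].reshape(g1.shape)

plt.contourf(g1, g2, proba, levels=20)
plt.colorbar(label="P(class = 1)")
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Random forest predicted probability")
plt.show()
# The high-probability region concentrates around the origin, roughly
# reproducing the circular decision boundary.
```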
As implied by our abysmal model output, the prediction surface for the reduced-variable logistic regression model is basically flat.
HongOoi notes that the class membership isn't a linear function of the features, but that it is a linear function under a transformation. Because the decision boundary is $1 = x_1^2 + x_2^2,$ if we square these features, we will be able to build a more useful linear model. This is deliberate. While the RF model can find the signal in those two features without transformation, the analyst has to be more specific to get similarly helpful results from the GLM. Perhaps that's sufficient for OP: finding a useful set of transformations for 2 features is easier than for 12. But my point is that even if a transformation will yield a useful linear model, RF feature importance won't suggest the transformation on its own.
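To see the effect of the transformation on the same illustrative data: because the synthetic labels have no noise, an unregularized GLM on the squared features would run into perfect separation, so this sketch uses sklearn's (regularized) logistic regression and compares cross-validated accuracy instead.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

acc_raw = cross_val_score(LogisticRegression(max_iter=1000), X[:, :2], y, cv=5).mean()
acc_sq = cross_val_score(LogisticRegression(max_iter=1000), X[:, :2] ** 2, y, cv=5).mean()
print(f"accuracy, raw features:     {acc_raw:.2f}")  # about the majority-class rate
print(f"accuracy, squared features: {acc_sq:.2f}")   # close to 1.0
```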