Solved – For feature selection in a linear regression model, can I use coefficient estimates?

Tags: feature-selection, p-value, python, regression, scikit-learn

I'm doing the Analytics Edge course from edX. The course uses R, while I'm using Python.

In linear regression, in order to improve the model, we have to figure out the most significant features.

The course uses the summary function in R to look at the significance codes (the stars/dots) next to each feature and the p-values. No such thing exists in sklearn.


So I'm using the coefficients to identify the most significant features.
But I'm not sure I should trust the coefficients for this purpose (even though, for this problem, the two approaches agree).

So are the coefficients from LinearRegression in sklearn reliable for determining the significance of the features?
Are p-values themselves reliable for detecting the significant features?

(I know of statsmodels and do not wish to use it.)

Best Answer

To begin with, just to put one issue aside: clearly, if the features are not normalized to zero mean and unit variance, it's easy to build cases where a coefficient means very little. For example, if you take a feature and multiply it by $\alpha$, the regressor will simply divide its coefficient by $\alpha$.
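As a minimal sketch of this (synthetic data; the scale factor and feature values are made up for illustration), rescaling one column by $\alpha$ leaves the fit essentially unchanged but divides that column's coefficient by $\alpha$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

alpha = 1000.0
X_scaled = X.copy()
X_scaled[:, 0] *= alpha  # same feature in different units (e.g. km -> m)

print(LinearRegression().fit(X, y).coef_)         # roughly [3.0, 0.5]
print(LinearRegression().fit(X_scaled, y).coef_)  # roughly [0.003, 0.5]
```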

Even when the variables are all normalized, large coefficients can mean very little. Say that $x$ is some hidden feature somewhat correlated with $y$, and $z$ and $w$ are observed features which are slightly noisy versions of $x$. The design matrix will then be ill-conditioned, and you could get large-magnitude coefficients for $z$ and $w$ (perhaps with opposite signs). Regularization is usually used precisely to avoid this.
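A small sketch of that failure mode (synthetic data; the names $z$ and $w$ follow the text above): two nearly identical noisy copies of a hidden feature can receive large, possibly opposite-sign coefficients from ordinary least squares, while a regularized fit such as Ridge keeps them small and similar:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)                  # hidden feature
y = x + rng.normal(scale=0.5, size=n)   # target, correlated with x
z = x + rng.normal(scale=0.01, size=n)  # noisy observed copy of x
w = x + rng.normal(scale=0.01, size=n)  # another noisy observed copy
X = np.column_stack([z, w])

print(LinearRegression().fit(X, y).coef_)  # unstable, possibly large opposite-sign values
print(Ridge(alpha=1.0).fit(X, y).coef_)    # shrunk toward roughly [0.5, 0.5]
```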

Perhaps sklearn.feature_selection.f_regression is similar to what you're looking for. It reports, for each individual feature, both the F-score and the p-value of a univariate regression against the target. Alternatively, for any regression scheme, a "black box" approach is to build the model on all features except $x_i$ and assess its performance (using cross validation). You could then rank the features by how much the performance drops when each one is removed.
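Both ideas could look roughly like the following sketch (synthetic data from make_regression; the cross-validation setup and R² scoring are just one reasonable choice, not the only one):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                       noise=10.0, random_state=0)

# 1) Univariate scores: one F-statistic / p-value per feature.
f_scores, p_values = f_regression(X, y)
print(f_scores, p_values)

# 2) "Black box" drop-one-feature ranking: how much does CV R^2 fall
#    when feature i is removed from the model?
full_score = cross_val_score(LinearRegression(), X, y, cv=5).mean()
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)
    drop_score = cross_val_score(LinearRegression(), X_drop, y, cv=5).mean()
    print(f"feature {i}: drop in CV R^2 = {full_score - drop_score:.4f}")
```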

Feature importance is a bit tricky to define. In the above two schemes, if $x_i$ is ranked as the $i$th "most important" feature, it does not necessarily mean that, after selecting $x_1, \ldots, x_{i-1}$, it is indeed the next most useful one to add (its information may already be contained in the preceding ones).