Using the regression coefficients alone, you can understand which features contribute most to the prediction of a given input vector.
However, you first have to standardize each variable (i.e. subtract the mean and divide by the standard deviation). After refitting your model on the standardized data, the feature with the largest regression coefficient (in absolute value) is the one that contributes the most to the predictions.
The regression coefficients are comparable after scaling because scaling makes the units of the features irrelevant: a one unit increase in the scaled feature $X_1$ corresponds to an increase of one standard deviation of the unscaled feature.
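A quick sketch of this idea, using plain NumPy on a made-up two-feature example (the data and scale factors are invented for illustration):

```python
import numpy as np

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0, 1, n)            # feature on unit scale
x2 = rng.normal(0, 100, n)          # feature on a much larger scale
y = 3 * x1 + 0.01 * x2 + rng.normal(0, 0.1, n)

X = np.column_stack([x1, x2])

# Standardize: subtract the mean, divide by the standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit ordinary least squares on the standardized features.
coef, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(n), X_std]), y, rcond=None
)
beta = coef[1:]  # drop the intercept

# After scaling, the coefficients are comparable: x1 contributes
# about 3 per standard deviation, x2 only about 1 (0.01 * 100).
print(np.abs(beta))
```

On the raw data, x2's coefficient (0.01) looks negligible even though its large scale gives it a real effect; after standardization the comparison is fair.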
TL;DR
There won't be a difference if f_regression just computes the F statistic and picks the best features. There might be a difference in the ranking if f_regression does the following:
- Start with a constant model, $M_0$
- Try all models $M_1$ consisting of just one feature and pick the best according to the F statistic
- Try all models $M_2$ consisting of $M_1$ plus one other feature and pick the best ...
In that case the correlations will not be the same at each iteration. But you can still get this ranking by just computing the correlation at each step, so why does f_regression take an additional step? It does two things:
- Feature selection: if you want to select the $k$ best features in a machine learning pipeline, where you only care about accuracy and have measures to adjust for under/overfitting, you might only care about the ranking, and the additional computation is not useful.
- Test for significance: if you are trying to understand the effect of some variables on an output in a study, you might want to build a linear model and include only the variables that significantly improve the model with respect to some $p$-value. Here, f_regression comes in handy.
What is an F-test
An F-test (Wikipedia) is a way of assessing whether adding new variables significantly improves a model. You can use it when you have a basic model $M_0$ and a more complicated model $M_1$, which contains all the variables of $M_0$ and some more. The F-test tells you whether $M_1$ is significantly better than $M_0$, with respect to a $p$-value.
To do so, it uses the residual sum of squares as the error measure and compares the reduction in error with the number of variables added and the number of observations (more details on Wikipedia). Adding variables, even completely random ones, is expected to help the model achieve lower error simply by adding another dimension. The goal is to figure out whether the new features are really helpful, or whether they are random numbers that only help the model because they add a dimension.
What does f_regression do
Note that I am not familiar with the scikit-learn implementation, but let's try to figure out what f_regression is doing. The documentation states that the procedure is sequential. If the word sequential means the same as in other statistical packages, such as Matlab's Sequential Feature Selection, here is how I would expect it to proceed:
- Start with a constant model, $M_0$
- Try all models $M_1$ consisting of just one feature and pick the best according to the F statistic
- Try all models $M_2$ consisting of $M_1$ plus one other feature and pick the best ...
For now, I think this is a close enough approximation to answer your question: is there a difference between the ranking produced by f_regression and ranking by correlation?
If you were to start with the constant model $M_0$ and try to find the best one-feature model $M_1$, you would select the same feature whether you use f_regression or your correlation-based approach, as they are both measures of linear dependency. But if you were to go from $M_0$ to $M_1$ and then to $M_2$, there would be a difference in your scoring.
Assume you have three features, $x_1, x_2, x_3$, where both $x_1$ and $x_2$ are highly correlated with the output $y$, but also highly correlated with each other, while $x_3$ is only mildly correlated with $y$. Your method of scoring would assign the best scores to $x_1$ and $x_2$, but the sequential method might not. In the first round, it would pick the best feature, say $x_1$, to create $M_1$. Then, it would evaluate both $x_2$ and $x_3$ for $M_2$. As $x_2$ is highly correlated with an already selected feature, most of the information it contains is already incorporated into the model, and therefore the procedure might select $x_3$. While it is less correlated with $y$, it is more correlated with the residuals (the part that $x_1$ does not already explain) than $x_2$ is. This is how the two procedures you propose differ.
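This effect is easy to reproduce on synthetic data (the coefficients below are made up to match the scenario just described):

```python
import numpy as np

# x1 and x2 are highly correlated with y and with each other,
# x3 is only mildly correlated with y.
rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)      # nearly a copy of x1
x3 = rng.normal(size=n)
y = x1 + x2 + 0.5 * x3 + 0.5 * rng.normal(size=n)

def corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

# Univariate ranking: x1 and x2 both beat x3.
print([round(corr(x, y), 2) for x in (x1, x2, x3)])

# Sequential step: after selecting x1 for M1, score the remaining
# features against the residuals of the model y ~ x1.
A = np.column_stack([np.ones(n), x1])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# x3 is now more correlated with the residuals than x2 is.
print(round(corr(x2, resid), 2), round(corr(x3, resid), 2))
```

Once $x_1$ is in the model, almost everything $x_2$ knows about $y$ is already explained, so $x_3$ wins the second round despite its weaker marginal correlation.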
You can still emulate the same effect with your idea by building your model sequentially and measuring the gain of each additional feature, instead of comparing every feature to the constant model $M_0$ as you are doing now. The result would not differ from the f_regression results. The reason this function exists is to provide this sequential feature selection and, additionally, to convert the result to an F measure which you can use to judge significance.
The goal of the F-test is to provide a significance level. If you want to make sure the features you are including are significant with respect to your $p$-value, you use an F-test. If you just want to include the $k$ best features, you can use the correlation only.
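For a single feature, the correlation and the F statistic are directly related: $F = \frac{r^2}{1 - r^2}(n - 2)$ with $(1, n - 2)$ degrees of freedom, which is exactly the conversion that turns a ranking score into a significance test. A minimal sketch on made-up data:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

# Correlation of the candidate feature with the output...
r = np.corrcoef(x, y)[0, 1]

# ...converted to an F statistic with (1, n - 2) degrees of freedom.
F = r**2 / (1 - r**2) * (n - 2)
p_value = f_dist.sf(F, 1, n - 2)
print(F, p_value)
```

The ranking by $|r|$ and the ranking by $F$ are identical (the conversion is monotone in $r^2$); what the F statistic adds is the $p$-value.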
Additional material: Here is an introduction to the F-test you might find helpful
Best Answer
The first thing to note is that you don't use logistic regression as a classifier. The fact that $Y$ is binary has absolutely nothing to do with using this maximum likelihood method to actually classify observations. Once you get past that, concentrate on the gold standard information measure which is a by-product of maximum likelihood: the likelihood ratio $\chi^2$ statistic. You can produce a chart showing the partial contribution of each predictor in terms of its partial $\chi^2$ statistic. These statistics have maximum information/power. You can use the bootstrap to show how hard it is to pick "winners" and "losers" by getting confidence intervals on the ranks of the predictive information provided by each predictor once the other predictors are accounted for. An example is in Section 5.4 of my course notes - click on Handouts.
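The bootstrap-the-ranks idea can be sketched in a few lines. This is a simplified stand-in, not the method from the course notes: it uses absolute correlation with the outcome as the importance score instead of the partial likelihood-ratio $\chi^2$ statistic, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 4
X = rng.normal(size=(n, p))
# Binary outcome driven mostly by the first two predictors.
logits = 1.5 * X[:, 0] + 1.0 * X[:, 1] + 0.3 * X[:, 2]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(float)

def ranks(X, y):
    """Rank predictors by importance (rank 1 = most important).

    Absolute correlation stands in for the partial chi-square here,
    purely to illustrate bootstrapping the ranks."""
    score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return score.argsort()[::-1].argsort() + 1

# Recompute the ranks on 500 bootstrap resamples.
boot = np.array([
    ranks(X[idx], y[idx])
    for idx in (rng.integers(0, n, n) for _ in range(500))
])

# 95% bootstrap intervals on each predictor's rank.
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print(lo, hi)
```

Wide rank intervals are the point of the exercise: they show how hard it is to declare "winners" and "losers" among predictors from a single sample.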
If you have highly correlated features you can do a "chunk test" to combine their influence. A chart that does this is given in Figure 15.11, where size represents the combined contribution of 4 separate predictors.