Solved – Difference between selecting features based on “F regression” and based on $R^2$ values

f-test, feature selection, python, r-squared, scikit learn

Is comparing features using F-regression the same as correlating features with the label individually and observing the $R^2$ value?

I have often seen my colleagues use an F regression for feature selection in their machine learning pipeline from sklearn:

sklearn.feature_selection.SelectKBest(score_func=sklearn.feature_selection.f_regression...)

Can someone please tell me why it gives the same results as simply correlating each feature with the label/dependent variable?

The advantage of using f_regression for feature selection is not clear to me.

Here's my code: I'm using the mtcars dataset from R:

import pandas as pd
import numpy as np
from sklearn import feature_selection
from sklearn.linear_model import LinearRegression

#....load mtcars dataset into a pandas dataframe called "df", not shown here for conciseness

# only using these numerical columns as features ['mpg', 'disp', 'drat', 'wt']
# using this column as the label:  ['qsec']

columns = ['mpg', 'disp', 'drat', 'wt']

model = feature_selection.SelectKBest(score_func=feature_selection.f_regression,
                                      k=4)

results = model.fit(df[columns], df['qsec'])

print(results.scores_)
print(results.pvalues_)

# Using the coefficient of determination (R^2) of each single-feature regression:

for col in columns:
    lm = LinearRegression(fit_intercept=True)
    lm.fit(df[[col]], df['qsec'])
    print(lm.score(df[[col]], df['qsec']))

As suspected, the ranking of the features is exactly the same:

scores using f_regression:

[ 6.376702    6.95008354  0.25164249  0.94460378]


 scores using coefficient of determination:

0.175296320261  
0.18809385182
0.00831830818303
0.0305256382746

As you can see, the second feature is ranked the highest, the first feature is second, the fourth feature is third, and the third feature is last, in both cases.

Is there ever a case where the F_regression would give different results, or would rank the features differently in some way?

EDIT:
To summarize, I'd like to know if these two rankings of features ever give different results:

1) ranking features by their F-statistic when regressing them with the outcome individually (this is what sklearn does) AND,

2) ranking features by their R-squared value when regressing them with the outcome, again individually.

Best Answer

TL;DR

There won't be a difference if F-regression just computes the F statistic and picks the best features. There might be a difference in the ranking, assuming F-regression does the following:

  • Start with a constant model, $M_0$
  • Try all models $M_1$ consisting of just one feature and pick the best according to the F statistic
  • Try all models $M_2$ consisting of $M_1$ plus one other feature and pick the best ...

In that case the rankings can differ, because the correlations involved change at each iteration. But you can still reproduce this ranking by recomputing the correlations at each step, so why does F-regression take an additional step? It does two things:

  • Feature selection: If you want to select the $k$ best features in a machine learning pipeline, where you only care about accuracy and have ways to deal with under- or overfitting, you might only care about the ranking, and the additional computation is not useful.
  • Test for significance: If you are trying to understand the effect of some variables on an output in a study, you might want to build a linear model and only include the variables that significantly improve it, with respect to some $p$-value. Here, F-regression comes in handy (a short sketch of both use cases follows this list).
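For illustration, here is a minimal sketch of both use cases on synthetic data (the dataset and every parameter value below are invented; SelectFpr is scikit-learn's selector that filters features on their univariate p-values):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectFpr, f_regression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# 1) Pure ranking: keep the k highest-scoring features, ignoring significance
k_best = SelectKBest(score_func=f_regression, k=3).fit(X, y)
print(k_best.get_support(indices=True))

# 2) Significance: keep only features whose univariate p-value is below alpha,
#    so the number of selected features is not fixed in advance
significant = SelectFpr(score_func=f_regression, alpha=0.05).fit(X, y)
print(significant.get_support(indices=True))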

What is an F-test

An F-test (Wikipedia) is a way of assessing the significance of the improvement of a model due to the addition of new variables. You can use it when you have a basic model $M_0$ and a more complicated model $M_1$, which contains all the variables from $M_0$ and some more. The F-test tells you whether $M_1$ is significantly better than $M_0$, with respect to a $p$-value.

To do so, it uses the residual sum of squares as an error measure and compares the reduction in error with the number of variables added and the number of observations (more details on Wikipedia). Adding variables, even completely random ones, is expected to reduce the error simply because they add another dimension. The goal is to figure out whether the new features are really helpful, or whether they are random numbers that help the model only because they add a dimension.
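As a concrete illustration, here is a minimal sketch of such a nested-model F-test on synthetic data (the variables and numbers are made up; scipy is used only to obtain the p-value):

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=(n, 1))
x2 = rng.normal(size=(n, 1))                  # pure noise, unrelated to y
y = 3 * x1[:, 0] + rng.normal(size=n)

def rss(X, y):
    model = LinearRegression().fit(X, y)
    return np.sum((y - model.predict(X)) ** 2)

rss_0 = rss(x1, y)                            # smaller model M0: intercept + x1
rss_1 = rss(np.hstack([x1, x2]), y)           # larger model M1: M0 plus x2

df_num = 1                                    # number of added variables
df_den = n - 3                                # n minus the parameters of M1 (intercept + 2 slopes)
F = ((rss_0 - rss_1) / df_num) / (rss_1 / df_den)
p_value = stats.f.sf(F, df_num, df_den)
print(F, p_value)                             # rss_1 <= rss_0 always; the p-value tells you
                                              # whether the drop is more than chance

Since x2 is pure noise here, we would expect a large p-value most of the time: the error goes down, but not by more than adding a random dimension would explain.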


What does f_regression do

Note that I am not familiar with the scikit-learn implementation, but let's try to figure out what f_regression is doing. The documentation states that the procedure is sequential. If the word sequential means the same as in other statistical packages, such as Matlab's Sequential Feature Selection, here is how I would expect it to proceed (a rough sketch in code follows the list):

  • Start with a constant model, $M_0$
  • Try all models $M_1$ consisting of just one feature and pick the best according to the F statistic
  • Try all models $M_2$ consisting of $M_1$ plus one other feature and pick the best ...
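Here is a rough sketch of that hypothetical sequential procedure (my own reading of "sequential", not necessarily what scikit-learn actually implements): at each step, add the feature whose inclusion yields the largest F statistic against the current model.

import numpy as np
from sklearn.linear_model import LinearRegression

def rss(X, y):
    if X.shape[1] == 0:                       # constant model M_0: predict the mean
        return np.sum((y - y.mean()) ** 2)
    model = LinearRegression().fit(X, y)
    return np.sum((y - model.predict(X)) ** 2)

def forward_f_selection(X, y):
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        rss_current = rss(X[:, selected], y)
        best_j, best_f = None, -np.inf
        for j in remaining:
            candidate = selected + [j]
            rss_new = rss(X[:, candidate], y)
            df_den = n - len(candidate) - 1   # residual df of the candidate model
            f_stat = (rss_current - rss_new) / (rss_new / df_den)
            if f_stat > best_f:
                best_j, best_f = j, f_stat
        selected.append(best_j)
        remaining.remove(best_j)
    return selected                           # features in the order they were added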

For now, I think this is a close enough approximation to answer your question: is there a difference between the ranking produced by f_regression and the ranking by correlation?

If you were to start with the constant model $M_0$ and try to find the best model with only one feature, $M_1$, you would select the same feature whether you use f_regression or your correlation-based approach, as they are both measures of linear dependency. But if you were to go from $M_0$ to $M_1$ and then to $M_2$, there would be a difference in your scoring.
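One way to see this for the one-feature step: in a univariate regression with an intercept, the F statistic is a monotone transform of that feature's $R^2$, namely $F = \frac{R^2}{1 - R^2}(n - 2)$, so the two criteria always produce the same ranking. A minimal sketch with synthetic data (everything below is made up, not the mtcars example):

import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

f_scores, p_values = f_regression(X, y)
r2 = np.array([LinearRegression().fit(X[:, [j]], y).score(X[:, [j]], y)
               for j in range(X.shape[1])])

print(f_scores)                               # univariate F statistics
print(r2 / (1 - r2) * (n - 2))                # same values, up to floating point error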

Assume you have three features, $x_1, x_2, x_3$, where both $x_1$ and $x_2$ are highly correlated with the output $y$, but also highly correlated with each other, while $x_3$ is only mildly correlated with $y$. Your method of scoring would assign the best scores to $x_1$ and $x_2$, but the sequential method might not. In the first round, it would pick the best feature, say $x_1$, to create $M_1$. Then, it would evaluate both $x_2$ and $x_3$ for $M_2$. As $x_2$ is highly correlated with an already selected feature, most of the information it contains is already incorporated into the model, and therefore the procedure might select $x_3$. While it is less correlated to $y$, it is more correlated to the residuals, the part that $x_1$ does not already explain, than $x_2$ is. This is how the two procedures you propose differ.
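Here is a small synthetic construction of exactly that situation, reusing the forward_f_selection sketch from above (the data and coefficients are invented for illustration):

import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)             # x1 and x2 carry almost the same signal
x2 = z + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)                       # only mildly related to y
y = z + 0.5 * x3 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

f_scores, _ = f_regression(X, y)
print(np.argsort(f_scores)[::-1])             # univariate ranking: x1 and x2 first, x3 last
print(forward_f_selection(X, y))              # sequential order: x1 (or x2) first, then
                                              # typically x3, because x2 is redundant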

You can still emulate the same effect with your idea by building your model sequentially and measuring the gain of each additional feature over the current model, instead of comparing every feature to the constant model $M_0$ as you are doing now. The result would not be different from the f_regression results. The reason for this function to exist is to provide this sequential feature selection and, additionally, to convert the result to an F measure which you can use to judge significance.
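A brief check of that claim, reusing the data and helpers from the sketches above: picking, at each step, the candidate feature with the largest $R^2$ (equivalently, the largest gain over the current model) gives the same order as picking by the step-wise F statistic, because both are monotone in the candidate model's residual error.

def forward_r2_selection(X, y):
    n, p = X.shape
    tss = np.sum((y - y.mean()) ** 2)
    selected, remaining = [], list(range(p))
    while remaining:
        best_j = max(remaining,
                     key=lambda j: 1.0 - rss(X[:, selected + [j]], y) / tss)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

print(forward_r2_selection(X, y))             # same order as forward_f_selection(X, y)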


The goal of the F-test is to provide a significance level. If you want to make sure the features you are including are significant with respect to your $p$-value, you use an F-test. If you just want to include the $k$ best features, you can use the correlation only.


Additional material: Here is an introduction to the F-test you might find helpful
