Is comparing features using F-regression
the same as correlating features with the label individually and observing the $R^2$ value?
I have often seen my colleagues use F-regression from sklearn
for feature selection in their machine learning pipelines:
`sklearn.feature_selection.SelectKBest(score_func=sklearn.feature_selection.f_regression, ...)`
Could someone tell me why it gives the same results as simply correlating each feature with the label/dependent variable?
It is not clear to me what the advantage of using f_regression
for feature selection is.
Here's my code: I'm using the mtcars
dataset from R
:
import pandas as pd
import numpy as np
from sklearn import feature_selection
from sklearn.linear_model import LinearRegression

# ...load the mtcars dataset into a pandas DataFrame called "df", not shown here for conciseness
# features: the numerical columns ['mpg', 'disp', 'drat', 'wt']
# label: the column 'qsec'
columns = ['mpg', 'disp', 'drat', 'wt']

# Scores and p-values from f_regression:
model = feature_selection.SelectKBest(score_func=feature_selection.f_regression, k=4)
results = model.fit(df[columns], df['qsec'])
print(results.scores_)
print(results.pvalues_)

# Using just the coefficient of determination of one-feature fits:
for col in columns:
    lm = LinearRegression(fit_intercept=True)
    lm.fit(df[[col]], df['qsec'])
    print(lm.score(df[[col]], df['qsec']))
As suspected, the ranking of the features is exactly the same:
scores using f_regression:
[ 6.376702 6.95008354 0.25164249 0.94460378]
scores using coefficient of determination:
0.175296320261
0.18809385182
0.00831830818303
0.0305256382746
As you can see, the second feature is ranked the highest, the first feature is second, the fourth feature is third, and the third feature is last, in both cases.
Is there ever a case where f_regression
would give different results, or would rank the features differently in some way?
EDIT:
To summarize, I'd like to know if these two rankings of features ever give different results:
1) ranking features by their F-statistic when regressing each one individually against the outcome (this is what sklearn does), and
2) ranking features by their $R^2$ value when regressing each one individually against the outcome.
Best Answer
TL;DR

There won't be a difference if f_regression just computes the F statistic and picks the best features. There might be a difference in the ranking, assuming f_regression does the following:

- start with a constant model, $M_0$;
- try all models $M_1$ consisting of just one feature and pick the best;
- try all models $M_2$ consisting of $M_1$ plus one other feature and pick the best;
- and so on,

as the correlation will not be the same at each iteration. But you can still get this ranking by just computing the correlation at each step, so why does f_regression take an additional step? It does two things:

- feature selection: picking the $k$ best features, for which only the ranking matters;
- significance testing: converting each score into an F statistic and a $p$-value, so you can judge whether a feature significantly improves the model. This is where f_regression comes in handy.

What is an F-test
An F-test (Wikipedia) is a way of assessing whether adding new variables significantly improves a model. You can use it when you have a basic model $M_0$ and a more complicated model $M_1$, which contains all the variables of $M_0$ and some more. The F-test tells you whether $M_1$ is significantly better than $M_0$, with respect to a $p$-value.
To do so, it uses the residual sum of squares as the error measure, and compares the reduction in error to the number of variables added and the number of observations (more details on Wikipedia). Adding variables, even completely random ones, is expected to lower the error, since each new variable gives the model another dimension to exploit. The goal is to figure out whether the new features are genuinely informative, or whether they are random numbers that help only because they add a dimension.
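As a concrete sketch of this comparison (made-up data, plain numpy/scipy; not the sklearn implementation), fitting two nested OLS models and testing whether the extra variable helps:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # pure noise, unrelated to y
y = 2.0 * x1 + rng.normal(size=n)

def rss(features, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

# M0: y ~ x1 (2 parameters), M1: y ~ x1 + x2 (3 parameters)
rss0, rss1 = rss([x1], y), rss([x1, x2], y)
p0, p1 = 2, 3

# F compares the error reduction per added variable to the remaining error
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))
p_value = stats.f.sf(F, p1 - p0, n - p1)
print(F, p_value)  # x2 is noise, so the p-value should be unremarkable
```

Note that RSS always decreases when a variable is added; the F statistic asks whether it decreased by more than noise alone would explain.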
What does f_regression do

Note that I am not familiar with the scikit-learn implementation, but let's try to figure out what f_regression is doing. The documentation states that the procedure is sequential. If the word sequential means the same as in other statistical packages, such as Matlab's Sequential Feature Selection, here is how I would expect it to proceed:

- start with a constant model, $M_0$;
- at each step, try adding each remaining feature to the current model, and keep the one whose addition is most significant according to the F-test;
- stop once $k$ features have been selected.

For now, I think this is a close enough approximation to answer your question: is there a difference between the ranking of f_regression and ranking by correlation?

If you were to start with the constant model $M_0$ and try to find the best model with only one feature, $M_1$, you would select the same feature whether you use f_regression or your correlation-based approach, as they are both measures of linear dependency. But if you were to go from $M_0$ to $M_1$ and then to $M_2$, there would be a difference in your scoring.

Assume you have three features, $x_1, x_2, x_3$, where both $x_1$ and $x_2$ are highly correlated with the output $y$, but also highly correlated with each other, while $x_3$ is only mildly correlated with $y$. Your method of scoring would assign the best scores to $x_1$ and $x_2$, but the sequential method might not. In the first round, it would pick the best feature, say $x_1$, to create $M_1$. Then, it would evaluate both $x_2$ and $x_3$ for $M_2$. As $x_2$ is highly correlated with an already selected feature, most of the information it contains is already incorporated into the model, and therefore the procedure might select $x_3$. While $x_3$ is less correlated to $y$, it is more correlated to the residuals, the part that $x_1$ does not already explain, than $x_2$ is. This is how the two procedures you propose differ.
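A small synthetic illustration of this (made-up data, numpy only): $x_2$ is nearly a copy of $x_1$, so it ranks high univariately but adds almost nothing once $x_1$ is selected, while $x_3$ explains the residuals well:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # almost a copy of x1
x3 = rng.normal(size=n)              # independent of x1 and x2
y = x1 + 0.5 * x3 + 0.1 * rng.normal(size=n)

def r2(x, t):
    """Squared Pearson correlation = R^2 of a one-feature OLS fit."""
    return np.corrcoef(x, t)[0, 1] ** 2

# Univariate ranking: x1 and x2 both look excellent, x3 looks mediocre
print([round(r2(x, y), 3) for x in (x1, x2, x3)])

# Sequential step: after selecting x1, score the remaining candidates
# against the residuals of the y ~ x1 fit; now x3 wins by a wide margin
slope, intercept = np.polyfit(x1, y, 1)
resid = y - (slope * x1 + intercept)
print(round(r2(x2, resid), 3), round(r2(x3, resid), 3))
```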
You can still emulate the same effect with your idea by building your model sequentially and measuring the gain from each additional feature, instead of comparing every feature to the constant model $M_0$ as you do now. The result would not differ from the f_regression results. The reason for this function to exist is to provide this sequential feature selection and, additionally, to convert the result into an F statistic that you can use to judge significance.

The goal of the F-test is to provide a significance level. If you want to make sure the features you are including are significant with respect to your $p$-value, you use an F-test. If you just want to include the $k$ best features, you can use the correlation only.
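For what it's worth, in current scikit-learn versions f_regression scores each feature independently rather than sequentially, and its documentation gives the score as an increasing function of the squared per-feature correlation, $F = \frac{r^2}{1 - r^2}(n - 2)$, which is why the two rankings in the question always agree. A quick check on made-up data:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=n)

F, p = f_regression(X, y)

# Same statistic reconstructed from the squared correlations alone
r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(3)])
F_by_hand = r2 / (1 - r2) * (n - 2)
print(np.allclose(F, F_by_hand))  # same numbers, hence the same ranking
```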
Additional material: Here is an introduction to the F-test you might find helpful