Solved – How to scale data for SelectKBest feature selection

feature selection, scikit-learn

I am trying SelectKBest to select the most important features:

# SelectKBest: 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
sel = SelectKBest(chi2, k='all')

# Load Dataset: 
from sklearn import datasets
iris = datasets.load_iris() 

# Run SelectKBest on the unscaled iris.data
newx = sel.fit_transform(iris.data, iris.target)
print(newx[0:5])

It works all right and the output is:

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

However, when I try to use SelectKBest on scaled data, I get an error:

# Scale iris.data
from sklearn.preprocessing import StandardScaler    
scaler = StandardScaler()
X = scaler.fit_transform(iris.data) 

# Run SelectKBest on the scaled data
newx = sel.fit_transform(X, iris.target)

This raises an error:

ValueError: Input X must be non-negative.

How can I scale the data so that there are no negative values for this purpose? Or is scaling unnecessary when selecting features from a dataset?

Best Answer

I think the problem is that you're using the chi2 scoring function. If you instead use the f_classif scoring function, there will not be any errors caused by negative values in your dataset. If you want to keep using chi2, you need to transform your data to remove the negatives: for example, you could rescale it so that all values fall between 0 and 1, or shift it so that the minimum value is 0. If you're already working with normalized values such as z-scores and don't want to apply any further transformation, consider using the ANOVA (f_classif) scoring function for your feature selection instead.
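
As a minimal sketch of that alternative (assuming the same iris data and StandardScaler setup from the question), f_classif accepts negative feature values, so it runs directly on standardized data:

# f_classif works on standardized (negative-valued) data:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)   # z-scored features, can be negative

sel = SelectKBest(f_classif, k='all')           # ANOVA F-test scoring
newx = sel.fit_transform(X, iris.target)        # no non-negativity error here
print(sel.scores_)                              # per-feature F statistics

Here sel.scores_ holds the ANOVA F statistic for each feature, which is what SelectKBest ranks on when you choose a smaller k.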

So, to answer the question directly: additional scaling to get rid of negatives may not be necessary for selecting features from a dataset. If you are using z-score normalization or some other scaling that produces negative values (say your data falls between -1 and +1), you can simply use the f_classif scoring function, which does not require non-negative inputs.

As one example of how you can rescale the data in order to use chi2: when I've used the chi2 scoring function in sklearn, I start with data that are not normalized at all and rescale them to fall between 0 and 1 like this:

normed_data = (data - data.min(0)) / data.ptp(0)

Here, data.min(0) returns the minimum value of each data column and data.ptp(0) returns the range (max minus min) of each column, so normed_data ends up being a matrix in which every column has been independently normalized to the range [0, 1].
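
Putting that together with the question's setup (a sketch, assuming the same iris data), you can apply this min-max rescaling and then run chi2 without hitting the non-negativity error:

# Min-max rescale each column to [0, 1], then run chi2:
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
data = iris.data
normed_data = (data - data.min(0)) / np.ptp(data, axis=0)   # column-wise range, same as data.ptp(0)

sel = SelectKBest(chi2, k='all')
newx = sel.fit_transform(normed_data, iris.target)          # all inputs are now non-negative
print(sel.scores_)                                          # chi-squared score per feature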
