Solved – How to scale data for SelectKBest feature selection

feature selection, scikit-learn

I am trying SelectKBest to select the most important features:

# SelectKBest: 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
sel = SelectKBest(chi2, k='all')

# Load Dataset: 
from sklearn import datasets
iris = datasets.load_iris() 

# Run SelectKBest on the unscaled iris.data
newx = sel.fit_transform(iris.data, iris.target)
print(newx[0:5])

It works all right and the output is:

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

However, when I try to use SelectKBest on scaled data, I get an error:

# Scale iris.data
from sklearn.preprocessing import StandardScaler    
scaler = StandardScaler()
X = scaler.fit_transform(iris.data) 

# Run SelectKBest on the scaled data
newx = sel.fit_transform(X, iris.target)

This raises an error:

ValueError: Input X must be non-negative.

How can I scale the data so that there are no negative values for this purpose? Or is scaling unnecessary when selecting features from a dataset?

Best Answer

I think the problem is that you're using the chi2 scoring function. If you instead use the f_classif scoring function, there will not be any errors caused by negative values in your dataset. If you want to keep using chi2, you need to transform your data to remove the negatives: for example, you could rescale it so that all values fall between 0 and 1, or shift it so that the minimum value is 0. If you're already working with normalized values such as z-scores and don't want to apply any further transformation, consider using the ANOVA (f_classif) scoring function for your feature selection instead.
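
As a minimal sketch of that alternative (assuming the same iris data and StandardScaler setup from the question), f_classif accepts negative feature values, so it runs directly on standardized data:

# f_classif works on standardized (negative-valued) data:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)   # z-scored features, can be negative

sel = SelectKBest(f_classif, k='all')           # ANOVA F-test scoring
newx = sel.fit_transform(X, iris.target)        # no non-negativity error here
print(sel.scores_)                              # per-feature F statistics

Here sel.scores_ holds the ANOVA F statistic for each feature, which is what SelectKBest ranks on when you choose a smaller k.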

So, to answer the question directly: additional scaling to get rid of negatives may not be necessary for selecting features from a dataset. If you are using z-score normalization or some other scaling that produces negative values (say your data falls between -1 and +1), you can simply use the f_classif scoring function, which does not require non-negative inputs.

As one example of how you can rescale the data in order to use chi2: when I've used the chi2 scoring function in sklearn, I start with data that are not normalized at all and rescale them to fall between 0 and 1 like this:

normed_data = (data - data.min(0)) / data.ptp(0)

Here, data.min(0) returns the minimum value of each data column and data.ptp(0) returns the range (max minus min) of each column, so normed_data ends up being a matrix in which every column has been independently normalized to the range [0, 1].
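
Putting that together with the question's setup (a sketch, assuming the same iris data), you can apply this min-max rescaling and then run chi2 without hitting the non-negativity error:

# Min-max rescale each column to [0, 1], then run chi2:
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
data = iris.data
normed_data = (data - data.min(0)) / np.ptp(data, axis=0)   # column-wise range, same as data.ptp(0)

sel = SelectKBest(chi2, k='all')
newx = sel.fit_transform(normed_data, iris.target)          # all inputs are now non-negative
print(sel.scores_)                                          # chi-squared score per feature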
