Solved – Why does scikit-learn's HashingVectorizer give negative values?

python, scikit-learn, text-mining

The HashingVectorizer in scikit-learn doesn't return token counts by default; it returns values normalized to unit norm (either l1 or l2).

I need the token counts, so I set norm=None. After doing this I no longer get decimals, but I still get negative numbers. It seems the negatives can be removed by setting non_negative=True, but I don't understand why they are there in the first place, or what they mean, and I'm not sure whether their absolute values correspond to the token counts. Can someone please explain? How do I get the HashingVectorizer to return token counts?

You can replicate my results with the following code; I'm using the 20newsgroups dataset that ships with scikit-learn:

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
from sklearn.feature_extraction.text import HashingVectorizer

# by default produces values normalized to unit l2 norm per document
cv = HashingVectorizer(stop_words='english')
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces integer results, both positive and negative
cv = HashingVectorizer(stop_words='english', norm=None)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces only positive results, but it's unclear whether they correspond to counts
cv = HashingVectorizer(stop_words='english', norm=None, non_negative=True)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)
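
(Note: the non_negative parameter was later deprecated and removed from scikit-learn. On recent releases, the closest equivalent of the third call above is alternate_sign=False; a rough sketch, assuming scikit-learn 0.19 or later where that parameter exists:)

# rough modern equivalent of non_negative=True: disabling the +1/-1
# signing entirely, so every bucket holds a plain sum of counts
cv = HashingVectorizer(stop_words='english', norm=None, alternate_sign=False)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)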

Best Answer

I got my answer from the scikit-learn mailing list. Here it is:

"it's a mechanism to compensate for hash collisions, see https://github.com/scikit-learn/scikit-learn/issues/7513 The absolute values are token counts for most practical applications (if you don't have too many collisions). There will be a PR shortly to make this more consistent."

(The above response is from Roman Yurchak)
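
To see the mechanism concretely, here is a small sketch (exact bucket positions and signs depend on the hash function, so the printed values are illustrative) showing how the alternating sign makes colliding tokens cancel rather than silently inflating a shared bucket:

from sklearn.feature_extraction.text import HashingVectorizer

doc = ["apple apple apple banana banana cherry"]

# With the default signed hashing, each token's hash also selects a sign
# (+1 or -1), so two distinct tokens landing in the same bucket tend to
# cancel rather than add up, which is why raw values can be negative.
signed = HashingVectorizer(n_features=4, norm=None)
print(signed.transform(doc).toarray())

# With the sign disabled (alternate_sign=False on recent scikit-learn,
# roughly what non_negative=True achieved on older releases), buckets
# hold plain counts, but a collision silently merges two tokens' counts.
unsigned = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
print(unsigned.transform(doc).toarray())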