By default, the HashingVectorizer in scikit-learn doesn't give token counts; it gives values normalized with the l1 or l2 norm. I need the raw token counts, so I set norm=None. After doing this I'm no longer getting decimals, but I'm still getting negative numbers. It seems the negatives can be removed by setting non_negative=True, but I don't understand why the negatives are there in the first place, or what they mean, and I'm not sure whether the values correspond to token counts. Can someone please explain? How do I get the HashingVectorizer to return token counts?
You can replicate my results with the following code; I'm using the 20newsgroups dataset that ships with scikit-learn:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
from sklearn.feature_extraction.text import HashingVectorizer
# default (norm='l2'): each row is normalized to unit length, so values are fractional
cv = HashingVectorizer(stop_words = 'english')
X_train = cv.fit_transform(twenty_train.data)
print(X_train)
# produces integer results, both positive and negative
cv = HashingVectorizer(stop_words = 'english', norm=None)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)
# produces only positive results, but not sure if they correspond to counts
cv = HashingVectorizer(stop_words = 'english', norm=None, non_negative = True)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)
Best Answer
I got my answer from scikit-learn's mailing list. Here it is:
"it's a mechanism to compensate for hash collisions, see https://github.com/scikit-learn/scikit-learn/issues/7513 The absolute values are token counts for most practical applications (if you don't have too many collisions). There will be a PR shortly to make this more consistent."
(The above response is from Roman Yurchak)
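The collision-compensation mechanism described above can be sketched in plain Python. This is a toy illustration of signed feature hashing, not scikit-learn's actual implementation (which uses MurmurHash3 internally; CRC32 is used here only as a deterministic stand-in): each token's hash determines both a bucket and a sign of +1 or -1, so two distinct tokens that collide in the same bucket tend to cancel each other rather than silently inflate the count. With few collisions, the absolute values are the token counts.

```python
import zlib
from collections import defaultdict

def signed_hash_counts(tokens, n_buckets=2**20):
    """Toy signed feature hashing: returns a sparse dict of bucket -> signed count.

    Illustrative only; scikit-learn uses MurmurHash3, not CRC32.
    """
    counts = defaultdict(int)
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))   # deterministic stand-in hash
        bucket = h % n_buckets                # which feature column the token maps to
        sign = 1 if (h >> 20) & 1 == 0 else -1  # one hash bit chooses the sign
        counts[bucket] += sign
    return dict(counts)

counts = signed_hash_counts("the cat sat on the mat".split())
# When no two distinct tokens share a bucket, abs(value) is the exact token
# count, and the sign varies by token; this is why norm=None output mixes
# positive and negative integers, and why non_negative=True (absolute values)
# recovers approximate counts.
```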