I have two sets of data that I want to transform using CountVectorizer: the first is product_title and the second is product_description. I am attempting to auto-classify the products into around 5 categories and around 50 sub-categories. I expect the text in product_title to be more relevant, so I want the ability to weight it differently than the text in product_description. I am trying to create a script to combine them. So far this is what I have:
from __future__ import print_function
from pprint import pprint
from time import time
import logging
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in versions before 0.20
from sklearn.pipeline import Pipeline
print(__doc__)
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,format='%(asctime)s %(levelname)s %(message)s')
data = {'data':[],'target1':[],'target2':[]}
tempdata = {'data1':[],'data2':[]}
with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        data['data'].append(row[0])
        tempdata['data1'].append(row[-2])
        tempdata['data2'].append(row[-1])
###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data['data'], data['target1'])
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
The problem I am having is with:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
I would need to somehow run something like:
combined_features = FeatureUnion([('title', data['data1']), ('description', data['data2'])])
And I would need to run combined_features so that a CountVectorizer is applied separately to data['data1'] and data['data2'], and then the resulting features are combined and run through the pipeline. I would prefer to do this so that I can just feed the two feature sets and the targets to the pipeline and it would do the rest. The result would allow duplicate feature names; for example, theoretically there could be features like computer_title and computer_description. Any advice on how to integrate this would be helpful.
Best Answer
The Feature Union with Heterogeneous Data Sources example from the scikit-learn docs also has a simple ItemSelector transformer that basically picks one field out of a dict (or other structure) to work with, which can be combined with a FeatureUnion so each field gets its own vectorizer.
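As a sketch of how that fits together (assuming you pass the training data in as a dict, with hypothetical keys 'title' and 'description'): each branch of the FeatureUnion selects its own field via ItemSelector, vectorizes it independently, and the transformer_weights argument lets you weight the title features more heavily than the description features.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single field from a dict of lists/arrays by key."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # one vectorizer branch per text field; the 'title' and
            # 'description' keys are placeholders for your own fields
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
            ])),
            ('description', Pipeline([
                ('selector', ItemSelector(key='description')),
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
            ])),
        ],
        # weight title features more heavily than description features
        transformer_weights={'title': 2.0, 'description': 1.0},
    )),
    ('clf', SGDClassifier()),
])
```

You would then call pipeline.fit({'title': titles, 'description': descriptions}, targets). Because each branch has its own CountVectorizer, duplicate terms across fields stay separate (your computer_title / computer_description case), and the weights can even be tuned in the grid search via a parameter name like 'union__transformer_weights'.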