Solved – splitting pipeline in sklearn

scikit learn

I have a pandas dataframe df with the following features:
visitor_id, feature_1, feature_2, …, feature_100, truth_labels

I implemented the following model on sklearn:

1st step: scaling df.drop(['visitor_id', 'truth_labels'], axis=1) using sklearn.preprocessing.StandardScaler()

2nd step: clustering df.drop(['visitor_id', 'truth_labels'], axis=1) into 10 clusters using sklearn.cluster.MiniBatchKMeans(), then setting df['cluster'] to the corresponding cluster labels.

3rd step: fitting 10 sklearn.linear_model.LogisticRegression() models on df.drop(['visitor_id'], axis=1), one per cluster.

I have two questions:

1- Is it possible to build a Pipeline in order to aggregate these three steps? In particular, how can I specify that I want to train 10 distinct sklearn.linear_model.LogisticRegression() models on my data split by cluster?

2- Is it possible to save this full pipeline? How?

Best Answer

  1. Yes.
  2. Yes.

1) You can implement your own logic as a custom pipeline step. There is a detailed example of rolling your own estimator in the sklearn documentation.

In your case it would be:

from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline, make_pipeline

# ClassifierMixin rather than TransformerMixin: this step predicts, it does
# not transform the data for a later step.
class TenLogisticRegressionsClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, N=10):
        self.N = N  # store __init__ params as attributes so clone()/get_params() work
        self.estimators = {i: LogisticRegression() for i in range(N)}

    def fit(self, X, y):
        for k, v in self.estimators.items():
            # Here some logic to select the rows of (X, y) belonging to
            # cluster k as newX, newY
            v.fit(newX, newY)
        return self  # a Pipeline step's fit must return self

    def predict(self, X):
        predictions = ...  # allocate an output array
        for k, v in self.estimators.items():
            # Here some logic to route each row of X to its cluster's model
            predictions[...] = v.predict(newX)
        return predictions

pipeline = Pipeline([('something', TenLogisticRegressionsClassifier())])
# or
pipeline = make_pipeline(TenLogisticRegressionsClassifier())
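To make the sketch concrete, here is one runnable way to fill in the "divide dataset" logic. The names (ClusterLabeler, PerClusterLogisticRegression) and the convention of appending the cluster id as the last column are my own assumptions, not part of your setup; a real version would also want to handle clusters that are empty or single-class at fit time:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ClusterLabeler(BaseEstimator, TransformerMixin):
    """Fits MiniBatchKMeans and appends the cluster id as an extra last column."""

    def __init__(self, n_clusters=10):
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.kmeans_ = MiniBatchKMeans(n_clusters=self.n_clusters, random_state=0)
        self.kmeans_.fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X)
        return np.column_stack([X, labels])

class PerClusterLogisticRegression(BaseEstimator, ClassifierMixin):
    """Fits one LogisticRegression per cluster id found in the last column."""

    def __init__(self, n_clusters=10):
        self.n_clusters = n_clusters

    def fit(self, X, y):
        y = np.asarray(y)
        features, clusters = X[:, :-1], X[:, -1].astype(int)
        self.estimators_ = {}
        for k in range(self.n_clusters):
            mask = clusters == k
            if mask.any():  # skip clusters with no training rows
                self.estimators_[k] = LogisticRegression().fit(features[mask], y[mask])
        return self

    def predict(self, X):
        features, clusters = X[:, :-1], X[:, -1].astype(int)
        out = np.empty(len(X), dtype=int)
        for k, est in self.estimators_.items():
            mask = clusters == k
            if mask.any():  # route each row to its cluster's model
                out[mask] = est.predict(features[mask])
        return out

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('cluster', ClusterLabeler(n_clusters=10)),
    ('classify', PerClusterLogisticRegression(n_clusters=10)),
])
```

Because all three steps live in one Pipeline, a single `pipeline.fit(X, y)` / `pipeline.predict(X)` scales, clusters, and dispatches to the per-cluster models.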

2) Yes. This is answered on Stack Overflow: persist the fitted pipeline with joblib.

import joblib  # sklearn.externals.joblib is deprecated (removed in scikit-learn 0.23)

joblib.dump(pipeline, 'pipeline.pkl')
# or, compressed into a single smaller file:
joblib.dump(pipeline, 'filename.pkl', compress=1)

Then you can load it and use:

pipeline_loaded = joblib.load('filename.pkl')
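For completeness, a minimal round trip; the pipeline here is a toy stand-in for yours:

```python
import numpy as np
import joblib  # standalone joblib package; sklearn.externals.joblib was removed in 0.23
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a toy pipeline, persist it to disk, and reload it.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

joblib.dump(pipe, 'pipeline.pkl', compress=1)  # compress=1 writes one smaller file
restored = joblib.load('pipeline.pkl')
```

The reloaded object is a fully fitted Pipeline, so `restored.predict(...)` works immediately with no refitting.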