I have a pandas dataframe df with the following features:
visitor_id, feature_1, feature_2, …, feature_100, truth_labels
I implemented the following model in sklearn:
1st step: scale df.drop(['visitor_id', 'truth_labels'], axis=1) using sklearn.preprocessing.StandardScaler().
2nd step: cluster df.drop(['visitor_id', 'truth_labels'], axis=1) into 10 clusters using sklearn.cluster.MiniBatchKMeans(), and set df['cluster'] to the corresponding cluster labels.
3rd step: fit 10 sklearn.linear_model.LogisticRegression() models on df.drop(['visitor_id'], axis=1), one per cluster.
I have two questions:
1- Is it possible to build a Pipeline that combines these three steps? In particular, how can I specify that I want to train 10 distinct sklearn.linear_model.LogisticRegression() models on my data, split by cluster?
2- Is it possible to save this full pipeline? How?
Best Answer
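1) You can implement your own logic for a pipeline step by writing a custom estimator. The sklearn documentation has a detailed example.
In your case, the custom step would run the clustering and then fit one LogisticRegression per cluster.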
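A minimal sketch of such a custom step, under some assumptions: the class name ClusterThenClassify, its parameters, and the DummyClassifier fallback (used for clusters that contain a single class or end up empty) are hypothetical choices for illustration and are not part of the question. The sketch also assumes the clustering should run on the scaled features, since the scaler comes first in the pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cluster import MiniBatchKMeans
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClusterThenClassify(ClassifierMixin, BaseEstimator):
    """Hypothetical final step: cluster the rows, then fit one classifier per cluster."""

    def __init__(self, n_clusters=10, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.clusterer_ = MiniBatchKMeans(
            n_clusters=self.n_clusters, n_init=3, random_state=self.random_state
        ).fit(X)
        labels = self.clusterer_.labels_
        # Assumption: fall back to the majority class for clusters with no training rows.
        self.fallback_ = DummyClassifier(strategy="most_frequent").fit(X, y)
        self.models_ = {}
        for k in range(self.n_clusters):
            mask = labels == k
            if not mask.any():
                continue  # empty cluster: predict() will use the fallback
            if np.unique(y[mask]).size < 2:
                # LogisticRegression needs at least two classes to fit.
                model = DummyClassifier(strategy="most_frequent")
            else:
                model = LogisticRegression(max_iter=1000)
            self.models_[k] = model.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        labels = self.clusterer_.predict(X)
        out = np.empty(X.shape[0], dtype=self.classes_.dtype)
        # Route each row to the model trained on its cluster.
        for k in np.unique(labels):
            mask = labels == k
            model = self.models_.get(int(k), self.fallback_)
            out[mask] = model.predict(X[mask])
        return out


# Scaling, clustering, and the per-cluster models combined into one estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("cluster_clf", ClusterThenClassify(n_clusters=10)),
])
```

With the dataframe from the question, this would be fitted as pipeline.fit(df.drop(['visitor_id', 'truth_labels'], axis=1), df['truth_labels']). Note that truth_labels is passed as y rather than left among the features.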
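2) Yes. The answer is on Stack Overflow: a fitted pipeline can be serialized like any other sklearn estimator, for example with joblib or pickle.
You can then load it back and use it for prediction.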
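A short sketch of saving and reloading with joblib, assuming a synthetic dataset and a plain StandardScaler plus LogisticRegression pipeline as a stand-in for the full one. The file name pipeline.joblib is a hypothetical choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature columns and truth_labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Any fitted Pipeline is saved the same way, including one with a custom final step.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)   # save the fitted pipeline to disk
    restored = joblib.load(path)  # load it back, e.g. in another process

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
```

If the pipeline contains a custom class, that class has to be importable under the same module path when the file is loaded. Load such files only from trusted sources and with a compatible sklearn version, since joblib uses pickle underneath.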