Solved – splitting pipeline in sklearn

scikit learn

I have a pandas dataframe df with the following features:
visitor_id, feature_1, feature_2, …, feature_100, truth_labels

I implemented the following model on sklearn:

1st step: scaling df.drop(['visitor_id', 'truth_labels'], axis=1) using sklearn.preprocessing.StandardScaler()

2nd step: clustering df.drop(['visitor_id', 'truth_labels'], axis=1) into 10 clusters using sklearn.cluster.MiniBatchKMeans(), then setting df['cluster'] to the corresponding cluster labels.

3rd step: fitting 10 sklearn.linear_model.LogisticRegression() models on df.drop(['visitor_id'], axis=1), one per cluster.

I have two questions:

1- Is it possible to build a Pipeline in order to aggregate these three steps? In particular, how can I specify that I want to train 10 distinct sklearn.linear_model.LogisticRegression() models on my data split by cluster?

2- Is it possible to save this full pipeline? How?

Best Answer

  1. Yes.
  2. Yes.

1) You can implement your own logic as a custom pipeline step. There is a detailed example of rolling your own estimator in the sklearn documentation.

In your case it would be:

from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline, make_pipeline

# ClassifierMixin rather than TransformerMixin: this step predicts, it does
# not transform the data for a later step.
class TenLogisticRegressionsClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, N=10):
        self.N = N  # store __init__ params as attributes so clone()/get_params() work
        self.estimators = {i: LogisticRegression() for i in range(N)}

    def fit(self, X, y):
        for k, v in self.estimators.items():
            # Here some logic to select the rows of (X, y) belonging to
            # cluster k as newX, newY
            v.fit(newX, newY)
        return self  # a Pipeline step's fit must return self

    def predict(self, X):
        predictions = ...  # allocate an output array
        for k, v in self.estimators.items():
            # Here some logic to route each row of X to its cluster's model
            predictions[...] = v.predict(newX)
        return predictions

pipeline = Pipeline([('something', TenLogisticRegressionsClassifier())])
# or
pipeline = make_pipeline(TenLogisticRegressionsClassifier())
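To make the sketch concrete, here is one runnable way to fill in the "divide dataset" logic. The names (ClusterLabeler, PerClusterLogisticRegression) and the convention of appending the cluster id as the last column are my own assumptions, not part of your setup; a real version would also want to handle clusters that are empty or single-class at fit time:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ClusterLabeler(BaseEstimator, TransformerMixin):
    """Fits MiniBatchKMeans and appends the cluster id as an extra last column."""

    def __init__(self, n_clusters=10):
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.kmeans_ = MiniBatchKMeans(n_clusters=self.n_clusters, random_state=0)
        self.kmeans_.fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X)
        return np.column_stack([X, labels])

class PerClusterLogisticRegression(BaseEstimator, ClassifierMixin):
    """Fits one LogisticRegression per cluster id found in the last column."""

    def __init__(self, n_clusters=10):
        self.n_clusters = n_clusters

    def fit(self, X, y):
        y = np.asarray(y)
        features, clusters = X[:, :-1], X[:, -1].astype(int)
        self.estimators_ = {}
        for k in range(self.n_clusters):
            mask = clusters == k
            if mask.any():  # skip clusters with no training rows
                self.estimators_[k] = LogisticRegression().fit(features[mask], y[mask])
        return self

    def predict(self, X):
        features, clusters = X[:, :-1], X[:, -1].astype(int)
        out = np.empty(len(X), dtype=int)
        for k, est in self.estimators_.items():
            mask = clusters == k
            if mask.any():  # route each row to its cluster's model
                out[mask] = est.predict(features[mask])
        return out

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('cluster', ClusterLabeler(n_clusters=10)),
    ('classify', PerClusterLogisticRegression(n_clusters=10)),
])
```

Because all three steps live in one Pipeline, a single `pipeline.fit(X, y)` / `pipeline.predict(X)` scales, clusters, and dispatches to the per-cluster models.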

2) Yes. This is answered on Stack Overflow: persist the fitted pipeline with joblib.

import joblib  # sklearn.externals.joblib is deprecated (removed in scikit-learn 0.23)

joblib.dump(pipeline, 'pipeline.pkl')
# or, compressed into a single smaller file:
joblib.dump(pipeline, 'filename.pkl', compress=1)

Then you can load it and use:

pipeline_loaded = joblib.load('filename.pkl')
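For completeness, a minimal round trip; the pipeline here is a toy stand-in for yours:

```python
import numpy as np
import joblib  # standalone joblib package; sklearn.externals.joblib was removed in 0.23
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a toy pipeline, persist it to disk, and reload it.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

joblib.dump(pipe, 'pipeline.pkl', compress=1)  # compress=1 writes one smaller file
restored = joblib.load('pipeline.pkl')
```

The reloaded object is a fully fitted Pipeline, so `restored.predict(...)` works immediately with no refitting.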