For what you're trying to do, the second approach seems better to me. If your model is correctly specified, then you'll converge to the right answer as $n \rightarrow \infty$. Meanwhile, the first approach, i.e., cutting at 0.5, could go horribly wrong if you have lots of probabilities that are systematically around 40% or 60%.
One follow-up, though: Are you sure you will always have 0 or 1 purchases per time period? Or is it also possible for people to buy 2, 3, etc. items? If so, I'd recommend doing Poisson regression instead of logistic regression. Poisson regression is the standard way of predicting quantities of the form "number of times something happens."
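If counts above 1 are possible, here is a minimal sketch with scikit-learn's PoissonRegressor (the names X and y_counts are placeholders for your own feature matrix and per-period purchase counts):

from sklearn.linear_model import PoissonRegressor

# y_counts holds the number of purchases per person/time period (placeholder name)
pois = PoissonRegressor(alpha=0.0)      # alpha=0.0 disables regularization
pois.fit(X, y_counts)
expected_purchases = pois.predict(X)    # expected number of purchases per period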
I do not know about MATLAB, but in scikit-learn you can get the probabilities by doing a manual nested cross-validation.
The code would be roughly,
# Imports needed for this snippet (X and y are assumed to be your feature matrix and labels)
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Dividing the data set for the outer CV
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=43)

# Outer CV loop
for train, test in cv_outer.split(X, y):
    # Pipeline and parameters for the grid search
    # (solver='liblinear' supports both the 'l1' and 'l2' penalties)
    pipe_logis = Pipeline([('logis', LogisticRegression(random_state=1,
                                                        solver='liblinear'))])
    param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    param_grid = [{'logis__C': param_range, 'logis__penalty': ['l1']},
                  {'logis__C': param_range, 'logis__penalty': ['l2']}]

    # Dividing the training fold for the inner CV
    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Inner CV done by these two lines
    gs = GridSearchCV(estimator=pipe_logis, param_grid=param_grid,
                      scoring='f1', cv=cv_inner)
    gs.fit(X[train], y[train])  # only the training fold is used for hyper-parameter tuning

    # Get the best parameters
    print(gs.best_params_)

    # Predict with the best parameters
    preds = gs.best_estimator_.predict(X[test])

    # These are your probabilities. Here I am getting them for the test fold,
    # but you can also get probabilities for your training fold.
    probas_ = gs.best_estimator_.predict_proba(X[test])
Also, nested cross-validation is only for getting an unbiased performance estimate. After getting a sense of the performance of various algorithms, you choose the algorithm that performs best. Then you run a k-fold cross-validation on the whole dataset to finally choose your best set of hyper-parameters for that algorithm; you do not report the performance measurement from this later k-fold cross-validation. A sketch of that second step follows.
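A minimal sketch of the final hyper-parameter selection, reusing the pipe_logis and param_grid objects defined in the snippet above (the performance you report should still come from the nested CV, not from this step):

# Final hyper-parameter selection: plain k-fold CV on the whole dataset
final_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
final_gs = GridSearchCV(estimator=pipe_logis, param_grid=param_grid,
                        scoring='f1', cv=final_cv)
final_gs.fit(X, y)
print(final_gs.best_params_)  # use these parameters, but report the nested-CV performance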
Of course, all of this is for small datasets. If you have a large dataset, you can simply divide it into a training set, a validation set and a test set, and use the validation set for hyper-parameter tuning.
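For that large-data case, a rough sketch using train_test_split (the 60/20/20 split is only an example ratio):

from sklearn.model_selection import train_test_split

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4,
                                                  random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=0, stratify=y_tmp)
# Tune hyper-parameters against (X_val, y_val); report final performance on (X_test, y_test).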
Best Answer
Usually it is feasible to iterate over the predicted probabilities with various cut-off points from 0 to 1, in increments of, say, 0.01, and at each cut-off compute some metric that is of interest to you (i.e. one you want to maximize). That could be accuracy, sensitivity, specificity, the K-S statistic, or the value of some other variable that is not necessarily part of your model.
Then plot the cut-off against that metric and you will get an idea of which cut-off works best for you. Once the cut-off is determined, just perform the cross-validation with that value.
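A minimal sketch of that sweep, assuming you already have predicted probabilities (probas_ from the earlier snippet) and the matching true labels (here called y_test), with accuracy as the example metric:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

# Sweep cut-offs from 0 to 1 in steps of 0.01 and evaluate the metric at each one
cutoffs = np.arange(0.0, 1.01, 0.01)
metric_values = [accuracy_score(y_test, (probas_[:, 1] >= c).astype(int))
                 for c in cutoffs]

# Plot cut-off vs. metric to see where it peaks
plt.plot(cutoffs, metric_values)
plt.xlabel('cut-off')
plt.ylabel('accuracy')
plt.show()

best_cutoff = cutoffs[int(np.argmax(metric_values))]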