For what you're trying to do, the second approach seems better to me. If your model is correctly specified, then you'll converge to the right answer as $n \rightarrow \infty$. Meanwhile, the first approach, i.e., cutting at 0.5, could go horribly wrong if you have lots of probabilities that are systematically around 40% or 60%.
One follow-up, though: Are you sure you will always have 0 or 1 purchases per time period? Or is it also possible for people to buy 2, 3, etc. items? If so, I'd recommend doing Poisson regression instead of logistic regression. Poisson regression is the standard way of predicting quantities of the form "number of times something happens."
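If counts above 1 are possible, here is a minimal sketch with scikit-learn's PoissonRegressor (the names X and y_counts are placeholders for your own feature matrix and per-period purchase counts):

from sklearn.linear_model import PoissonRegressor

# y_counts holds the number of purchases per person/time period (placeholder name)
pois = PoissonRegressor(alpha=0.0)      # alpha=0.0 disables regularization
pois.fit(X, y_counts)
expected_purchases = pois.predict(X)    # expected number of purchases per period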
I do not know about MATLAB, but in scikit-learn you can get the probabilities by doing a manual nested cross-validation.
The code would be roughly,
# Imports needed for this snippet (X and y are assumed to be your feature matrix and labels)
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Dividing the data set for the outer CV
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=43)

# Outer CV loop
for train, test in cv_outer.split(X, y):
    # Pipeline and parameters for the grid search
    # (solver='liblinear' supports both the 'l1' and 'l2' penalties)
    pipe_logis = Pipeline([('logis', LogisticRegression(random_state=1,
                                                        solver='liblinear'))])
    param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    param_grid = [{'logis__C': param_range, 'logis__penalty': ['l1']},
                  {'logis__C': param_range, 'logis__penalty': ['l2']}]

    # Dividing the training fold for the inner CV
    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Inner CV done by these two lines
    gs = GridSearchCV(estimator=pipe_logis, param_grid=param_grid,
                      scoring='f1', cv=cv_inner)
    gs.fit(X[train], y[train])  # only the training fold is used for hyper-parameter tuning

    # Get the best parameters
    print(gs.best_params_)

    # Predict with the best parameters
    preds = gs.best_estimator_.predict(X[test])

    # These are your probabilities. Here I am getting them for the test fold,
    # but you can also get probabilities for your training fold.
    probas_ = gs.best_estimator_.predict_proba(X[test])
Also, nested cross-validation is only for getting an unbiased performance estimate. After getting a sense of the performance of various algorithms, you choose the algorithm that performs best. Then you run a k-fold cross-validation on the whole dataset to finally choose your best set of hyper-parameters for that algorithm; you do not report the performance measurement from this later k-fold cross-validation. A sketch of that second step follows.
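A minimal sketch of the final hyper-parameter selection, reusing the pipe_logis and param_grid objects defined in the snippet above (the performance you report should still come from the nested CV, not from this step):

# Final hyper-parameter selection: plain k-fold CV on the whole dataset
final_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
final_gs = GridSearchCV(estimator=pipe_logis, param_grid=param_grid,
                        scoring='f1', cv=final_cv)
final_gs.fit(X, y)
print(final_gs.best_params_)  # use these parameters, but report the nested-CV performance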
Of course, all of this is for small datasets. If you have a large dataset, you can simply divide it into a training set, a validation set and a test set, and use the validation set for hyper-parameter tuning.
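For that large-data case, a rough sketch using train_test_split (the 60/20/20 split is only an example ratio):

from sklearn.model_selection import train_test_split

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4,
                                                  random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=0, stratify=y_tmp)
# Tune hyper-parameters against (X_val, y_val); report final performance on (X_test, y_test).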
Best Answer
Usually it is feasible to iterate over the predicted probabilities with various cut-off points from 0 to 1, in increments of, say, 0.01, and at each cut-off compute some metric that is of interest to you (i.e. one you want to maximize). That could be accuracy, sensitivity, specificity, the K-S statistic, or the value of some other variable that is not necessarily part of your model.
Then plot the cut-off against that metric and you will get an idea of which cut-off works best for you. Once the cut-off is determined, just perform the cross-validation with that value.
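A minimal sketch of that sweep, assuming you already have predicted probabilities (probas_ from the earlier snippet) and the matching true labels (here called y_test), with accuracy as the example metric:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

# Sweep cut-offs from 0 to 1 in steps of 0.01 and evaluate the metric at each one
cutoffs = np.arange(0.0, 1.01, 0.01)
metric_values = [accuracy_score(y_test, (probas_[:, 1] >= c).astype(int))
                 for c in cutoffs]

# Plot cut-off vs. metric to see where it peaks
plt.plot(cutoffs, metric_values)
plt.xlabel('cut-off')
plt.ylabel('accuracy')
plt.show()

best_cutoff = cutoffs[int(np.argmax(metric_values))]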