Solved – How to get predicted values from cross validation

cross-validation, glmnet, hyperparameter, logistic, predictive-models

This question is about cross-validation and prediction with regularized logistic regression: by parameters I mean the beta coefficients for each predictor variable, the output is the predicted probability of belonging to a group, and the performance measure is AUC. I am using glmnet in MATLAB.

For my training set I have group A (a normal group) and group B (a patient group), and for my validation set I have group C (a high-risk group that later converts to the patient group) and group D (a high-risk group that does not convert). I want to take the parameters that best differentiate groups A and B and use them to predict conversion to the patient group among the people in groups C and D.

To this end, I have run 10-fold cross-validation on groups A and B to find the best lambda (lambda_1se, to be exact), and I am taking the parameters at that lambda value and applying them to groups C and D to get predicted probabilities (via cvglmnet and cvglmnetPredict).
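For reference, a rough scikit-learn analogue of this workflow would be the sketch below (I am actually using the MATLAB glmnet port; X_AB, y_AB and X_CD are placeholder names, and LogisticRegressionCV has no lambda_1se rule, so it simply picks the C with the best mean CV score):

from sklearn.linear_model import LogisticRegressionCV

# Rough analogue of cvglmnet + cvglmnetPredict: 10-fold CV over a grid of
# regularization strengths, refit on all of groups A and B, then predict
# probabilities for groups C and D
clf = LogisticRegressionCV(Cs=20, cv=10, penalty='l1', solver='liblinear',
                           scoring='roc_auc', random_state=0)
clf.fit(X_AB, y_AB)                        # training set: groups A and B
probas_CD = clf.predict_proba(X_CD)[:, 1]  # predicted probabilities for C and D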

Here, it seems that:

  1. The best lambda is the only thing searched for in the CV, much like the hyperparameter optimization that would happen in the inner loop of a nested cross-validation.
  2. After the best lambda is found by CV, the model is refit on my whole training dataset (groups A and B) to get the parameters at that specific lambda, and those parameters are used to predict the outcome in my validation dataset.

    Is my understanding correct?

What I would like to do is make a boxplot of the predicted probabilities for groups A~D so that I can see the trend of predicted values across those groups (ideally the values would decrease steadily from patient -> high-risk later converter -> high-risk non-converter -> normal).
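Assuming I already have arrays of predicted probabilities for each group (probas_A through probas_D below are placeholder names), the plot itself would just be something like:

import matplotlib.pyplot as plt

# Boxplot of predicted probabilities per group; probas_A ... probas_D are
# placeholder arrays of predicted probabilities
plt.boxplot([probas_B, probas_C, probas_D, probas_A],
            labels=['patient (B)', 'converter (C)', 'non-converter (D)', 'normal (A)'])
plt.ylabel('Predicted probability of belonging to the patient group')
plt.show()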

Here is my main question:
I have already obtained the predicted probability values for groups C and D, but how do I get the predicted probability values for groups A and B (the training set on which I did the CV)?

I first thought of taking the predicted probabilities computed for the left-out test set in each of the 10 folds. However, my current understanding is that if I did that, each test set's predicted probabilities would be based on a different lambda, since the training set in each fold would have its own optimal lambda.
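To illustrate, out-of-fold probabilities at a single fixed regularization strength would look roughly like this in scikit-learn (C_at_lambda_1se is a placeholder for the C corresponding to my chosen lambda), but I am not sure this is the right way to reconcile it with what cvglmnet does per fold:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold predicted probabilities for groups A and B, with the
# regularization strength held fixed across all folds (placeholder value)
fixed_model = LogisticRegression(penalty='l1', solver='liblinear',
                                 C=C_at_lambda_1se)
oof_probas_AB = cross_val_predict(fixed_model, X_AB, y_AB, cv=10,
                                  method='predict_proba')[:, 1]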

Best Answer

I do not know about MATLAB, but in scikit-learn you can get the probabilities by doing a manual nested cross-validation.

The code would be roughly,

# Imports needed for the example
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Dividing the data set for the outer CV
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=43)

# Outer CV loop
for train, test in cv_outer.split(X, y):
    # Pipeline and parameter grid for the grid search
    # (the liblinear solver supports both l1 and l2 penalties)
    pipe_logis = Pipeline([('logis', LogisticRegression(solver='liblinear',
                                                        random_state=1))])
    param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    param_grid = [{'logis__C': param_range, 'logis__penalty': ['l1']},
                  {'logis__C': param_range, 'logis__penalty': ['l2']}]

    # Dividing the training fold for the inner CV
    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Inner CV done by these two lines; only the training fold is used
    # for hyper-parameter tuning
    gs = GridSearchCV(estimator=pipe_logis, param_grid=param_grid,
                      scoring='f1', cv=cv_inner)
    gs.fit(X[train], y[train])

    # Best hyper-parameters found by the inner CV
    print(gs.best_params_)

    # Predicted class labels for the outer test fold
    y_pred = gs.best_estimator_.predict(X[test])

    # These are your probabilities. Here I am getting them for the test fold,
    # but you can do the same for the training fold (X[train]).
    probas_ = gs.best_estimator_.predict_proba(X[test])

Also, a nested cross-validation is only for unbiased performance measurement. After getting a sense of the performance of various algorithms, you choose the algorithm with the best performance. Then you do a k-fold cross-validation on the whole dataset to finally choose the best set of hyper-parameters for that algorithm. You do not want to use the performance measurement from this latter k-fold cross-validation.
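A minimal sketch of that final tuning step, reusing pipe_logis and param_grid from above and assuming X, y is the full dataset:

# Final hyper-parameter selection on the whole dataset; refit=True (the
# default) retrains on all of X, y with the best parameters found
final_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
final_gs = GridSearchCV(estimator=pipe_logis, param_grid=param_grid,
                        scoring='f1', cv=final_cv, refit=True)
final_gs.fit(X, y)

print(final_gs.best_params_)
final_model = final_gs.best_estimator_   # model to use going forward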

Of course, all of this is for small datasets. If you have a large dataset, you can simply split it into a training set, a validation set and a test set, and use the validation set for hyper-parameter tuning.
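For example, a simple stratified 60/20/20 split (the proportions here are just an assumption) could look like:

from sklearn.model_selection import train_test_split

# Split into 60% training, 20% validation, 20% test (arbitrary proportions)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)
# Tune hyper-parameters on (X_train, y_train) against (X_valid, y_valid),
# then report final performance once on (X_test, y_test).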