Appropriate way to get cross-validated performance metrics

Tags: auc, calibration, cross-validation, logistic, predictive-models

For cross-validation of a logistic regression classifier, it seems to me that there are several different approaches to calculating each performance metric (a short sketch contrasting them follows the list below):

  1. The performance metric is calculated separately in each left out fold, repeated across k folds, then averaged.
  2. The probabilities are calculated in each left out fold, repeated across k folds, concatenated into a long vector and a single performance metric calculated.
  3. Same as 2 but instead of concatenating the probabilities, store the indexes and average the predicted probabilities for each outcome instance. Calculate a single performance metric from the averaged probabilities.

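To make the distinction concrete, here is a minimal sketch of approaches 1 and 2 for the AUC using scikit-learn; the data, model, and fold settings are placeholders, and approach 3 would additionally require repeating the whole procedure and averaging the stored probabilities per observation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_aucs = []                    # approach 1: one AUC per left-out fold
oof_probs = np.empty(len(y))      # approach 2: pooled out-of-fold probabilities

for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], p))
    oof_probs[test_idx] = p

auc_approach_1 = np.mean(fold_aucs)            # average of the per-fold AUCs
auc_approach_2 = roc_auc_score(y, oof_probs)   # one AUC on the concatenated predictions
```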
This question has been asked before with respect to AUCs here. That question considered the first two options above, and there was no consensus on the correct method: the accepted answer suggested the first method, but the most upvoted answer suggested the second and cited the paper "Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement" in support.

Another question asks the same thing, and the answer there suggests there is merit in both the first and second approaches.

This question is also relevant.

The issue is also addressed with reference to AUCs in Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis (p. 204):

"the machine learning community often uses other strategies to calculate the cross-validated AUC. For example, Bradley pointed out that some averaged AUCs from ROC curves correspond to each partition and others aggregated the outputs of all folds first, producing one ROC and calculating its AUC"

Altogether there does not appear to be a consensus.

Another motivation for asking again is that I cannot find any reference that considers how to calculate performance metrics other than the AUC under cross-validation, such as the calibration slope (beta) or the net benefit from decision curve analysis. Does anyone have any advice?
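To make that part of the question concrete, here is the kind of thing I have in mind for the calibration slope, computed from the pooled out-of-fold probabilities of approach 2 (a sketch only, using statsmodels; whether pooling, per-fold averaging, or something else is appropriate is exactly what I am asking):

```python
import numpy as np
import statsmodels.api as sm

def calibration_slope(y_true, p_oof, eps=1e-12):
    """Slope from regressing the outcome on the logit of the out-of-fold
    probabilities; a slope of 1 indicates ideal calibration."""
    p = np.clip(p_oof, eps, 1 - eps)
    logit_p = np.log(p / (1 - p))
    fit = sm.GLM(y_true, sm.add_constant(logit_p),
                 family=sm.families.Binomial()).fit()
    return fit.params[1]   # index 0 is the intercept added by add_constant
```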

Best Answer

I think that if the folds are of equal size, then methods 1 and 2 will give the same mean value (or very similar values if the folds are only approximately the same size).
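For a metric that is a simple average of per-instance losses, this can be seen directly: with $k$ folds $F_1,\dots,F_k$ each containing $m$ observations and per-instance loss $\ell_i$,

$$\frac{1}{k}\sum_{j=1}^{k}\left(\frac{1}{m}\sum_{i\in F_j}\ell_i\right)=\frac{1}{km}\sum_{i=1}^{km}\ell_i ,$$

so averaging the per-fold values (method 1) and computing the metric once on the pooled predictions (method 2) coincide; with unequal fold sizes they differ only in how the folds are weighted.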

Personally I would tend to use the first method, because I then also have a sensible estimate of the variance of the performance metric, which means I can assess whether a difference in performance between two models is at least statistically significant.

To expand a bit more on that, say you have two models $M_1$ and $M_2$; you can use cross-validation to estimate the performance of each and pick the better one. However, if you split the data into folds in a different way (a different random seed) then the performance estimates are likely to be different, and could even be different enough for the ranking of the two models to be reversed.

So we would like some way to see whether the difference in performance is large compared to the variation caused by the random partitioning of the data into folds. We can do that with method 1, as we have a performance estimate from each fold, and those estimates differ only in the partitioning of the data, so their variance measures that variability. With method 2 we have only one performance estimate, so we have no corresponding estimate of the uncertainty due to the sampling of the data to form folds. This is an excellent reason to use method 1 rather than method 2.
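A minimal sketch of that comparison, with placeholder data and two arbitrary logistic regression variants standing in for $M_1$ and $M_2$; the pairing relies on both models being scored on the same folds, and the caveat in the next paragraph about dependence between folds applies to the resulting p-value:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # fixed seed: same folds for both models

scores_m1 = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=cv, scoring="roc_auc")
scores_m2 = cross_val_score(LogisticRegression(max_iter=1000, C=0.01), X, y,
                            cv=cv, scoring="roc_auc")

t_stat, p_value = ttest_rel(scores_m1, scores_m2)  # paired t-test over the k per-fold AUCs
```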

Note that the variance discussed above is not the variance we would see if we had a completely new training set and a new test set. The variance from cross-validation is likely to be optimistically narrow, because the training sets in the different folds are not independent but share examples (and the test data in some folds appear in the training sets of others, so those are not completely independent either). This lack of independence has an impact on estimating the uncertainty of the performance estimate; see Nadeau and Bengio (2003):

Nadeau, C., Bengio, Y. Inference for the Generalization Error. Machine Learning 52, 239–281 (2003). https://doi.org/10.1023/A:1024068626366
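As I understand it, the practical recipe from that paper is the corrected resampled t-test, which inflates the naive variance of the per-split performance differences to allow for the overlap between training sets. A sketch of the correction for $J$ train/test splits with $n_1$ training and $n_2$ test observations:

```python
import numpy as np
from scipy.stats import t as t_dist

def corrected_resampled_ttest(d, n_train, n_test):
    """d: per-split differences in a performance metric between two models.
    Returns the Nadeau-Bengio corrected t statistic and two-sided p-value."""
    d = np.asarray(d, dtype=float)
    J = len(d)
    var = np.var(d, ddof=1)
    t_stat = d.mean() / np.sqrt((1.0 / J + n_test / n_train) * var)
    p_value = 2 * t_dist.sf(abs(t_stat), df=J - 1)
    return t_stat, p_value
```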

Following the link to the related question, I can also comment on method 3. I think I agree with @FrankHarrell's answer there that it is not an appropriate method. The point of cross-validation is to estimate not the performance of a particular model, but the performance of a method for producing a model, so the performance must be measured over the test folds using models fitted by that method. If you average the outputs of models trained on different folds, then you are not evaluating the model you will deploy in operation, as the deployed model will not have that averaging step.

If you want to do something like that, I would opt for Bagging, where you are explicitly forming a committee or ensemble model (which has the averaging) and you can use the "out-of-bag" estimator to get a good performance estimate for the ensemble.
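For example, with scikit-learn (a sketch; the base model, data, and number of estimators are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)

# `estimator=` in recent scikit-learn versions (older releases use `base_estimator=`)
bag = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                        n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)

oob_accuracy = bag.oob_score_                                  # out-of-bag accuracy of the ensemble
oob_auc = roc_auc_score(y, bag.oob_decision_function_[:, 1])   # OOB probabilities -> AUC
```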

I have to say, I have never really liked repeated k-fold cross-validation. It seems to me that there is more symmetry in just having 100 random 90%/10% training/test splits than in 10 repetitions of 10-fold cross-validation. However, I don't know of a good statistical reason to prefer one over the other.
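For reference, the two schemes side by side as scikit-learn splitters (a sketch with placeholder data and model; both yield 100 test-set scores):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     StratifiedShuffleSplit, cross_val_score)

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# 100 random 90%/10% training/test splits ("Monte Carlo" cross-validation)
mc_cv = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
# 10 repetitions of 10-fold cross-validation (also 100 test folds in total)
rep_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

scores_mc = cross_val_score(model, X, y, cv=mc_cv, scoring="roc_auc")
scores_rep = cross_val_score(model, X, y, cv=rep_cv, scoring="roc_auc")
```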