Solved – Choosing a classification performance metric for model selection, feature selection, and publication

auc, cross-validation, model selection, svm

I have a small, unbalanced data set (70 positive, 30 negative), and I have been playing around with model selection for SVM parameters using BAC (balanced accuracy) and AUC (area under the curve). I used different class-weights for the C parameter in libSVM to offset the unbalanced data following the advice here (Training a decision tree against unbalanced data).

  1. It seems that k-fold cross-validation error is very sensitive to the type of performance measure. It also has error of its own because the training and validation sets are chosen randomly. For example, if I repeat BAC with two different random seeds, I get different errors, and consequently different optimal parameter values. If I average repeated BAC scores, averaging 1000 times gives me different optimal parameter values than averaging 10000 times. Moreover, changing the number of folds gives me different optimal parameter values.

  2. Accuracy metrics for cross-validation may be overly optimistic. Usually anything above 2-fold cross-validation gives me 100% accuracy. Also, the error rate is discretized because of the small sample size. Model selection often gives me the same error rate across all or most parameter values.

  3. When writing a report, how do I know that a classification is 'good' or 'acceptable'? In the field, it seems we don't have a commonly accepted threshold along the lines of a goodness-of-fit statistic or p-value. Since I am adding to the data iteratively, I would like to know when to stop: what is a good N beyond which the model does not improve significantly?

Given the issues described above, it seems that accuracy can't be easily compared between publications, while AUC has been criticized as a poor indicator of performance (see here, or here, for example).

Any advice on how to tackle any of these 3 problems?

Best Answer

It seems that k-fold cross-validation error is very sensitive to the type of performance measure. It also has error of its own because the training and validation sets are chosen randomly.

I think you've discovered the high variance of performance measures that are proportions of case counts, such as $\frac{\text{# correct predictions}}{\text{# test cases}}$. You are trying to estimate, e.g., the probability that your classifier returns a correct answer. From a statistics point of view, each test case is a Bernoulli trial, so the number of correct predictions follows a binomial distribution. You can calculate confidence intervals for binomial proportions and will find that, at small sample sizes, they are very wide. This of course limits your ability to compare models.
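To get a feel for how wide such intervals are at this sample size, here is a minimal sketch (assuming Python with SciPy) that computes an exact Clopper–Pearson interval for, say, 90 correct predictions out of 100 test cases:

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# 90 of 100 test cases classified correctly -> observed accuracy 0.90
lo, hi = clopper_pearson(90, 100)
print(f"observed accuracy: {90 / 100:.2f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
# The interval spans roughly 0.82 to 0.95 - wider than the differences
# you would typically want to resolve between candidate models.
```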

With resampling validation schemes such as cross-validation, you have an additional source of variation: the instability of your models (since you build $k$ surrogate models during each CV run).
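A quick way to see both sources of variation at once is to repeat the cross-validation with differently seeded splits and look at the spread of the resulting estimates. The sketch below uses scikit-learn; the synthetic data set and SVM settings are placeholders standing in for the 70/30 data set in the question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the 70 positive / 30 negative data set in the question
X, y = make_classification(n_samples=100, weights=[0.3, 0.7], random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")

# Repeat 5-fold CV with different random splits; each repetition reshuffles
# the folds and therefore trains 5 different surrogate models.
estimates = []
for seed in range(20):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
    estimates.append(scores.mean())

print(f"BAC estimates over 20 repetitions: "
      f"min={min(estimates):.3f}, max={max(estimates):.3f}, "
      f"spread={max(estimates) - min(estimates):.3f}")
```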

Moreover, changing the number of folds gives me different optimal parameter values.

That is to be expected due to the variance. You may have an additional effect here: libSVM splits the data only once if you use its built-in cross-validation for tuning. Due to the nature of SVMs, if you build the SVM on identical training data and slowly vary the parameters, you'll find that the support vectors (and consequently the accuracy) jump: as long as the SVM parameters are not too different, it will still choose the same support vectors. Only when the parameters change enough will different support vectors suddenly result. So evaluating the SVM parameter grid with exactly the same cross-validation splits may hide variability that you do see between different runs.
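That step-like behaviour is easy to reproduce by evaluating a fine grid of C values on one fixed set of CV splits: the estimated accuracy sits on plateaus and then jumps. A rough sketch (again with scikit-learn and placeholder data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, weights=[0.3, 0.7], random_state=0)

# One fixed set of splits, as libSVM's built-in tuning effectively uses
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fine grid of C values: the estimated accuracy changes in discrete jumps,
# since the chosen support vectors only change once C has moved far enough.
for C in np.logspace(-2, 3, 11):
    clf = SVC(kernel="rbf", C=C, gamma="scale", class_weight="balanced")
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    print(f"C = {C:8.3f}  mean accuracy = {acc:.3f}")
```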

IMHO the basic problem is that you do a grid search, which is an optimization that relies on a reasonably smooth behaviour of your target functional (accuracy or whatever else you use). Due to the high variance of your performance measurements, this assumption is violated. The "jumpy" dependence of the SVM model also violates this assumption.

Accuracy metrics for cross-validation may be overly optimistic. Usually anything above 2-fold cross-validation gives me 100% accuracy. Also, the error rate is discretized because of the small sample size. Model selection often gives me the same error rate across all or most parameter values.

That is to be expected given the general problems of the approach.

However, it is usually possible to choose really extreme parameter values at which the classifier breaks down. IMHO the parameter ranges over which the SVM works well are important information.

In any case you absolutely need an external (double/nested) validation of the performance of the model you choose as 'best'.

I'd probably do a number of runs/repetitions/iterations of an outer cross-validation or an outer out-of-bootstrap validation and give the distribution of:

  • the hyperparameters chosen for the "best" model
  • the performance reported by the (inner) tuning
  • the performance observed in the outer validation

The difference between the last two is an indicator of overfitting (e.g. due to "skimming" the variance).
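For concreteness, here is a minimal nested (double) cross-validation sketch along these lines, written with scikit-learn; the data set, grid, and fold counts are placeholders, not recommendations. For each repetition it records the chosen hyperparameters, the performance reported by the inner tuning, and the performance observed in the outer validation, so the gap between the last two can be inspected:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, weights=[0.3, 0.7], random_state=0)

param_grid = {"C": np.logspace(-1, 2, 4), "gamma": np.logspace(-3, 0, 4)}

for repetition in range(5):
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=repetition)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100 + repetition)

    inner_scores, outer_scores, chosen = [], [], []
    for train_idx, test_idx in outer_cv.split(X, y):
        # Inner loop: tune the hyperparameters on the training part only
        search = GridSearchCV(
            SVC(kernel="rbf", class_weight="balanced"),
            param_grid, cv=inner_cv, scoring="balanced_accuracy",
        )
        search.fit(X[train_idx], y[train_idx])

        chosen.append(search.best_params_)
        inner_scores.append(search.best_score_)                      # tuning estimate
        outer_scores.append(search.score(X[test_idx], y[test_idx]))  # outer estimate

    print(f"repetition {repetition}: "
          f"inner (tuning) BAC = {np.mean(inner_scores):.3f}, "
          f"outer BAC = {np.mean(outer_scores):.3f}")
    print(f"  chosen hyperparameters per outer fold: {chosen}")
```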

When writing a report, how do I know that a classification is 'good' or 'acceptable'? In the field, it seems we don't have a commonly accepted threshold along the lines of a goodness-of-fit statistic or p-value. Since I am adding to the data iteratively, I would like to know when to stop: what is a good N beyond which the model does not improve significantly?

(What are you adding? Cases or variates/features?)

First of all, if you do iterative modeling, you need either to report that, due to your fitting procedure, the performance you give is subject to an optimistic bias and should not be taken at face value, or to validate the final model. For that validation, however, the test data must be independent of all data that ever went into training or into your modeling decisions (so you may not have any such data left).
