Solved – Randomizing Class Labels during classification to assess the feature selection results

classification, feature selection, machine learning, permutation-test, regression-strategies

I have a binary classification problem with thousands of variables and fewer than a hundred data points with class labels. The classes are imbalanced (24 positive, 51 negative samples). I have selected some of the features using a subset selection method. The classification AUC using XGBoost feature selection (10 variables) is approximately 85%. To check that the apparent accuracy is not overstated (biased from overfitting), I ran the following experiment:
I kept the 10 selected variables exactly as they are and only randomly shuffled the class labels, so that the numbers of positive and negative samples stay the same. When I repeat this experiment 1000 times, the highest AUC I get is 75% and the AUC distribution peaks at around 60%.

[Density plot of the AUC values from the 1000 label-permutation runs]
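
For reference, each permutation run looks roughly like the sketch below; `dat`, `selected_vars`, and `cv_auc()` are placeholders for my data frame, the 10 selected variables, and my XGBoost fit-and-score step:

    set.seed(1)
    n_perm   <- 1000
    perm_auc <- numeric(n_perm)

    for (i in seq_len(n_perm)) {
      # sample() without replacement reorders the labels, so the
      # 24/51 positive/negative counts stay unchanged
      y_perm <- sample(dat$ClassLabels)

      # cv_auc() stands in for my XGBoost fit + AUC evaluation
      # on the 10 selected variables with the permuted labels
      perm_auc[i] <- cv_auc(dat[, selected_vars], y_perm)
    }

    plot(density(perm_auc), main = "AUC with permuted class labels")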

So, my questions here are:

  1. When the class labels are flipped (permuted), shouldn't the AUC peak at 50%? Is the result shown in the density plot expected?
  2. I also get AUC values of less than 0.50 with the following R code, which uses the ROCR package:

    library(ROCR)  # prediction() and performance() come from ROCR

    predf   <- prediction(pred, as.factor(testset$ClassLabels))
    auc.tmp <- performance(predf, "auc")
    aucval  <- aucval + as.numeric(auc.tmp@y.values)  # running sum of AUC across runs

I read that an AUC below 0.50 means the classification is "negative", i.e., worse than random. Can anybody explain, in understandable terms, what that means?

Best Answer

"To make sure the results are significant" is not the question to ask. You should ask "is the apparent accuracy overstated (biased from overfitting)?". Your randomization approach exposes the phenomenal amount of bias you would expect given your setup. But you have bigger problems:

  1. Concordance probability ($c$-index; AUROC) is not a classification measure (thank goodness), so your nomenclature needs to be corrected. And it would be better to use a proper accuracy scoring rule, which would be much more sensitive than $c$ (e.g., deviance or the Brier score); a short sketch follows this list.
  2. Your sample size is inadequate for doing even the most trivial thing: estimating the probability of "positive". The minimum sample size needed to estimate an unknown probability to within a $\pm 0.1$ margin of error (with 0.95 confidence) is $n=96$; the calculation is shown after this list. In other words, if you fitted a binary logistic model for the probability of "positive" and included only an intercept, your $n=75$ would not be a sufficient sample size to adequately estimate even that intercept.
  3. To expect a sample size that is inadequate for providing the crudest possible summary of tendencies to be able to tell you which features are important and how you should fashion predictions out of them is a big stretch.
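
For point 1, a minimal sketch of the two scoring rules, assuming `p` holds the predicted probabilities of "positive" and `y` the observed 0/1 labels (both hypothetical names):

    # Brier score: mean squared error of the predicted probabilities (lower is better)
    brier <- mean((p - y)^2)

    # Deviance: -2 * log-likelihood; heavily penalizes confident wrong predictions
    dev <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))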
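The $n = 96$ in point 2 is the usual worst-case ($p = 0.5$) normal-approximation sample size for a $\pm 0.1$ margin of error at 0.95 confidence:

$$ n = \left(\frac{z_{0.975}}{0.1}\right)^2 p(1-p) = \left(\frac{1.96}{0.1}\right)^2 \times 0.5 \times 0.5 \approx 96. $$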

The bootstrap is a better way to study the damage caused by the methods you are using.
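
For example (a minimal sketch, not your exact setup), the `rms` package's `validate()` estimates the optimism in the apparent performance by refitting the model on bootstrap resamples and evaluating each refit on the original sample; `d`, `ClassLabels`, and `x1 + x2 + x3` are hypothetical stand-ins for your data and candidate predictors:

    library(rms)

    # Keep the design matrix and outcome in the fit object so that
    # validate() can refit the model on bootstrap resamples
    fit <- lrm(ClassLabels ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)

    # Optimism-corrected indices (Dxy, Brier score B, calibration slope, ...);
    # the corrected c-index is Dxy / 2 + 0.5
    validate(fit, B = 300)

For the estimate to be honest, every step of your feature selection would have to be repeated inside each bootstrap resample (`validate(fit, bw = TRUE)` does this for backward stepdown, but not for selection done outside the model, such as your XGBoost-based subset selection).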