Solved – How to statistically compare two algorithms across three datasets in feature selection and classification

computational-statistics, genetic algorithms, machine learning, standardization, statistical significance

Problem background: As part of my research, I have written two algorithms that can select a set of features from a data set (gene expression data from cancer patients). These features are then tested to see how well they can classify an unseen sample as either cancer or non-cancer. For each run of the algorithm, a solution (a set of features) is generated and tested on Z unseen samples. Percentage accuracy of the solution is calculated like this: (correct classifications / Z) * 100.
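In code, the per-run accuracy is simply the following (the numbers here are illustrative, not from my actual runs):

    ## Percentage accuracy of one run: correct classifications out of Z unseen samples.
    Z <- 22          # illustrative number of unseen test samples
    correct <- 20    # illustrative number of correct classifications in this run
    accuracy <- correct / Z * 100   # about 90.91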

I have two algorithms: algorithm X & algorithm Y

I have three separate (different cancers) data sets: data set A, data set B & data set C. These data sets are very different from each other. They don't have the same number of samples or same number of measurements (features) per sample.

I have run each algorithm 10 times on each data set. So, algorithm X has 10 results from data set A, 10 from data set B and 10 from data set C. Overall, algorithm X has 30 results.

My problem: I would like to see if algorithm X's combined performance across all three data sets is statistically significantly different from algorithm Y's combined performance.

Is it possible for me to combine results for algorithm X from each data set into a single set of results? This way, I would have 30 standardized results for algorithm X and 30 standardized results for algorithm Y. I can then use the t-test to see if there is a significant difference between the two methods.
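To make this concrete, here is a minimal sketch in R of what I have in mind; the data frame results and its column names (dataset, algorithm, accuracy) are purely illustrative:

    ## Sketch only: z-score the accuracies within each data set, then pool and compare.
    results$z <- ave(results$accuracy, results$dataset,
                     FUN = function(a) (a - mean(a)) / sd(a))
    t.test(z ~ algorithm, data = results)   # two-sample t-test on the pooled z-scores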

Edit – These are Evolutionary Algorithms, so they return a slightly different solution each time they are run. However, if there's a feature in a sample that when present can strongly classify the sample as either being cancer or non-cancer, then that feature will be selected almost every time the algorithm is run.

I get slightly different results for each of the 10 runs due to the following reasons:

  • These algorithms are randomly seeded.
  • I use repeated random sub-sampling validation (10 repeats).
  • The datasets that I use (DNA microarray and Proteomics) are very difficult to work with in the sense that there are many local optima the algorithm can get stuck in.
  • There are lots of inter-feature and inter-subset interactions that I would like to detect.
  • I train 50 chromosomes and they are not restricted to any particular length.  They are free to grow and shrink (although selection pressure guides them towards shorter lengths).  This also introduces some variation to the final result.

Having said that, the algorithm almost always selects a particular subset of features!

Here's a sample of my results (only 4 runs out of 10 for each algorithm are shown here):

Dataset/run     Algorithm X (%) Algorithm Y (%)
A 1             90.91           90.91
A 2             90.91           95.45
A 3             90.91           90.91
A 4             90.91           90.91

B 1             100             100
B 2             100             100
B 3             95.65           100
B 4             95.65           86.96

C 1             90.32           87.10
C 2             70.97           80.65
C 3             96.77           83.87
C 4             80.65           83.87

As you can see, I've put together results for the two algorithms from all three datasets. I could run a Kruskal-Wallis test on this pooled data (a sketch of what I mean follows the list below), but would it be valid? I ask this because:

  • I'm not sure accuracies in different data sets are commensurable. If they are not, then putting them together like I've done would be meaningless and any statistical test done on them would also be meaningless.
  • When you put accuracies together like this, the overall result is susceptible to outliers. One algorithm's excellent performance on one dataset may mask its average performance on another dataset.
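For reference, the pooled rank-based test I have in mind would look roughly like this in R (results is the same illustrative data frame as above; with only two algorithms, Kruskal-Wallis is essentially equivalent to a Wilcoxon rank-sum test):

    ## Rank-based tests on the pooled accuracies, ignoring which data set they came from.
    kruskal.test(accuracy ~ algorithm, data = results)
    wilcox.test(accuracy ~ algorithm, data = results)   # two-sample equivalent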

I cannot use the t-test in this case either, because:

  • Commensurability – the t-test only makes sense when the differences over the data sets are commensurate.
  • The t-test requires that the differences between the two algorithms are normally distributed; there's no way (at least that I'm aware of) to guarantee this condition in my case.
  • The t-test is affected by outliers, which skew the test statistic and decrease the test's power by increasing the estimated standard error.

What do you think? How can I do a comparison between algorithm X & Y in this case?

Best Answer

Unless your algorithms have huge differences in performance and you have huge numbers of test cases, you won't be able to detect differences by just looking at the performance.

However, you can make use of a paired design:

  • run both algorithms on exactly the same train/test split of a data set, and
  • do not aggregate the test results into % correct, but keep them at the single test case level as correct or wrong.

For the comparison, have a look at McNemar's test. The idea behind exploiting a paired design here is that all the cases that both algorithms got right, and those that both got wrong, do not help you distinguish the algorithms. But if one algorithm is better than the other, there should be many cases that the better algorithm got right and the worse one did not, and few that were predicted correctly by the worse method but wrongly by the better one.
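A minimal sketch of McNemar's test in R, assuming you have recorded for the same test cases whether each algorithm classified them correctly (the logical vectors below are purely illustrative placeholders; in practice you would need many more test cases):

    ## Hypothetical per-case outcomes (TRUE = correct) for both algorithms
    ## on the same test split; replace with your real per-case results.
    x_ok <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
    y_ok <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
    tab <- table(X = x_ok, Y = y_ok)  # 2x2 table; off-diagonal cells are the discordant pairs
    mcnemar.test(tab)                 # tests whether the discordant counts are balanced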

Also, Cawley, G. C. and Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, JMLR, 11:2079-2107, 2010, is highly relevant.

Because of the random aspects of your algorithms, you'll also want to run them on the same split of the same data set multiple times. From that you can estimate the variation between runs that are otherwise equal. It may be difficult to judge how different the selected sets of variables are. But if your ultimate goal is predictive performance, then you can also use the variation between the predictions for the same test case across different runs to measure the stability of the resulting models.
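As a sketch of what I mean: run_once() is a hypothetical wrapper around a single randomly seeded run of your algorithm that returns a logical vector of per-test-case correctness for a fixed train/test split (train_set and test_set are likewise placeholders):

    ## Repeat the stochastic algorithm on the identical train/test split.
    n_runs <- 10
    runs <- replicate(n_runs, run_once(train_set, test_set))  # Z x n_runs logical matrix
    per_run_accuracy <- colMeans(runs) * 100  # accuracy of each run
    sd(per_run_accuracy)                      # run-to-run variation on an identical split
    rowMeans(runs)   # per test case: fraction of runs that predicted it correctly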

You'll then also want to check (as indicated above) the variation due to different splits of the data set and put it into relation with the run-to-run variance above.

Fractions (like the % of correctly recognized samples) are usually assumed to be binomially distributed. In certain cases a normal approximation is possible, but the small print for this is hardly ever met in fields with wide data matrices. The consequence is that confidence intervals are huge for small numbers of test cases. In R, binom::binom.confint calculates confidence intervals for the true proportion given the number of tests and the number of successes.
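For instance (the binom package needs to be installed; 20 correct out of 22 corresponds roughly to the 90.91 % runs in your table):

    ## Confidence interval for the true proportion behind a single run's test results.
    library(binom)
    binom.confint(x = 20, n = 22, methods = "wilson")
    ## The Wilson interval is roughly 0.72 to 0.97, so a single accuracy figure
    ## from about 22 test cases tells you very little.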

Finally, my experience with genetic optimization for spectroscopic data (my Diplom thesis, in German) suggests that you should also check the training errors. GAs tend to overfit very fast, arriving at very low training errors. Low training errors are not only overoptimistic; they also mean that the GA cannot differentiate between lots of models that all seem to be equally perfect. (You may have less of a problem with this if the GA internally also randomly subsamples training and internal test sets.)

Papers in English:
