Hypothesis Testing – Comparing Performances of Multiple Machine Learning Models

Tags: hypothesis-testing, machine-learning, model-comparison, multiple-comparisons, references

I have $k$ Machine Learning models, trained on the same training set, and I want to test the hypothesis that their performance on a fresh test set is the same. See EDIT 1 below for what I mean by performance. I'm looking for frequentist answers. Also, I'd prefer an answer that refers to a paper or preprint on this topic (testing the null that $k$ ML models have the same performance on a finite-sample test set). The models are Computer Vision models.

  1. Is there a distribution-free, exact test that allows me to test the above hypothesis?
  2. If there's no such test, is there at least a distribution-free, approximate test?
  3. As a worst-case scenario, I could be content with a parametric, asymptotic test. But it seems to me a terrible option: this is ML, so we know nothing about the models, and we know nothing about the test set data distribution (apart from the fact that it's an iid sample). Yeah, I know "if the size of the test set $n_{test}\to\infty$ we can assume that…". Point is, $n_{test}$ isn't going anywhere. Acquiring and labeling the test set was very expensive, and I'm not getting new labeled data anytime soon. Thus, I'd rather get an answer that doesn't rely on the size of the test set being very large.

EDIT 1: the models are Computer Vision models, meaning that the test set $S=\{(X_i,Y_i)\}_{i=1}^{n_{test}}$ is made of pairs (image, structured label). The problem is that, for each image $X_i$, the corresponding label $Y_i$ is not just a categorical variable (we're not doing image classification here), but a list of $n_i$ categorical variables (the classes of each of the $n_i$ objects present in image $X_i$) and $4n_i$ real numbers (the coordinates of the bounding boxes of each object: see here for details). Thus, it's not so simple to say whether, for image $X_i$, the prediction $f(X_i)$ of the model is right (1) or wrong (0). I naively thought of solving this using some aggregate metric, since for each of the $k$ models we can compute a loss over the whole test set, which is a nonnegative number. However, as correctly noted by Jacques Wainer, this approach doesn't work for hypothesis testing. Let's consider then a different setting, where each of the $k$ models (each of the $k$ treatments, in statistical parlance) returns a real number in $[0,1]$ for each of the $n_{test}$ elements of the test set. Thus, for a single metric, our dataset is a matrix $M$ of $k\times n_{test}$ numbers in $[0,1]$. See EDIT 3 for the multiple metrics case.
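To make this layout concrete, here is a minimal sketch (mine, not part of the original question) of how the matrix $M$ could be assembled, assuming a hypothetical helper `per_image_score(model, image, label)` that returns a number in $[0,1]$ for one model on one test image:

```python
import numpy as np

def per_image_score(model, image, label):
    """Hypothetical per-image metric in [0, 1] for one model on one
    (image, label) pair, e.g. the mean IoU of the matched detections.
    Its exact definition is not specified in the question."""
    raise NotImplementedError

def build_score_matrix(models, test_set):
    """Assemble the k x n_test matrix M of per-image scores in [0, 1]."""
    M = np.empty((len(models), len(test_set)))
    for i, model in enumerate(models):
        for j, (image, label) in enumerate(test_set):
            M[i, j] = per_image_score(model, image, label)
    return M
```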

EDIT 2: there's something that I think was overlooked in comments & answers. I think the $k\times n_{test}$ entries of $M$ are not iid. All models are trained on the same training set, and are tested on the same images, thus I would expect the model results for a given image $f_1(X_i),\dots,f_k(X_i)$ to be highly correlated.

EDIT 3 (hopefully the final one): let's consider now more than one metric for each image (the multiple hypotheses case I mentioned in earlier versions of this question). For $d$ metrics, I'd get $d$ matrices $M_1,\dots, M_d$ of size $k\times {n_{test}}$. As duly noted by Christian Hennig, different metrics for the same image are likely to be correlated (see also EDIT 2 above), and things soon get very complicated. So let's forget about multiple hypothesis testing. I'll accept Christian Hennig's suggestion:

decide in advance to use only one test as "major test of interest", interpret that one, and give only a rough "exploratory" interpretation of all the other tests in relation to the major test

Best Answer

Statistical tests, any of them, work on sets of data, not on a single data point (for each of the $k$ ML models). Thus, if you compute any measure on the whole test set (let us call these aggregating measures), be it precision, recall, or IoU, you will get a single number for each of the ML algorithms, and there is no statistical test that can receive a single number (for each treatment) and compute a p-value.

So, you cannot compute a single measure for each algorithm on the test set. The only set of data that you have is whether, for each data point in the test set, a particular algorithm got the prediction for that data point right or wrong. Thus, for each algorithm you have a set of binary data (0 or 1, correct or incorrect, and so on), one value for each data point in the test set.

These measures are paired, or, in the parlance of statistical tests, blocked: for the same data point in the test set you have the corresponding binary outcome (right or wrong) for each algorithm.

Therefore, you want a test for binary variables (right or wrong), on multiple treatments, and blocked.

The only one I know of is Cochran's Q test. The test is distribution free (but I am not sure whether it is exact).
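For illustration (this sketch is mine, not part of the original answer), here is a minimal way to run Cochran's Q test on toy data using statsmodels' cochrans_q; the real input would be your $n_{test}\times k$ table of right/wrong outcomes:

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

# Toy binary outcomes: rows = test-set images (blocks), columns = models.
# 1 = the model got that image right, 0 = wrong.  Replace with your own data.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(200, 3))   # n_test = 200 images, k = 3 models

# Cochran's Q: null hypothesis is that all k models have the same
# probability of getting an image right.
result = cochrans_q(correct)                  # expects an (n_subjects, k) 0/1 array
print(result.statistic, result.pvalue)        # Q statistic and p-value
```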

If the p-value is high enough, you can conclude that all algorithms are equally correct, and thus (I believe) any summary measure, such as precision, recall, or accuracy, will be "statistically equivalent" (there is no such thing, but given that the Q test tells you there is no statistically significant difference among the outputs of all algorithms, I believe one can conclude that there is "no difference" for the aggregating measures).

Answering the EDITs:

EDIT 1: if the output for each data point is a number (for example between 0 and 1, as you suggested, but this works for any number), then you are in luck. What you have is a set of numbers rather than 0/1 values, and there are many more statistical tests for numeric data.

The usual test in machine learning is the one proposed by Demšar (Statistical Comparisons of Classifiers over Multiple Data Sets, https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf). Remember that, in your case, the multiple data sets of the paper correspond to the multiple data points in your test set. Demšar proposed a Friedman test followed by the Nemenyi post-hoc test to determine which algorithms are significantly different from the others. Since you are hoping that all algorithms are equivalent, if you are lucky the Friedman test will result in a p-value high enough (be mindful of the caveats listed in the comments on my answer). There are implementations of these tests in both Python and R (at least).
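As a concrete illustration (mine, not the answerer's), here is a minimal Python sketch of the Friedman test via SciPy followed by the Nemenyi post-hoc test via the third-party scikit-posthocs package, assuming the $k\times n_{test}$ matrix $M$ from EDIT 1:

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp   # third-party package: pip install scikit-posthocs

# M: the k x n_test matrix of per-image scores in [0, 1] from EDIT 1
# (one row per model).  Toy data here; replace with your own matrix.
rng = np.random.default_rng(0)
M = rng.uniform(size=(3, 200))   # k = 3 models, n_test = 200 images

# Friedman test: the images are the blocks, the models are the treatments.
stat, p = friedmanchisquare(*M)   # unpacks the k rows as k paired samples
print(f"Friedman chi-square = {stat:.3f}, p-value = {p:.4f}")

# Nemenyi post-hoc test (only worth interpreting if the Friedman test rejects).
# scikit-posthocs expects a (n_blocks, k_treatments) array, i.e. M transposed.
print(sp.posthoc_nemenyi_friedman(M.T))
```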

García and Herrera (An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons, https://www.jmlr.org/papers/volume9/garcia08a/garcia08a.pdf) proposed other post-hoc tests (beyond the Nemenyi test).

EDIT 2: The data used in the tests are i.i.d. The fact that the algorithms are trained on the same set is not a problem; the conclusion will take that into consideration. Your conclusion is that the algorithms, when trained on that same training set and tested on that same test set, are or are not statistically significantly different. Your conclusions will hold for that particular training set and test set.

EDIT 3: I don't know about multiple aggregating metrics. But first, you are right that they are likely not i.i.d. Second, there will be very little data for the statistical test: say you use 5 or 10 aggregating metrics, that will leave you with only 5 or 10 data points for each algorithm. With so few data, the statistical tests will likely not find that the differences are significant!
