Statistical tests, any of them, work on sets of data, not on a single number (for each of the $k$ ML models). Thus, if you compute any measure on the whole test set (let us call these *aggregating measures*), be it precision, recall, or IoU, you will get a single number for each ML algorithm, and there is no statistical test that receives a single number (for each treatment) and computes a p-value.

So you cannot compute a single measure per algorithm on the test set. The only set of data that you have is whether, for each data point in the test set, a particular algorithm got the prediction on that point right or wrong. Thus, for each algorithm you have a set of binary outcomes (0 or 1, correct or incorrect, and so on), one for each data point in the test set.

These measures are paired, or in the parlance of statistical tests, *blocked*: for the same data point in the test set you have the corresponding binary outcome (right or wrong) for each algorithm.

Therefore, you want a test for binary variables (right or wrong), with multiple treatments, and blocked.

The only one I know of is Cochran's Q test. The test is distribution-free (though I am not sure whether it is exact).
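As a sketch (under the setup above, with a hypothetical matrix of correctness flags), Cochran's Q can be computed directly from the $n \times k$ matrix of 0/1 outcomes; under the null it is compared against a $\chi^2$ distribution with $k-1$ degrees of freedom. (statsmodels also ships an implementation, `statsmodels.stats.contingency_tables.cochrans_q`.)

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(outcomes):
    """Cochran's Q test for k paired (blocked) binary samples.

    outcomes: (n_datapoints, k_algorithms) array of 0/1 correctness flags.
    Returns (Q, p_value); under H0, Q ~ chi2 with k - 1 degrees of freedom.
    """
    x = np.asarray(outcomes)
    n, k = x.shape
    col = x.sum(axis=0)      # number of successes per algorithm
    row = x.sum(axis=1)      # number of successes per data point (block)
    total = x.sum()
    q = (k - 1) * (k * (col ** 2).sum() - total ** 2) \
        / (k * total - (row ** 2).sum())
    return q, chi2.sf(q, k - 1)

# Toy example: 6 test points, 3 algorithms (1 = correct prediction)
x = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 1, 0],
              [1, 1, 1]])
q, p = cochrans_q(x)   # Q = 6.0, p ≈ 0.0498
```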

If the p-value is high enough, you can conclude that all algorithms are equally correct, and thus (I believe) any summary measure, such as precision, recall, or accuracy, will be "statistically equivalent". There is no such thing formally, but given that the Q test tells you there is no statistically significant difference among the outputs of all the algorithms, I believe one can conclude that there is "no difference" in the aggregating measures either.

Answering the EDITs:

EDIT 1: if the output for each data point is a number (for example between 0 and 1, as you suggested, but this works for any numeric output), then you are in luck. What you have is a set of numbers instead of 0/1 outcomes, and there are many more statistical tests for non-binary numeric data.

The usual procedure in machine learning is the one proposed by Demsar ("Statistical Comparisons of Classifiers over Multiple Data Sets", https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf). Remember that in your case the multiple data sets of the paper correspond to the multiple data points in your test set. Demsar proposed a Friedman test followed by the Nemenyi post-hoc test to determine which algorithms are significantly different from the others. Since you are hoping that all algorithms are equivalent, if you are lucky the Friedman test will return a high enough p-value (but be mindful of the caveats listed in the comments on my answer). There are implementations of these tests in both Python and R (at least).
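A minimal sketch of the Friedman step with SciPy, treating each test point as a block and hypothetical per-point scores (e.g. a per-image IoU in $[0,1]$) as the measurements; the scores below are simulated for illustration. The Nemenyi post-hoc is available in the third-party `scikit-posthocs` package (`posthoc_nemenyi_friedman`), not shown here.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_points = 30                    # size of the (hypothetical) test set

# Simulated per-data-point scores for 3 algorithms: a shared per-image
# "difficulty" plus algorithm-specific noise, clipped to [0, 1].
base = rng.uniform(0.4, 0.9, n_points)
algo_a = np.clip(base + rng.normal(0.00, 0.05, n_points), 0, 1)
algo_b = np.clip(base + rng.normal(0.02, 0.05, n_points), 0, 1)
algo_c = np.clip(base + rng.normal(-0.02, 0.05, n_points), 0, 1)

# Friedman test: blocks = data points, treatments = algorithms.
stat, p = friedmanchisquare(algo_a, algo_b, algo_c)
```

A high p-value here means the test found no significant difference in the per-point rankings of the algorithms; only if the p-value is low would you proceed to a post-hoc test to see which pairs differ.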

García and Herrera ("An Extension on 'Statistical Comparisons of Classifiers over Multiple Data Sets' for all Pairwise Comparisons", https://www.jmlr.org/papers/volume9/garcia08a/garcia08a.pdf) proposed other post-hoc tests (beyond the Nemenyi test).

EDIT 2: The data used in the tests are i.i.d. The fact that the algorithms are trained on the same set is not a problem; the conclusion takes that into consideration. Your conclusion is that the algorithms, when trained on **that** same training set and tested on **that** same test set, are or are not statistically significantly different. Your conclusions hold for **that** particular pair of training and test sets.

EDIT 3: I don't know about comparing multiple aggregating metrics. But first, you are right that they are likely not i.i.d. Second, there would be very little data for the statistical test: say you use 5 or 10 aggregating metrics; that leaves you with only 5 or 10 data points per algorithm. With so few data, the tests will likely not find the differences significant!

## Best Answer

Mauchly's test checks whether a given covariance matrix is proportional to a reference matrix (the identity or another) and is available through

`mauchly.test()`

under R. It is mostly used in repeated-measures designs, to test (1) whether the dependent-variable variance-covariance matrices are equal or homogeneous, and (2) whether the correlations between the levels of the within-subjects variable are comparable; altogether, this is known as the *sphericity* assumption. Box's M statistic is used (in MANOVA or LDA) to test for homogeneity of covariance matrices, but as it is very sensitive to departures from normality it will often reject the null (R code is not available in the standard packages).
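There is no direct SciPy equivalent of R's `mauchly.test()`, but the statistic is simple enough to sketch from its textbook definition; this is a hand-rolled approximation using the standard chi-square correction factor, not a drop-in replacement, and the input data below are simulated.

```python
import numpy as np
from scipy.linalg import helmert
from scipy.stats import chi2

def mauchly_sphericity(data):
    """Mauchly's test of sphericity for an (n_subjects, k_levels) matrix.

    Transforms the k repeated measures with k-1 orthonormal contrasts,
    then tests whether the covariance matrix of the contrasts is
    proportional to the identity. Returns (W, p) using the usual
    chi-square approximation to the distribution of W.
    """
    x = np.asarray(data, dtype=float)
    n, k = x.shape
    contrasts = helmert(k, full=False)      # (k-1, k) orthonormal rows
    y = x @ contrasts.T                     # contrast scores, (n, k-1)
    s = np.cov(y, rowvar=False)
    p_dim = k - 1
    # W = det(S) / (tr(S)/p)^p, in (0, 1]; W = 1 under perfect sphericity.
    w = np.linalg.det(s) / (np.trace(s) / p_dim) ** p_dim
    d = 1 - (2 * p_dim ** 2 + p_dim + 2) / (6 * p_dim * (n - 1))
    stat = -(n - 1) * d * np.log(w)
    df = p_dim * (p_dim + 1) // 2 - 1
    return w, chi2.sf(stat, df)

rng = np.random.default_rng(1)
# 20 subjects, 4 within-subject levels, i.i.d. noise (spherical by design)
w, p = mauchly_sphericity(rng.normal(size=(20, 4)))
```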

Covariance structure models, as found in Structural Equation Modeling, are also an option for more complex settings (although in multigroup analysis, testing for the equality of covariances makes little sense if the variances are not equal), but I have no references to offer.

I guess any textbook on multivariate data analysis would have additional details on these procedures. I also found this article for the case where the normality assumption is not met: