Solved – Measuring statistical significance when comparing machine learning algorithms

machine learning, statistical significance

Let us consider a comparison of two machine learning algorithms (A and B) on some dataset. The results (root mean squared error) of both algorithms depend on a randomly generated initial approximation (the initial parameters).

Questions:

  1. When I use the same initial parameters for both algorithms, A "usually" slightly outperforms B. How many different experiments (with different initial parameters) do I have to perform to be "sure" that A is better than B?
  2. How do I measure the significance of my results? (To what extent am I "sure"?)

Relevant links are welcome!

PS. I've seen papers in which the authors use a t-test and p-values, but I'm not sure whether it is OK to use them in such a situation.

UPDATE.
The problem is that A (almost) always outperforms B if the initial parameters and the learning/validation/testing sets are the same; but this does not necessarily hold if they differ.

I see the following approaches here:

  • split the data into disjoint sets D_1, D_2, …; generate parameters params_1; compare A(params_1; D_2, D_3, …) and B(params_1; D_2, D_3, …) on D_1;
    then generate params_2 and compare A(params_2; D_1, D_3, …) and B(params_2; D_1, D_3, …) on D_2, and so on.
    Record how often A outperforms B (this variant is sketched in code below).

  • split the data into disjoint sets D_1, D_2, …; generate parameters params_1a and params_1b; compare A(params_1a; D_2, D_3, …) and B(params_1b; D_2, D_3, …) on D_1, and so on.
    Record how often A outperforms B.

  • first, do cross-validation for A. Then, independently, for B. Compare results.

Which approach is better? How do I assess the significance of the result in that case?
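A minimal sketch of the first approach, assuming Python with numpy and scikit-learn; the two MLPRegressor configurations are only placeholders for A and B, and random_state stands in for the shared random initial parameters:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def rmse(model, X_tr, y_tr, X_te, y_te):
    """Fit the model on the training folds and return RMSE on the held-out fold."""
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5

def paired_comparison(X, y, n_splits=5, n_inits=5, seed=0):
    """First approach from the question: A and B share the same random
    initialisation and the same train/test split; the held-out fold D_k rotates.
    Returns the per-run RMSE differences (A - B) and the fraction of runs
    in which A wins (lower RMSE)."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_inits):
        init_seed = int(rng.integers(1_000_000))   # params_k, shared by A and B
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=init_seed)
        for tr, te in kf.split(X):                 # the test fold plays the role of D_k
            # Placeholder algorithms: two differently sized networks stand in
            # for A and B; substitute your own models here.
            a = rmse(MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000,
                                  random_state=init_seed),
                     X[tr], y[tr], X[te], y[te])
            b = rmse(MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=init_seed),
                     X[tr], y[tr], X[te], y[te])
            diffs.append(a - b)
    diffs = np.asarray(diffs)
    return diffs, float(np.mean(diffs < 0))        # A "wins" when its RMSE is lower
```

The returned win fraction is exactly the "how often A outperforms B" count from the first two approaches, and the per-run differences are what a significance test would be applied to.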

Best Answer

  1. You have two biases to remove here -- the selection of the initial parameter set and the selection of the train/test data. So I don't think it is good to compare the algorithms on just one shared initial parameter set; I would run the evaluation over a few different initial sets for each algorithm to get a more general approximation. The next step is something you are probably doing already, namely some kind of cross-validation.
  2. A t-test is a way to go (I assume that you are getting this RMSE as a mean from cross-validation [and from evaluation over a few different starting parameter sets, supposing you decided to use my first suggestion], so you can also calculate the standard deviation); a fancier method is the Mann-Whitney-Wilcoxon test (see the sketch below).
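To make that concrete, here is a minimal SciPy sketch of both tests applied to the per-run RMSE scores; the numbers below are made-up placeholders, to be replaced by the scores collected from your cross-validation / initialisation runs:

```python
import numpy as np
from scipy import stats

# Per-run RMSE scores for A and B (one value per fold / initial parameter set).
# These numbers are placeholders only; substitute your own results.
rmse_A = np.array([0.87, 0.90, 0.88, 0.92, 0.89, 0.91, 0.88, 0.90, 0.93, 0.87])
rmse_B = np.array([0.91, 0.94, 0.90, 0.95, 0.92, 0.96, 0.91, 0.93, 0.94, 0.92])

# Two-sample t-test on the mean RMSE of the two algorithms.
t_stat, p_t = stats.ttest_ind(rmse_A, rmse_B)

# Mann-Whitney-Wilcoxon test: the distribution-free alternative mentioned above.
u_stat, p_u = stats.mannwhitneyu(rmse_A, rmse_B, alternative="two-sided")

print(f"t-test:        t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Mann-Whitney:  U = {u_stat:.3f}, p = {p_u:.4f}")
```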

The Wikipedia article on cross-validation is quite nice and has some references worth reading.

UPDATE AFTER UPDATE: I still think that doing a paired test (Dikran's way) looks suspicious.
