Solved – Permutation-based p-value to evaluate a classifier

classificationmachine learningp-valuepermutation-test

I read an article entitled "Permutation Tests for Studying Classifier Performance" which explains how to use "permutation-based p-value" to test the performance of a classifier.
I can not understand how to calculate the p-value. The article reports this formula:

 p = (|{D′ ∈ D: e(f,D′) ≤ e(f,D)}|+1)/(k+1)
 (DEFINITION 1)

Where e(f,D) is some error of the classifier f on the dataset D or D' (D' is the dataset obtained with permutation).

This is the article:
http://jmlr.org/papers/volume11/ojala10a/ojala10a.pdf

But I do not understand the formula. Why the error with D' is less than the error with D? and can I calculate the absolute value of an equation with "<=" ? And what is k?

The article says:

The empirical p-value of Definition 1 represents the fraction of randomized samples where
the classifier behaved better in the random data than in the original data. Intuitively, it measures how likely the observed accuracy would be obtained by chance, only because the classifier identified in the training phase a pattern that happened to be random. Therefore, if the p-value is small enough usually under a certain threshold, for example,
α = 0.05 we can say that the value of the error in the original data is indeed significantly small and in consequence, that the classifier is significant
under the given null hypothesis, that is, the null hypothesis is rejected.

Best Answer

The formula uses the default notation for sets.

So $\{D′ ∈ \hat{D}: e(f,D′) ≤ e(f,D)\}$ is read as: A set containing all permutations $D'$ of $D$ in $\hat{D}$ restricted to those where $e(f,D′) ≤ e(f,D)$.

The pipes around the set definition ("$|$") do note the cardinality / size of the set.

The textual part above the formula contains the definition of k. $\hat{D}$ is the set of all created permutations and k is the size / cardinality of this set, so $k=|\hat{D}|$

Example: Suppose you create 4 permutations $D'_1,D'_2,D'_3,D'_4$ of a dataset $D$ and calculate $e$ for every permutation. Let

  • $\hat{D}=\{D'_1,D'_2,D'_3,D'_4\}$
  • $e(f,D)=0.1$
  • $e(f,D'_1)=e(f,D'_2)=e(f,D'_3)=0.2$
  • $e(f,D'_4)=0.05$

=> p = $\frac{1+1}{4+1}$

Related Question