Solved – Cross Validation and Confidence Interval of the True Error

classification, cross-validation, machine learning

I'm interested in the relation between Cross Validation and the True Error Estimation of a Classifier (Chapter 5 – Machine Learning – Mitchell).

Suppose we have 150 examples and I decide to run 100 repetitions of 5-fold Cross Validation to understand the behavior of my classifier. At this point I have $100 \times 5$ error rates, and I can use their mean and standard deviation to describe the variability of the model's performance:

$$\text{mean(Error Rate)} \pm z \times \text{std(Error Rate)}$$
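As a rough sketch of that procedure, assuming scikit-learn, a placeholder 150-example dataset, and a logistic regression classifier chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data set of 150 examples (stand-in for the real data).
X, y = make_classification(n_samples=150, random_state=0)

# 100 repetitions of 5-fold CV -> 100 x 5 = 500 per-fold scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
error_rates = 1.0 - scores  # per-fold error rates

# Normal-approximation interval built from the spread of the fold error rates.
z = 1.96  # roughly a 95% level
mean_err = error_rates.mean()
std_err = error_rates.std(ddof=1)
print(f"{mean_err:.3f} +/- {z * std_err:.3f}")
```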

I could estimate the Confidence Interval of the True Error (the error I would obtain on unseen data) using the average Error Rate:

$$\text{mean(Error Rate)} \pm z \times \sqrt{ \frac{\text{mean(Error Rate)} \times (1 - \text{mean(Error Rate)})}{N} }$$
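For concreteness, here is a small sketch of that binomial-style (Wald) interval, evaluated with both candidate values of $N$; the mean error rate below is just a placeholder, not a real result:

```python
import numpy as np

def binomial_ci(p_hat, n, z=1.96):
    """Wald interval for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * np.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

p_hat = 0.12  # placeholder for mean(Error Rate)
print(binomial_ci(p_hat, n=150))  # N = all 150 examples
print(binomial_ci(p_hat, n=30))   # N = size of one test fold
```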

Two questions:

  1. Is this approach correct?
  2. Is it correct to set $N=150$ in the second equation, or should I use the number of examples used as the Test Set in each fold of CV ($N = 30$)?

Best Answer

Background

Consider the following definitions:

  • A training set $D = \{z_1, \dots, z_n\}$ with each $z_i \in \mathcal{Z}$ independently sampled from an unknown distribution $P$.
  • An algorithm $A$ which maps a data set to a function, $A : \mathcal{Z}^n \rightarrow \mathcal{F}$
  • A loss function $L$, e.g. a quadratic loss $L(f,(x,y)) = (f(x) - y)^2$ or a misclassification $\{0,1\}$ loss $L(f,(x,y)) = 1_{f(x) \neq y}$ (see the sketch after this list)
  • Let $f = A(D)$ be the function returned by algorithm $A$ on the training set $D$
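As a minimal illustration of the two example losses mentioned above (plain Python, independent of any particular algorithm $A$):

```python
# Quadratic loss: L(f, (x, y)) = (f(x) - y)^2
def quadratic_loss(f, x, y):
    return (f(x) - y) ** 2

# Misclassification (0/1) loss: L(f, (x, y)) = 1 if f(x) != y, else 0
def zero_one_loss(f, x, y):
    return float(f(x) != y)
```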

We can define two measures of importance:

  1. Prediction error: the expected loss on future test examples $$ PE(D) = E[L(f,z)]$$ where the expectation is taken with respect to $z$ sampled from $P$.

  2. Expected performance error: a more general measure, namely the expected loss over training sets of size $n$ sampled from $P$ $$ EPE(n) = E[L(A(D),z)]$$ where the expectation is taken with respect to both $D$, a training set of $n$ independent samples from $P$, and $z$, independently sampled from $P$ as well.
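To make the distinction concrete, here is a small simulation sketch under an invented toy distribution $P$ (the distribution, sample sizes, and classifier are all assumptions made for illustration): $PE(D)$ is the error of the function learned from one fixed training set, while $EPE(n)$ averages that error over many training sets of size $n$ drawn from $P$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n):
    """Toy distribution P: one feature, label is 1 when x plus noise is positive."""
    x = rng.normal(size=(n, 1))
    y = (x[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return x, y

def prediction_error(model, n_test=20_000):
    """Monte Carlo estimate of PE: expected 0/1 loss of a *fixed* fitted model under P."""
    x, y = sample(n_test)
    return np.mean(model.predict(x) != y)

n = 150
# PE(D): error of the function A(D) learned from one particular training set D.
X, y = sample(n)
pe = prediction_error(LogisticRegression().fit(X, y))

# EPE(n): average of PE over many training sets of size n drawn from P.
epe = np.mean([prediction_error(LogisticRegression().fit(*sample(n)))
               for _ in range(200)])
print(pe, epe)
```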


Cross validation estimator

In practice the data set $D$ is chunked into $K$ disjoint subsets of equal size $m = n / K$. Let us write $T_k$ for the $k$-th such block and $D_k$ for the training set obtained by removing the elements of $T_k$ from $D$.

The cross validation estimator is defined as the average of the errors on test block $T_k$ obtained when training the algorithm on $D_k$ $$ CV(D) = \frac{1}{K} \sum_{k=1}^K \frac{1}{m} \sum_{z_i \in T_k} L(A(D_k), z_i)$$
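A from-scratch sketch of this estimator for a 0/1 loss, assuming a scikit-learn-style classifier (the helper name `cv_estimator` is mine, introduced only for illustration):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def cv_estimator(algorithm, X, y, K=5, random_state=0):
    """CV(D): average over the K blocks of the mean 0/1 loss on T_k,
    where the algorithm is trained on D_k = D minus T_k."""
    fold_errors = []
    splitter = KFold(n_splits=K, shuffle=True, random_state=random_state)
    for train_idx, test_idx in splitter.split(X):
        f_k = clone(algorithm).fit(X[train_idx], y[train_idx])  # f_k = A(D_k)
        fold_errors.append(np.mean(f_k.predict(X[test_idx]) != y[test_idx]))
    return np.mean(fold_errors)
```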

Once you have this Cross Validation estimator $CV(D)$, you can construct bootstrap confidence intervals around it in the same way as you would for any other estimator: resample your dataset with replacement, compute the estimator on the resample, and repeat many times.
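A percentile-bootstrap sketch along those lines, reusing the `cv_estimator` helper from the previous snippet (this is only an illustration of the mechanics, not a claim about its coverage properties):

```python
import numpy as np

def bootstrap_ci(algorithm, X, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the cross-validation estimator CV(D)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample D with replacement
        stats.append(cv_estimator(algorithm, X[idx], y[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

For example, `bootstrap_ci(LogisticRegression(max_iter=1000), X, y)` would return an approximate 95% interval for $CV(D)$ on the placeholder data from earlier.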

The difficulty lies in understanding what you are actually computing and how close it is to the truth. Here are a few questions:

  • Is the mean of the $CV$ an estimator of the $PE$ or the $EPE$?
  • What about the variance of $CV$? Does it inform us about the uncertainty of the $PE$ or the $EPE$?
  • Under what conditions is the difference $|CV(D) - PE(D)|$ bounded?
  • And lastly, how does bootstrapping affect all of the above?

This is an active topic of research, and there are different views, conclusions, and proofs to take into account. The paper linked below is a good place to start if you want to go into more depth.


Sources and further reading