You have indeed correctly described the way to work with cross-validation. In fact, you are 'lucky' to have a reasonable validation set at the end, because often cross-validation is used to optimize a model, but no "real" validation is done afterwards.
As @Simon Stelling said in his comment, cross-validation will lead to lower estimated errors (which makes sense, because you are constantly reusing the data). Fortunately this holds for all models, so, barring catastrophe (i.e. the errors are only reduced slightly for a "bad" model and more for the "good" model), selecting the model that performs best on a cross-validated criterion will typically also be the best "for real".
A method that is sometimes used to correct somewhat for these lower errors, especially if you are looking for parsimonious models, is to select the smallest model/simplest method for which the cross-validated error is within one SD of the (cross-validated) optimum. Like cross-validation itself, this is a heuristic, so it should be used with some care (if this is an option: make a plot of your errors against your tuning parameters; this will give you some idea of whether the results are acceptable).
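If it helps, here is a minimal sketch of that heuristic in Python with scikit-learn (an assumption on my part; `Lasso`, the `alpha` grid, and the synthetic data are only placeholders standing in for "model complexity" in your setting):

```python
# Sketch of the "smallest model within one SD of the CV optimum" heuristic.
# Assumes scikit-learn; Lasso's alpha plays the role of the tuning parameter
# (larger alpha = simpler/sparser model).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=30, noise=10.0, random_state=0)

alphas = np.logspace(-3, 1, 20)
grid = GridSearchCV(Lasso(max_iter=10000), {"alpha": alphas},
                    scoring="neg_mean_squared_error", cv=10)
grid.fit(X, y)

mean_err = -grid.cv_results_["mean_test_score"]   # CV error for each alpha
sd_err = grid.cv_results_["std_test_score"]       # SD over the 10 folds
# (the classical "one standard error" rule would divide this SD by sqrt(10))

best = np.argmin(mean_err)
within_one_sd = np.where(mean_err <= mean_err[best] + sd_err[best])[0]
simplest = within_one_sd[np.argmax(alphas[within_one_sd])]   # simplest acceptable model

print("alpha minimizing CV error:", alphas[best])
print("alpha chosen by the rule :", alphas[simplest])

# plotting the CV error against the tuning parameter, as suggested above:
# import matplotlib.pyplot as plt
# plt.errorbar(alphas, mean_err, yerr=sd_err); plt.xscale("log"); plt.show()
```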
Given the downward bias of the errors, it is important not to publish the errors or other performance measures from the cross-validation without mentioning that they come from cross-validation (although, truth be told, I have seen too many publications that don't mention that the performance measure was obtained by checking the performance on the original dataset either --- so mentioning cross-validation actually makes your results worth more). For you, this will not be an issue, since you have a validation set.
A final warning: if your model fitting results in some close competitors, it is a good idea to look at their performance on your validation set afterwards, but do not base your final model selection on that: at best you can use this to soothe your conscience, but your "final" model must have been picked before you ever look at the validation set.
With respect to your second question: I believe Simon has given you all the answers you need in his comment, but to complete the picture: as so often, it is the bias-variance trade-off that comes into play. If you know that, on average, you will reach the correct result (unbiasedness), the price is typically that each of your individual calculations may lie pretty far from it (high variance). Unbiasedness used to be the nec plus ultra; nowadays a (small) bias is at times accepted (so you no longer know that the average of your calculations equals the correct result) if it results in lower variance. Experience has shown that 10-fold cross-validation strikes an acceptable balance. For you, the bias would only be an issue for your model optimization, since you can afterwards estimate the criterion (unbiasedly) on the validation set. As such, there is little reason not to use cross-validation.
Cross validation with k folds means you have to split your data set into k disjoint groups. In your case, for 10 folds, you split your data set into 10 disjoint groups of 400 samples each ($G_i$ with $i$ from 1 to 10). Usually the groups should have roughly the same size.
Now do the following:
- Train your classifier on $Train_1 = G_2 \cup G_3 \cup \dots \cup G_{10}$ and test it on $Test_1 = G_1$. Save the test results for later use.
- Train your classifier on $Train_2 = G_1 \cup G_3 \cup \dots \cup G_{10}$ and test it on $Test_2 = G_2$, and save the results for later use.
- Repeat for the remaining 8 folds and collect the results (a sketch of this loop follows the list).
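Here is a minimal sketch of that loop in Python with scikit-learn (an assumption on my part; the classifier and the synthetic 20-class data set are only placeholders for your problem):

```python
# Minimal sketch of the 10-fold procedure above, assuming scikit-learn;
# the classifier and the synthetic 20-class data set are only placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=20, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)   # the groups G_1, ..., G_10
predictions = np.empty_like(y)

for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])               # train on the union of the other 9 groups
    predictions[test_idx] = clf.predict(X[test_idx])  # test on the held-out group, save results

# every instance has been classified exactly once, so the whole set can be scored
print("10-fold CV accuracy:", accuracy_score(y, predictions))
```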
Now you know, for each instance of your data set, how it was classified, since the union of all $Test_i$ is the original data set (each group $G_i$ is tested exactly once). You can then measure the errors however you like.
Now there are a couple of things to which I believe you have to pay some attention. You said you have 20 target classes and 4000 samples. I do not know your specific problem, but that does not seem like plenty of data. So I believe it is better to do multiple cross validations and average the results; this decreases the chance of getting overly biased results.
Another thing to pay attention to is how you build your folds. You might use simple random sampling, but I believe it is better to use a stratified random procedure, which increases the chances of getting a usable CV estimate.
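A rough sketch of both suggestions combined (repeated runs and stratified folds), again assuming scikit-learn; the classifier and data are placeholders:

```python
# Sketch: repeated, stratified 10-fold CV, averaging over the repetitions.
# Assumes scikit-learn; the classifier and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=20, random_state=0)

# stratification keeps the 20 class proportions roughly equal in every fold;
# repeating the whole 10-fold procedure 5 times and averaging stabilises the estimate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("mean accuracy over 5 x 10 folds:", scores.mean())
print("spread across individual folds :", scores.std())
```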
You might also consider bootstrap testing if you do not have enough instances for a 10-fold cross validation with stratified sampling.
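If you go the bootstrap route, one common variant is to train on bootstrap resamples and evaluate on the out-of-bag instances; a rough sketch under the same assumptions (the 200 resamples and the small data set are arbitrary placeholder choices):

```python
# Sketch of out-of-bag bootstrap error estimation as an alternative to k-fold CV
# when data are scarce. Assumes scikit-learn; the classifier, the data, and the
# number of resamples (200) are arbitrary placeholder choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=20, random_state=0)
rng = np.random.default_rng(0)

oob_scores = []
for _ in range(200):
    boot = rng.integers(0, len(y), size=len(y))    # draw a bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(len(y)), boot)    # instances never drawn = out-of-bag test set
    clf = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    oob_scores.append(accuracy_score(y[oob], clf.predict(X[oob])))

print("out-of-bag bootstrap accuracy estimate:", np.mean(oob_scores))
```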
Best Answer
Background
Consider the following definitions:
- a data set $D$ of $n$ i.i.d. examples $z_i$ drawn from an unknown distribution $P$;
- a learning algorithm $A$, which maps a data set $D$ to a prediction function $f = A(D)$;
- a loss function $L(f, z)$, measuring the loss of $f$ on an example $z$.
We can then define two measures of interest:
- Prediction error: the expected loss on future test examples, $$PE(D) = E[L(f,z)],$$ where the expectation is taken with respect to $z$ sampled from $P$, and $f = A(D)$ is held fixed.
- Expected performance error: a more general measure, the expected loss over training sets of size $n$ sampled from $P$, $$EPE(n) = E[L(A(D),z)],$$ where the expectation is taken with respect to $D$ sampled from $P$ and $z$ independently sampled from $P$ as well.
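To make the distinction concrete, here is a small simulation sketch (the data-generating distribution $P$, the learner $A$ as ordinary least squares, and squared-error loss are my own assumptions, purely for illustration): $PE(D)$ holds one training set fixed and averages the loss over fresh $z \sim P$, whereas $EPE(n)$ additionally averages over training sets of size $n$:

```python
# Sketch: Monte Carlo illustration of PE(D) versus EPE(n).
# The distribution P, the learner A (ordinary least squares) and the
# squared-error loss L are assumptions made purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50  # training-set size

def sample(size):
    """Draw i.i.d. examples z = (x, y) from the assumed distribution P."""
    x = rng.uniform(-1, 1, size=(size, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.3, size=size)
    return x, y

def pe(train_x, train_y, n_test=20_000):
    """PE(D): average loss of f = A(D) over fresh z ~ P, for this fixed D."""
    f = LinearRegression().fit(train_x, train_y)        # f = A(D)
    test_x, test_y = sample(n_test)
    return np.mean((f.predict(test_x) - test_y) ** 2)   # approximates E_z[L(f, z)]

# PE(D) for one particular training set D of size n
D_x, D_y = sample(n)
print("PE(D) for one fixed D:", pe(D_x, D_y))

# EPE(n): additionally average over many training sets of size n drawn from P
print("EPE(n) estimate      :", np.mean([pe(*sample(n)) for _ in range(200)]))
```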
Cross validation estimator
In practice the data set $D$ is chunked into $K$ disjoint subsets of the same size, with $m = n / K$. Let us write $T_k$ for the $k$-th such block and $D_k$ for the training set obtained by removing the elements in $T_k$ from $D$. The cross-validation estimator is then the average, over the blocks, of the average loss on each held-out block: $$CV(D) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{m} \sum_{z_i \in T_k} L(A(D_k), z_i).$$
Once you have this cross-validation estimator $CV(D)$, you can construct bootstrap confidence intervals around it in the same way as for any other estimator: bootstrap your dataset, compute the estimator, and repeat many times...
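A rough sketch of that recipe (assuming scikit-learn and a percentile interval; the model, the data, and the 200 replicates are arbitrary choices, and accuracy, a score rather than a loss, stands in for $L$):

```python
# Sketch: percentile bootstrap confidence interval around the CV estimator CV(D).
# Assumes scikit-learn; the model, the data, and the 200 replicates are placeholders,
# and accuracy (a score rather than a loss) stands in for L.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=0)
rng = np.random.default_rng(0)

def cv_estimate(Xb, yb, K=10):
    """CV(D): average over the K blocks of the average loss on each held-out block."""
    return cross_val_score(LogisticRegression(max_iter=1000), Xb, yb, cv=K).mean()

point_estimate = cv_estimate(X, y)

boot_estimates = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))     # resample the data set with replacement
    boot_estimates.append(cv_estimate(X[idx], y[idx]))

low, high = np.percentile(boot_estimates, [2.5, 97.5])
print(f"CV(D) = {point_estimate:.3f}, 95% bootstrap interval = ({low:.3f}, {high:.3f})")
```

Note that resampling with replacement puts copies of the same instance into both the training and the held-out blocks of the inner cross-validation, which is one reason the questions below matter.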
The difficulty lies in understanding what you are actually computing and how close it is to the truth... Here are a few questions:
- Is the mean of $CV$ an estimator of $PE$ or of $EPE$?
- What about the variance of $CV$? Does it inform us about the uncertainty of $PE$ or of $EPE$?
- Under what conditions is the difference $|CV(D) - PE(D)|$ bounded?
- And lastly, how does bootstrapping affect all of the above?
This is an active topic of research, and there are different views, conclusions, and proofs to take into account. The paper linked below is a good place to start if you want to go into more depth.
Sources and further reading