Solved – Calculating the variance of a model

data mining, machine learning

I often hear about the bias-variance tradeoff as a way to evaluate classifiers, and now I want to calculate the bias and variance myself. To evaluate a binary classifier, I usually compute its AUC under 10-fold cross-validation. My advisor told me to compute the variance of the 10 cross-validation AUCs. He said this variance is not the true variance of the model, but it can give you an idea of how robust the model is and roughly what its true variance might be.

My questions:

  • Is this a valid approach for getting a rough estimate of the variance of the model?

  • Is it possible to calculate the true variance of the model? If so, how?

Best Answer

I'll start with the second question:

It's only possible to calculate the true variance if you have every possible randomly drawn data set $d \in D$ (or an equivalent PDF) and every possible input $x$ (or an equivalent PDF). This follows from the mathematical definition of the variance of a learning model/algorithm: $E_x\!\left[E_D\!\left[\left(g^d(x)-\bar{g}(x)\right)^2\right]\right]$, where $d \in D$ and $\bar{g}(x) = E_D\!\left[g^d(x)\right]$ is the average prediction over training sets. (slide source)

This will obviously never happen. The only time you can compute the variance exactly is when you generate the data yourself. Even then, both $D$ and $x$ will likely be infinite (if you have any continuous predictors or outputs), so you will only be approximating the variance by drawing large samples from $D$ and $x$.
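To make that approximation concrete, here is a minimal sketch on synthetic data. The data-generating process, the test grid, and the choice of a shallow decision tree as the learner are all illustrative assumptions, not part of the answer above: the learner is refit on many independently drawn datasets, and the spread of its predictions around their average approximates $E_x[E_D[(g^d(x)-\bar{g}(x))^2]]$.

```python
# Sketch: approximate the variance of a learner by simulation.
# The data-generating process and the decision-tree learner are assumptions for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def draw_dataset(n=100):
    """Draw one training set d from a known data-generating process (stands in for D)."""
    x = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=n)
    return x, y

# A fixed sample of inputs over which the outer expectation E_x is taken.
x_grid = np.linspace(-3, 3, 200).reshape(-1, 1)

# Fit the model on many independently drawn datasets and store g^d(x) on the grid.
preds = []
for _ in range(500):
    x_train, y_train = draw_dataset()
    model = DecisionTreeRegressor(max_depth=3).fit(x_train, y_train)
    preds.append(model.predict(x_grid))
preds = np.array(preds)                     # shape: (n_datasets, n_grid_points)

g_bar = preds.mean(axis=0)                  # g_bar(x) = E_D[g^d(x)]
variance = ((preds - g_bar) ** 2).mean()    # E_x[E_D[(g^d(x) - g_bar(x))^2]]
print(f"Approximate variance of the learner: {variance:.4f}")
```

Increasing the number of simulated datasets and the density of the input grid tightens the approximation of both expectations.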

First question:

If you are using 10-fold CV, you can treat the 10 different training sets as a sample of $D$ and the 10 different test sets as a sample of $x$. The main problem is that the training sets are not independent, since they overlap heavily (see the link Matthew Drury posted in the comments). I'm also not sure whether $k=10$ is the best choice of $k$ for estimating the variance of a model.
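As a rough illustration of that estimate, the sketch below computes the per-fold AUCs of a classifier under 10-fold CV and takes their sample variance. The synthetic dataset and the logistic-regression classifier are placeholders, not anything prescribed by the answer; substitute your own data and model.

```python
# Sketch: the rough variance estimate discussed above, i.e. the sample variance
# of the 10 per-fold AUCs. Dataset and classifier are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]   # predicted P(y=1) on the held-out fold
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"mean AUC: {np.mean(aucs):.3f}, variance of fold AUCs: {np.var(aucs, ddof=1):.5f}")
```

Keep in mind that because the 10 training sets overlap, this variance tends to understate the variability you would see across truly independent training sets.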

Additional Source:

In case anyone wants to learn more about this, I'd highly recommend this lecture. I found the derivation in the first 40-50 minutes to be helpful.