Solved – Cross-validation in semi-supervised learning

cross-validation, machine-learning, semi-supervised-learning

In semi-supervised learning, a labeled set $X_L$ and an unlabeled set $X_U$ are given. If the learning algorithm has several free parameters, we are forced to use cross-validation to tune them, and cross-validation can only be applied to the labeled set. So:

  1. If we have very few labeled examples (say, 1%–10% of the data), is it better to apply leave-one-out cross-validation (LOO-CV)?

  2. If the ratio of labeled samples is larger (e.g. 50%–60%), k-fold CV might be better (as LOO-CV can have high variance), but can we then assume that the folds are i.i.d.? Is it better to pick a low $k$ here?

What is the best way to validate a model in semi-supervised learning?
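To make the setup concrete, the "fold on labeled data only" scheme can be sketched as follows. This is a minimal sketch: the fold-splitting helper `kfold_indices` and the commented-out `fit`/`score` calls are hypothetical placeholders, not part of any particular library.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the indices of the labeled set and split them into k folds.
    (Hypothetical helper for illustration.)"""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

# Only the labeled set X_L is folded; the unlabeled set X_U is passed
# to training unchanged in every fold.
n_labeled = 20
folds = kfold_indices(n_labeled, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # model.fit(X_L[train_idx], y_L[train_idx], X_U)
    # model.score(X_L[test_idx], y_L[test_idx])
```

Note that with very few labeled points each validation fold is tiny, which is exactly why the LOO-CV vs. k-fold question above matters.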

Best Answer

Cross-validation procedures are not invalidated in this setting, provided you use a suitable score function.

A suitable score function is $\text{recall}^2 / \Pr(\hat{y}=1)$, i.e. the squared recall divided by the probability that the classifier predicts the positive class. Since $\text{precision} = \text{recall} \cdot \Pr(y=1) / \Pr(\hat{y}=1)$, this quantity equals $\text{precision} \cdot \text{recall} / \Pr(y=1)$, which makes it a surrogate for the F1-score in a PU (positive–unlabeled) learning setting. You can find an example of its use here.
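A sketch of that score, computed from 0/1 label and prediction arrays (the function name `pu_score` is my own, not a library API):

```python
import numpy as np

def pu_score(y_true, y_pred):
    """recall^2 / Pr(y_hat = 1): a surrogate for the F1-score when only
    positive labels are trusted (PU learning). y_true and y_pred are
    0/1 arrays; recall is estimated on the known positives only."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recall = (y_pred[y_true == 1] == 1).mean()  # fraction of positives recovered
    p_pred_pos = y_pred.mean()                  # fraction predicted positive
    if p_pred_pos == 0:
        return 0.0
    return recall ** 2 / p_pred_pos
```

Unlike the F1-score, this quantity is not bounded by 1 (it differs from precision·recall by the unknown factor $1/\Pr(y=1)$), but that constant factor does not matter when comparing models or hyperparameter settings in cross-validation.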
