Solved – Cross-validation in semi-supervised learning

cross-validation, machine-learning, semi-supervised-learning

In semi-supervised learning, a labeled set $X_L$ and an unlabeled set $X_U$ are given. If the learning algorithm has several free parameters, we are forced to use cross-validation to tune them, and cross-validation can only be applied to the labeled set. So:

  1. If we have very few labeled examples (say, 1%–10% of the data), is it better to apply leave-one-out cross-validation (LOO-CV)?

  2. If the ratio of labeled samples is larger (e.g. 50%–60%), k-fold CV might be better (as LOO-CV can have high variance), but can we then assume that the folds are i.i.d.? Is it better to pick a low $k$ here?

What is the best way to validate a model in semi-supervised learning?
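To make the setup concrete, the "fold on labeled data only" scheme can be sketched as follows. This is a minimal sketch: the fold-splitting helper `kfold_indices` and the commented-out `fit`/`score` calls are hypothetical placeholders, not part of any particular library.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the indices of the labeled set and split them into k folds.
    (Hypothetical helper for illustration.)"""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

# Only the labeled set X_L is folded; the unlabeled set X_U is passed
# to training unchanged in every fold.
n_labeled = 20
folds = kfold_indices(n_labeled, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # model.fit(X_L[train_idx], y_L[train_idx], X_U)
    # model.score(X_L[test_idx], y_L[test_idx])
```

Note that with very few labeled points each validation fold is tiny, which is exactly why the LOO-CV vs. k-fold question above matters.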

Best Answer

Cross-validation procedures are not invalidated in this setting, provided you use a suitable score function.

A suitable score function is $\text{recall}^2 / \Pr(\hat{y}=1)$, i.e. the squared recall divided by the probability that the classifier predicts the positive class. Since $\text{precision} = \text{recall} \cdot \Pr(y=1) / \Pr(\hat{y}=1)$, this quantity equals $\text{precision} \cdot \text{recall} / \Pr(y=1)$, which makes it a surrogate for the F1-score in a PU (positive–unlabeled) learning setting. You can find an example of its use here.
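A sketch of that score, computed from 0/1 label and prediction arrays (the function name `pu_score` is my own, not a library API):

```python
import numpy as np

def pu_score(y_true, y_pred):
    """recall^2 / Pr(y_hat = 1): a surrogate for the F1-score when only
    positive labels are trusted (PU learning). y_true and y_pred are
    0/1 arrays; recall is estimated on the known positives only."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recall = (y_pred[y_true == 1] == 1).mean()  # fraction of positives recovered
    p_pred_pos = y_pred.mean()                  # fraction predicted positive
    if p_pred_pos == 0:
        return 0.0
    return recall ** 2 / p_pred_pos
```

Unlike the F1-score, this quantity is not bounded by 1 (it differs from precision·recall by the unknown factor $1/\Pr(y=1)$), but that constant factor does not matter when comparing models or hyperparameter settings in cross-validation.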
