Generally, the problems of machine learning may be considered variations on function estimation for classification, prediction or modeling.
In supervised learning, one is furnished with inputs ($x_1$, $x_2$, ...) and outputs ($y_1$, $y_2$, ...) and is challenged with finding a function that approximates this behavior in a generalizable fashion. The output could be a class label (in classification) or a real number (in regression) -- these outputs are the "supervision" in supervised learning.
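As a minimal illustration of supervised function estimation, here is a toy regression in numpy: the inputs and outputs come from a hypothetical linear rule plus noise (the rule, sample size, and noise level are all made up for the example), and ordinary least squares recovers an approximating function.

```python
import numpy as np

# Toy supervised regression: outputs generated by an assumed rule
# y = 2x + 0.5 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)

# Design matrix with an intercept column, then least-squares fit.
X = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
```

The fitted `slope` and `intercept` land close to the generating values, which is exactly the "generalizable approximation" the definition asks for.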
In the base case of unsupervised learning, you receive inputs $x_1$, $x_2$, ..., but neither target outputs nor rewards from the environment are provided. Based on the problem (classification or prediction) and your background knowledge of the space sampled, you may use various methods: density estimation (estimating some underlying PDF for prediction), k-means clustering (classifying unlabeled real-valued data), k-modes clustering (classifying unlabeled categorical data), etc.
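To make the k-means case concrete, here is a bare-bones sketch (plain numpy, no library implementation): alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: assign points to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

On well-separated data this recovers the clusters without ever seeing a label, which is the sense in which it "classifies unlabeled real-valued data".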
Semi-supervised learning involves function estimation on labeled and unlabeled data. This approach is motivated by the fact that labeled data is often costly to generate, whereas unlabeled data is generally not. The challenge here mostly involves the technical question of how to treat data mixed in this fashion. See this Semi-Supervised Learning Literature Survey for more details on semi-supervised learning methods.
In addition to these kinds of learning, there are others, such as reinforcement learning, whereby the learning method interacts with its environment by producing actions $a_1$, $a_2$, ... that yield rewards or punishments $r_1$, $r_2$, ...
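The action/reward loop above can be sketched with the simplest reinforcement-learning setting, a multi-armed bandit; the arm payoffs, epsilon, and step count below are all invented for the illustration.

```python
import numpy as np

# Hypothetical 3-armed bandit: each action a_t yields a noisy reward r_t,
# and the learner maintains a running reward estimate per arm.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # assumed (hidden) arm payoffs

est = np.zeros(3)     # estimated reward per arm
counts = np.zeros(3)  # pulls per arm
for t in range(2000):
    # Epsilon-greedy: explore with probability 0.1, otherwise exploit.
    a = rng.integers(3) if rng.random() < 0.1 else int(est.argmax())
    r = rng.normal(true_means[a], 0.1)
    counts[a] += 1
    est[a] += (r - est[a]) / counts[a]  # incremental mean update
```

After enough interactions the estimates single out the best action, illustrating learning from rewards rather than from labeled examples.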
(3) doesn't have to be bad if you have some prior about what the clusters might look like; however, you wouldn't be using your labelled data optimally. As you point out, you can iteratively train a classifier on its own output.
(2) isn't that different from (3), really; it'll depend on how good your metric is.
(1) is what I would recommend, though it doesn't have to be S3VM. A Bayesian model would treat all the missing labels as latent variables and learn the posterior distribution of both the missing labels and the classifier's parameters.
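The iterative scheme mentioned in (3), training a classifier on its own output, can be sketched in a few lines. This is a deliberately naive version (a nearest-class-mean classifier, pseudo-labeling all unlabeled points at once, class labels assumed to be 0..k-1), not a production semi-supervised method:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=5):
    """Naive self-training: fit a nearest-class-mean classifier, label
    the unlabeled points with it, then refit on labeled + pseudo-labeled
    data.  Assumes class labels are 0..k-1."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
        d = np.linalg.norm(X_unlab[:, None, :] - means[None, :, :], axis=2)
        pseudo = d.argmin(axis=1)
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    return y[len(y_lab):]  # final pseudo-labels for the unlabeled points
```

The caveat in (3) applies directly: if the initial classifier is poor, its mistakes get baked into the pseudo-labels, which is why a confidence threshold or a proper latent-variable treatment as in (1) is usually preferable.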
Best Answer
It seems as if deep learning might be very interesting for you. This is a fairly recent field of deep connectionist models which are pretrained in an unsupervised way and fine-tuned afterwards with supervision. The fine-tuning requires far fewer samples than the pretraining.
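The two-phase idea, unsupervised pretraining followed by supervised fine-tuning on few labels, can be caricatured without any deep network at all. Below, PCA stands in for the unsupervised pretraining phase and a nearest-class-mean rule on the learned codes stands in for fine-tuning; all the data dimensions and class offsets are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plenty of unlabeled 10-d data whose signal lives in a 2-d subspace.
basis = rng.normal(size=(2, 10))
X_unlab = rng.normal(size=(500, 2)) @ basis \
          + rng.normal(scale=0.05, size=(500, 10))

# Phase 1 ("pretraining"): learn a 2-d code from unlabeled data alone.
mu = X_unlab.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlab - mu, full_matrices=False)

def encode(X):
    return (X - mu) @ Vt[:2].T  # project onto the top-2 principal axes

# Phase 2 ("fine-tuning"): only 10 labeled points, classified in code
# space with a nearest-class-mean rule.
y_lab = np.array([0, 1] * 5)
z_lab = np.column_stack([np.where(y_lab == 0, -3.0, 3.0), np.zeros(10)])
X_lab = z_lab @ basis + rng.normal(scale=0.05, size=(10, 10))
means = np.stack([encode(X_lab)[y_lab == c].mean(axis=0) for c in (0, 1)])

def predict(X):
    d = np.linalg.norm(encode(X)[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)
```

The point mirrors the answer: the expensive representation learning uses only unlabeled data, and the supervised phase then gets away with a handful of labels.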
To whet your appetite, I recommend Semantic Hashing by Salakhutdinov and Hinton. Have a look at the codes it finds for distinct documents of the Reuters corpus (unsupervised!).
If you need some code implemented, check out deeplearning.net. I don't believe there are out of the box solutions, though.