There are two aspects to all the different terms you have given:
1] Process of obtaining training data
2] Algorithm that trains $f$ or the classifier
The algorithm that trains $f$, regardless of how the training data is obtained, is supervised. The differences among distant supervision, self-learning, self-supervised learning, and weak supervision then lie purely in how the training data is obtained.
Traditionally, any machine learning paper on supervised learning implicitly assumes that the training data is available and, for what it's worth, that the labels are precise: there is no ambiguity in the labels given to the instances in the training data. With distant/weak supervision papers, however, people realized that their training data has imprecise labels, and what they usually want to highlight in their work is that they obtain good results despite the obvious drawback of using imprecise labels (they may also have additional algorithmic ways to overcome that issue, such as extra filtering steps, and the papers usually like to highlight that these additional processes are important and useful). This gave rise to the terms "weak" and "distant" to indicate that the labels on the training data are imprecise. Note that this does not necessarily affect the learning aspect of the classifier: the classifier these approaches use still implicitly assumes that the labels are precise, and the training algorithm is hardly ever changed.
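To make this concrete, here is a minimal weak-supervision sketch in Python with scikit-learn (the documents and the keyword heuristic are invented purely for illustration). The labels come from a noisy heuristic, but the classifier itself is an ordinary supervised learner that treats them as if they were gold labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Unlabeled documents; in practice this pool is large and cheap to obtain.
unlabeled_docs = [
    "the film was a masterpiece, truly moving",
    "terrible plot, awful acting, a waste of time",
    "brilliant performances and a moving score",
    "awful pacing and a terrible script",
]

# A noisy labeling heuristic (the "weak" part): label by keyword matching.
# These keyword lists are made up; real heuristics are usually domain
# rules, knowledge bases, or other cheap-but-imprecise signals.
def weak_label(doc):
    if any(w in doc for w in ("masterpiece", "brilliant", "moving")):
        return 1  # positive
    if any(w in doc for w in ("terrible", "awful", "waste")):
        return 0  # negative
    return None  # the heuristic abstains

# Apply the heuristic; keep only documents it labeled (a simple filter step).
pairs = [(d, weak_label(d)) for d in unlabeled_docs]
docs, labels = zip(*[(d, y) for d, y in pairs if y is not None])

# The training step is ordinary supervised learning: it consumes the
# noisy labels exactly as if they were precise gold labels.
X = TfidfVectorizer().fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
```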
Self-training, on the other hand, is somewhat special in that sense. As you have already observed, it obtains its labels from its own classifier and has a bit of a feedback loop for correction. Generally, we study supervised classifiers under the slightly larger purview of "inductive" algorithms, where the classifier learnt is an inductive inference made from the training data about the entire data distribution. People have also studied another form, called transductive inference, where the output of the algorithm is not a general inductive rule; instead, the algorithm takes both the training data and the test data as input and produces labels on the test data only. At some point, people figured: why not use transductive inference within inductive learning to obtain a classifier trained on a larger training set? This is simply referred to as induction with unlabeled data [1], and self-training comes under that.
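A minimal sketch of that feedback loop, using plain scikit-learn and a made-up confidence threshold of 0.9:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: a small labeled set plus a larger pool whose labels we hide.
X, y = make_classification(n_samples=500, random_state=0)
X_lab, y_lab = X[:50], y[:50]   # labeled training data
X_unlab = X[50:]                # unlabeled pool

THRESHOLD = 0.9  # confidence cutoff; an illustrative choice

for _ in range(5):  # a few self-training rounds
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break

    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= THRESHOLD
    if not confident.any():
        break  # nothing confident left to pseudo-label

    # The classifier labels part of the pool with its own predictions
    # (pseudo-labels) and absorbs those points into the training set.
    pseudo = clf.classes_[proba[confident].argmax(axis=1)]
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, pseudo])
    X_unlab = X_unlab[~confident]

final_clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
```

scikit-learn also ships a ready-made version of this loop as sklearn.semi_supervised.SelfTrainingClassifier, if you'd rather not write it by hand.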
Hopefully, I have not further confused you; feel free to comment and ask for more clarification if necessary.
[1] Might be useful: http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdf2527.pdf
Your intuition is right: machine learning is based on maths, not magic. If you can't give your machine enough examples of $(X,Y)$ pairs (where $X$ is an input vector of customer attributes and past behaviour, say, and $Y$ is 1 if the customer churned and 0 otherwise), there's no way it can learn any function that maps from $X$ to $Y$. In other words, you can't build a supervised ML model without having data about the target variable.
As you also say, I think your best shot is to train a model on the labelled dataset (the one that contains both $X$ and $Y$). If its performance on a held-out test set is good enough, and that data isn't too different from the unlabelled data (the one missing $Y$), you can then predict on the unlabelled data and hope for the best. In the best of cases, "hoping for the best" would involve designing an adequate experiment to find out how well your model actually did on the unlabelled dataset.
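A minimal sketch of that workflow in Python with scikit-learn (the file names, the `churned` column, and the model choice are all hypothetical placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical files: one dataset with the churn label, one without.
labelled = pd.read_csv("customers_with_churn.csv")   # has X and Y
unlabelled = pd.read_csv("customers_no_churn.csv")   # has X only

features = [c for c in labelled.columns if c != "churned"]
X, y = labelled[features], labelled["churned"]

# Hold out a test set to estimate how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# If (and only if) the two datasets look similar, score the unlabelled one.
unlabelled["churn_probability"] = model.predict_proba(
    unlabelled[features]
)[:, 1]
```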
On a separate but related note, you might also want to explore techniques other than supervised ML (on your labelled data). For example, churn problems lend themselves quite well to survival-analysis techniques.
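For instance, with the lifelines library, a Cox proportional-hazards model gives you a per-customer churn risk over time rather than a single yes/no label. A minimal sketch, where the file and column names are hypothetical and all features are assumed to be numeric already:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical dataset: one row per customer, with how long we observed
# them ("tenure_months") and whether they churned during that window.
df = pd.read_csv("customers_with_churn.csv")

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="churned")
cph.print_summary()  # hazard ratios per customer attribute

# Relative churn risk per customer (higher = more likely to churn sooner).
risk = cph.predict_partial_hazard(df)
```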
Best Answer
A distant-supervision algorithm usually has the following steps:
1] It may have some labeled training data
2] It has access to a pool of unlabeled data
3] It has an operator that allows it to sample from this unlabeled data and label it, and this operator is expected to be noisy in its labels
4] The algorithm then collectively utilizes the original labeled training data (if it had any) together with this newly, noisily labeled data to give the final output (see the sketch after this list).
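To make those four steps concrete, here is a minimal sketch in Python with scikit-learn. The classic instance is relation extraction, where a knowledge base of known entity pairs acts as the noisy labeling operator; the sentences, the tiny knowledge base, and the `born_in` relation below are all made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 2]: a pool of unlabeled sentences.
sentences = [
    "Barack Obama was born in Honolulu .",
    "Honolulu is a lovely place that Barack Obama visited .",
    "Albert Einstein was born in Ulm .",
    "Paris is the capital of France .",
]

# Step 3]: a noisy labeling operator. Any sentence mentioning both a
# person and their known birthplace is labeled "born_in" -- noisily,
# since co-occurrence is not the relation (the second sentence above
# is a false positive).
knowledge_base = {("Barack Obama", "Honolulu"), ("Albert Einstein", "Ulm")}

def distant_label(sentence):
    for person, place in knowledge_base:
        if person in sentence and place in sentence:
            return 1  # claimed instance of born_in
    return 0

labels = [distant_label(s) for s in sentences]

# Step 4]: any off-the-shelf supervised learner consumes the noisy
# labels (optionally pooled with gold-labeled data from step 1]).
X = CountVectorizer().fit_transform(sentences)
clf = MultinomialNB().fit(X, labels)
```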
Now, to answer your question: both you and the site are correct. You are looking at the 4th step of the algorithm and notice that, at that step, one can use any algorithm one has access to. Hence your point that "it can be applied to both supervised and semi-supervised learning".
The site, on the other hand, is looking at steps 1-4 collectively and notices that the noisily labeled data is obtained from a pool of unlabeled data (with or without the use of some pre-existing labeled training data). Since this process of obtaining noisy labels is an essential component of any distant-supervision algorithm, the site calls it a semi-supervised algorithm.