Do I understand you correctly that you want to measure whether C1 is a faster/slower learner than C2?
With unlimited training data, I'd definitely construct (measure) the learning curves. That allows you to discuss both questions you pose.
As Dikran already hints, the learning curve has a variance as well as a bias component: training on smaller data gives systematically worse models, but there is also higher variance between different models trained with smaller $n_{train}$, which I'd also include in a discussion of which classifier is better.
Make sure you test with large enough test sample size: proportions of counts (such as classifier accuracy) suffer from high variance which can mess up your conclusions. As you have an unlimited data source, you are in the very comfortable situation that it is actually possible to measure the learning curves without too much additional testing error on them.
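To make this concrete, here is a minimal sketch (Python/scikit-learn, not from the question) of measuring such a learning curve with repeated training draws per sample size and one large, fixed test set. The synthetic data and the logistic regression are placeholders for your actual data source and classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the "unlimited" data source: one large pool,
# a big fixed test set, and fresh training draws from the remainder.
X_all, y_all = make_classification(n_samples=300_000, n_informative=10,
                                   random_state=0)
X_test, y_test = X_all[:100_000], y_all[:100_000]
X_pool, y_pool = X_all[100_000:], y_all[100_000:]
rng = np.random.default_rng(0)

def draw_train(n):
    """Stand-in for drawing n fresh training samples from your data source."""
    idx = rng.choice(len(y_pool), size=n, replace=False)
    return X_pool[idx], y_pool[idx]

n_train_grid = [100, 300, 1_000, 3_000, 10_000]   # training sample sizes
n_repeats = 10                                    # repeats -> variance estimate

for n_train in n_train_grid:
    scores = []
    for _ in range(n_repeats):
        X_tr, y_tr = draw_train(n_train)          # fresh training set each time
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    # mean ~ bias component of the learning curve, sd ~ model-to-model variance
    print(n_train, np.mean(scores), np.std(scores, ddof=1))
```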
I just got a paper accepted that summarizes some thoughts and findings about Sample Size Planning for Classification Models. The DOI does not yet function, but anyway, here's the accepted manuscript at arXiv.
Of course, computation time is now a consideration. Some thoughts on this:
- How much computer time you are willing to spend will depend on what you need the comparison for.
- If it's just about finding a practically working set-up, I'd be pragmatic also about the time to get to a decision.
- If it's a scientific question, I'd quote my old supervisor: "Computer time is not a scientific argument". This is meant in the sense that saving a couple of days or even a few weeks of server time by compromising the conclusions you can draw is not a good idea*.
- The more so, as having better calculations doesn't necessarily require more of your time here: the time to set up the calculations will be roughly the same whether you calculate on a fine grid of training sample sizes or a rough one, or whether you measure variance by 1000 iterations or just by 10. This means that you can do the calculations in an order that gives you a "sneak preview" of the results quite fast, then sketch the results, and pull in the fine-grained numbers at the end (a small sketch of such an ordering follows after the footnote below).
(*) I may add that I come from an experimental field where you easily spend months or years on sample collection and weeks or months on measurements, which don't do themselves the way a simulation runs on a server, either.
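Purely as an illustration of that ordering idea (reusing the placeholder set-up from the first sketch): a coarse pass with few repeats gives the sneak preview, later passes refine the curve.

```python
# Illustrative ordering only: coarse grid with few repeats first, then refine.
coarse_pass = [(n, 2)  for n in (100, 1_000, 10_000)]             # (n_train, repeats)
fine_pass   = [(n, 10) for n in (100, 300, 1_000, 3_000, 10_000)]

for n_train, n_repeats in coarse_pass + fine_pass:
    ...  # run the train/test loop from the first sketch with these settings
```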
Update about bootstrapping / cross validation
It is certainly possible to use (iterated/repeated) cross validation or out-of-bootstrap testing to measure the learning curve. Using resampling schemes instead of a proper independent test set is sensible if you are in a small-sample-size situation, i.e. you do not have enough independent samples for training a good classifier and properly measuring its performance. According to the question, this is not the case here.
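For completeness, a minimal sketch of what such a resampling estimate could look like in scikit-learn (synthetic stand-in data, repeated stratified cross validation); this is only relevant in the small-sample situation just described.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in small data set

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
# mean estimates performance at n_train ~ 0.8 * len(y); the spread mixes
# model (training) variance and the test variance from the small folds
print(scores.mean(), scores.std(ddof=1))
```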
Data-driven model optimization
One more general point: choosing a "working point" (i.e. training sample size here) from the learning curve is a data-driven decision. This means that you need to do another independent validation of the "final" model (trained with that sample size) with another independent test set. However, if your test data for measuring the learning curve was independent and had huge (really large) sample size, then your risk of overfitting to that test set is minute. I.e. if you find a drop in performance for the final test data, that indicates either too small a test sample size for determining the learning curve or a problem in your data analysis set-up (data not independent, training data leaking into test data).
Update 2: limited test sample size
is a real problem. Comparing many classifiers (each $n_{train}$ you evaluate ultimately leads to one classifier!) is a multiple-testing problem from a statistics point of view. That means that judging them all by the same test set "skims" the variance (uncertainty) of the testing, which leads to overfitting.
(This is just another way to express the danger of cherry-picking Dikran commented about)
You really need to reserve an independent test set for final evaluation, if you want to be able to state the accuracy of the finally chosen model.
While it is hard to overfit to a test set of millions of instances, it is much easier to overfit to 350 samples per class.
Therefore, the paper I linked above may be of more interest for you than I initially thought: it also shows how to calculate how many test samples you need to show e.g. superiority of one classifier (with fixed hyperparameters) over another. As you can test all models with the same test set, you may be lucky and able to somewhat reduce the required test sample size by doing paired tests here. For a paired comparison of 2 classifiers, McNemar's test would be a keyword.
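A hedged sketch of such a paired comparison with McNemar's test via `statsmodels` follows; the labels and predictions below are random stand-ins for your final test set and the two classifiers' outputs.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_test  = rng.integers(0, 2, size=700)   # stand-ins for the final test labels
pred_c1 = rng.integers(0, 2, size=700)   # ... and the two classifiers' predictions
pred_c2 = rng.integers(0, 2, size=700)

correct_c1 = pred_c1 == y_test
correct_c2 = pred_c2 == y_test

# 2x2 table of paired outcomes: rows = C1 correct/wrong, cols = C2 correct/wrong
table = np.array([
    [np.sum( correct_c1 &  correct_c2), np.sum( correct_c1 & ~correct_c2)],
    [np.sum(~correct_c1 &  correct_c2), np.sum(~correct_c1 & ~correct_c2)],
])
result = mcnemar(table, exact=True)      # exact binomial test on discordant pairs
print(result.statistic, result.pvalue)
```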
"Why exactly does a classifier need the same prevalence in the train and test sets?"
Perhaps my answer to a related question on the DS SE might help
Doesn't over(/under)sampling an imbalanced dataset cause issues?
Yes, the classifier will expect the relative class frequencies in
operation to be the same as those in the training set. This means
that if you over-sample the minority class in the training set, the
classifier is likely to over-predict that class in operational use.
To see why, it is best to consider probabilistic classifiers, where the
decision is based on the posterior probability of class membership
$p(C_i|x)$. Using Bayes' rule, this can be written as
$p(C_i|x) = \frac{p(x|C_i)p(C_i)}{p(x)}\qquad$ where $\qquad p(x) = \sum_j p(x|C_j)p(C_j)$,
so we can see that the decision depends on the prior probabilities of
the classes, $p(C_i)$. If the prior probabilities in the training
set are different from those in operation, the operational performance
of our classifier will be suboptimal, even if it is optimal for the
training-set conditions.
Some classifiers have a problem learning from imbalanced datasets, so
one solution is to oversample the classes to ameliorate this bias in
the classifier. There are two approaches. The first is to oversample
by just the right amount to overcome this (usually unknown) bias and
no more, but that is really difficult. The other approach is to
balance the training set and then post-process the output to
compensate for the difference in training set and operational priors.
We take the output of the classifier trained on an oversampled dataset
and multiply by the ratio of operational and training set prior
probabilities,
$q_o(C_i|x) \propto p_t(x|C_i)p_t(C_i) \times \frac{p_o(C_i)}{p_t(C_i)} = p_t(x|C_i)p_o(C_i)$
Quantities with the o subscript relate to operational conditions and
those with the t subscript relate to training-set conditions. I have
written this as $q_o(C_i|x)$ as it is an un-normalised probability,
but it is straightforward to renormalise them by dividing by the sum
of $q_o(C_i|x)$ over all classes. For some problems it may be better
to use cross-validation to choose the correction factor, rather than
the theoretical value used here, as it depends on the bias in the
classifier due to the imbalance.
So in short, for imbalanced datasets, use a probabilistic classifier
and oversample (or reweight) to get a balanced dataset, in order to
overcome the bias a classifier may have for imbalanced datasets. Then
post-process the output of the classifier so that it doesn't
over-predict the minority class in operation.
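To illustrate the post-processing step described in the quoted answer, here is a minimal sketch of the prior correction $q_o(C_i|x) \propto p_t(C_i|x)\,p_o(C_i)/p_t(C_i)$ with renormalisation; all numbers are made up.

```python
import numpy as np

def correct_priors(proba, train_priors, oper_priors):
    """Re-weight class posteriors obtained under the (over)balanced training
    conditions to the operational class frequencies, then renormalise rows."""
    q = proba * (np.asarray(oper_priors) / np.asarray(train_priors))
    return q / q.sum(axis=1, keepdims=True)

# e.g. balanced 50/50 training set, 99:1 operational prevalence (made-up numbers)
proba = np.array([[0.6, 0.4],
                  [0.3, 0.7]])
print(correct_priors(proba, train_priors=[0.5, 0.5], oper_priors=[0.99, 0.01]))
```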
Specific issues:
If we have a very imbalanced data set (say 99:1 for two classes), I
don't see why balancing the training set would introduce any problems.
It doesn't present a problem, provided you post-process the output of the model to compensate for the difference in training set and operational class frequencies. If you don't perform that adjustment (or you use a discrete yes-no classifier) you will over-predict the minority class for the reason given above.
"fans of “classifiers” sometimes subsample from observations in the
most frequent outcome category (here Y=1) to get an artificial 50/50
balance of Y=0 and Y=1 when developing their classifier. Fans of such
deficient notions of accuracy fail to realize that their classifier
will not apply to a population with a much different prevalence of Y=1
than 0.5."
I don't think this accurately represents the situation. The reason for balancing is actually that the minority class is "more important" in some sense than the majority class, and the rebalancing is an attempt to include misclassification costs so that the classifier works better in operational conditions. However, a lot of blogs don't explain that properly, so a lot of practitioners are rather misinformed about it.
Best Answer
This is more of an extended comment, as you have not given sufficient information to give detailed advice. Also, I have no experience with such a large-scale problem, and I suspect few really have. You say "I am designing a scikit learn classifier which has 5000+ categories and training data is at least 80 million and may grow up to an additional 100 million each year", which is a HUGE problem, and probably a major research project. You should take time to look at some papers describing similar efforts, like http://vision.stanford.edu/documents/DengBergLiFei-Fei_ECCV2010.pdf which describes trying to classify millions of images into 1000+ categories. I will cite a few paragraphs to show the immensity of the project:
"weeks, for a single run of one experiment, on a cluster of 66 machines"
Do you have the resources for such a project?
If not, and even then, you should start out with some simplified project, see how that goes, and continue from that.
One idea: with thousands of categories, there must be some hierarchical structure to the space of categories. If you can start mapping out that space, maybe organizing the categories in a binary tree, you could try a binary classifier for each level of the tree. Just a thought!
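To make that idea a bit more tangible, here is a rough sketch of such a tree of binary classifiers. The `split_categories` function is only a stub, since finding a good split of the 5000+ categories is exactly the hard, domain-specific part, and the tiny synthetic data set with 8 categories merely stands in for the real problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def split_categories(categories):
    """Hypothetical stand-in: split a list of categories into two groups.
    In practice this is where the domain-specific hierarchy would come in."""
    mid = len(categories) // 2
    return categories[:mid], categories[mid:]

class TreeNode:
    """One node of a binary tree over categories: a binary classifier
    routes each sample to the left or right subtree."""
    def fit(self, X, y):
        cats = sorted(set(y))
        if len(cats) == 1:                       # leaf: only one category left
            self.label, self.clf = cats[0], None
            return self
        left, _right = split_categories(cats)
        go_left = np.isin(y, left)
        self.clf = LogisticRegression(max_iter=1000).fit(X, go_left)
        self.left = TreeNode().fit(X[go_left], y[go_left])
        self.right = TreeNode().fit(X[~go_left], y[~go_left])
        return self

    def predict_one(self, x):
        node = self
        while node.clf is not None:              # descend until a leaf is reached
            node = node.left if node.clf.predict(x.reshape(1, -1))[0] else node.right
        return node.label

# tiny synthetic stand-in for the real problem (8 categories instead of 5000+)
X, y = make_classification(n_samples=400, n_classes=8, n_informative=6,
                           random_state=0)
tree = TreeNode().fit(X, y)
print(tree.predict_one(X[0]), y[0])
```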
Another idea: mapping out the space of categories with something like multidimensional scaling would give coordinates to the categories, and then you could build a predictor for those coordinates. Something like that could work, or not; we do not know until somebody tries! I guess this is really white spots on the map ...
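Equally speculative, a sketch of that coordinate idea: embed the categories with multidimensional scaling of some category-dissimilarity matrix (where that matrix comes from is the open question), regress the coordinates from the features, and predict the nearest category. Everything below is illustrative stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_categories = 50                                   # stand-in for 5000+

# stand-in category dissimilarities (in reality: derived from domain knowledge)
dissim = rng.random((n_categories, n_categories))
dissim = (dissim + dissim.T) / 2
np.fill_diagonal(dissim, 0)

coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)  # (n_categories, 3)

# stand-in training data: features X and integer category labels y
X = rng.random((1_000, 20))
y = rng.integers(0, n_categories, size=1_000)

# regress the category coordinates, then classify by the nearest category
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, coords[y])
pred = reg.predict(X[:5])                           # predicted coordinates, (5, 3)
dists = np.linalg.norm(pred[:, None, :] - coords[None, :, :], axis=-1)
print(dists.argmin(axis=1))                         # nearest-category predictions
```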
Good luck!