Do I understand you correctly that you want to measure whether C1 is a faster/slower learner than C2?
With unlimited training data, I'd definitely construct (measure) the learning curves. That allows you to address both questions you pose.
As Dikran already hints, the learning curve has a variance as well as a bias component: training on smaller data sets gives systematically worse models, but there is also higher variance between different models trained with small $n_{train}$. I'd include both aspects in a discussion of which classifier is better.
Make sure you test with large enough test sample size: proportions of counts (such as classifier accuracy) suffer from high variance which can mess up your conclusions. As you have an unlimited data source, you are in the very comfortable situation that it is actually possible to measure the learning curves without too much additional testing error on them.
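To make the "high variance of proportions" point concrete, here is a minimal sketch (my own illustration, not from the question or the paper) of the binomial standard error of an observed accuracy; the function name and numbers are hypothetical:

```python
import math

def accuracy_se(p_hat, n_test):
    """Standard error of a proportion (e.g. observed classifier accuracy)
    estimated from n_test independent test cases: sqrt(p(1-p)/n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n_test)

# With a true accuracy around 90 %, 100 test cases still leave a standard
# error of 3 percentage points; 10 000 cases shrink it to 0.3 points:
print(accuracy_se(0.9, 100))     # ~0.03
print(accuracy_se(0.9, 10_000))  # ~0.003
```

This is why an (effectively) unlimited test set makes the learning curve measurable with negligible testing error on each point.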
I just got a paper accepted that summarizes some thoughts and findings about Sample Size Planning for Classification Models. The DOI does not function yet, but anyway, here's the accepted manuscript at arXiv.
Of course, computation time is your consideration now. Some thoughts on this:
how much computer time you are willing to spend will depend on what you need your comparison for.
if it's just about finding a practically working set-up, I'd be pragmatic also about the time to get to a decision.
if it's a scientific question, I'd quote my old supervisor: "Computer time is not a scientific argument". This is meant in the sense that saving a couple of days or even a few weeks of server time by compromising the conclusions you can draw is not a good idea*.
The more so as better calculations don't necessarily require more of your time here: setting up the calculations takes roughly the same effort whether you calculate on a fine grid of training sample sizes or a coarse one, and whether you measure variance with 1000 iterations or just 10. This means you can run the calculations in an order that gives you a "sneak preview" of the results quite fast, sketch the results from that, and pull in the fine-grained numbers at the end.
(*) I may add that I come from an experimental field where you easily spend months or years on sample collection and weeks or months on measurements, which don't run themselves the way a simulation runs on a server, either.
Update about bootstrapping / cross validation
It is certainly possible to use (iterated/repeated) cross validation or out-of-bootstrap testing to measure the learning curve. Using resampling schemes instead of a proper independent test set is sensible if you are in a small-sample-size situation, i.e. you do not have enough independent cases both to train a good classifier and to properly measure its performance. According to the question, this is not the case here.
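For completeness, a minimal out-of-bootstrap sketch (toy data and a trivial threshold classifier, purely illustrative): train on a bootstrap resample, test on the cases that resample left out, and repeat.

```python
import numpy as np

rng = np.random.default_rng(1)

def out_of_bootstrap_error(x, y, n_boot=200):
    """Out-of-bootstrap error of a toy midpoint-threshold classifier:
    train on a resample drawn with replacement, test on the left-out cases."""
    n = len(x)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # bootstrap resample
        oob = np.setdiff1d(np.arange(n), idx)       # cases not in the resample
        if oob.size == 0:
            continue
        # threshold halfway between the two class means of the resample
        thr = (x[idx][y[idx] == 0].mean() + x[idx][y[idx] == 1].mean()) / 2
        pred = (x[oob] > thr).astype(int)
        errs.append((pred != y[oob]).mean())
    return np.mean(errs), np.std(errs)

# small-sample situation: 30 simulated cases per class
x = np.concatenate([rng.normal(-1, 1, 30), rng.normal(1, 1, 30)])
y = np.array([0] * 30 + [1] * 30)
print(out_of_bootstrap_error(x, y))
```

The std across bootstrap iterations also gives you a handle on the model-instability part of the variance discussed above.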
Data-driven model optimization
One more general point: choosing a "working point" (i.e. the training sample size here) from the learning curve is a data-driven decision. This means that you need to do another independent validation of the "final" model (trained with that sample size) on yet another independent test set. However, if your test data for measuring the learning curve was independent and had a huge (really large) sample size, your risk of overfitting to that test set is minute. I.e. if you find a drop in performance on the final test data, that indicates either too small a test sample size for determining the learning curve, or a problem in your data analysis set-up (data not independent, training data leaking into test data).
Update 2: limited test sample size
is a real problem. Comparing many classifiers (each $n_{train}$ you evaluate ultimately yields one classifier!) is a multiple-testing problem from a statistics point of view. That means that judging all of them by the same test set "skims" the variance uncertainty of the testing, which leads to overfitting.
(This is just another way to express the danger of cherry-picking that Dikran commented about.)
You really need to reserve an independent test set for final evaluation, if you want to be able to state the accuracy of the finally chosen model.
While it is hard to overfit to a test set of millions of instances, it is much easier to overfit to 350 samples per class.
Therefore, the paper I linked above may be of more interest to you than I initially thought: it also shows how to calculate how many test samples you need to demonstrate e.g. superiority of one classifier (with fixed hyperparameters) over another. As you can test all models with the same test set, you may be lucky and be able to somewhat reduce the required test sample size by doing paired tests. For a paired comparison of 2 classifiers, McNemar's test would be a keyword.
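A minimal sketch of the exact McNemar test (my own stdlib implementation of the standard formula, not code from the paper): it needs only the two discordant counts from the paired predictions on the shared test set.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant counts:
    b = cases only classifier A got wrong, c = cases only classifier B got wrong.
    Under H0 (equal error rates) the discordant cases split Binomial(b+c, 0.5)."""
    n = b + c
    k = min(b, c)
    p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)  # two-sided: double the smaller tail

# e.g. A wrong on 2 cases that B got right, B wrong on 8 cases that A got right:
print(mcnemar_exact(2, 8))  # 0.109375 -> not significant at the usual 5 % level
```

Note that only the discordant cases enter the test — cases both classifiers get right (or both get wrong) carry no information about which one is better, which is exactly where the paired design saves test sample size.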
If all you have are the P/R/F1 scores of the two systems/classifiers, there is no way of testing whether the difference between them is statistically significant. For McNemar's test, as you suggested, you would need the predictions of the two systems.
If you have other labeled data and the implementations of the two systems, you can test those on that data (shuffling and 5- or 10-fold cross-validating several times) so that you can perform a statistical test.
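The shuffle-and-repeat scheme can be sketched as follows (simulated data and a toy threshold classifier standing in for the real systems; function names are mine). Running both systems on identical splits yields paired per-fold scores you can feed into a paired test — though keep in mind that fold scores from repeated CV are correlated, so naive p-values tend to be optimistic.

```python
import numpy as np

rng = np.random.default_rng(3)

def repeated_cv_scores(x, y, fit_predict, n_repeats=5, n_folds=5):
    """Accuracy per fold over several shuffled k-fold splits.
    fit_predict(x, y, train_idx, test_idx) -> predicted labels for the test fold."""
    n = len(y)
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)                  # reshuffle before each k-fold run
        for fold in np.array_split(perm, n_folds):
            train = np.setdiff1d(perm, fold)
            pred = fit_predict(x, y, train, fold)
            scores.append((pred == y[fold]).mean())
    return np.array(scores)

# toy system: threshold halfway between the two class means of the training fold
def threshold_clf(x, y, train, test):
    thr = (x[train][y[train] == 0].mean() + x[train][y[train] == 1].mean()) / 2
    return (x[test] > thr).astype(int)

x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
y = np.array([0] * 100 + [1] * 100)
scores = repeated_cv_scores(x, y, threshold_clf)
print(scores.mean(), scores.std())
```

For two real systems you would call `repeated_cv_scores` with the same splits for both and compare the paired score vectors.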
Best Answer
First of all, before testing you need to define a couple of things: do all classification errors have the same "cost"? Then choose a single measurement parameter. I usually choose MCC for binary data and Cohen's kappa for k-category classification. Next, it is very important to define the minimal difference that is significant in your domain. When I say "significant" I don't mean statistically significant (i.e. p < 1e-9), but practically significant. Most of the time, an improvement of 0.01% in classification accuracy means nothing, even if it has a nice p-value.
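For the binary case, MCC is straightforward to compute from the confusion matrix; here is a small sketch (my own implementation of the standard formula, with hypothetical counts):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from a binary confusion matrix.
    Ranges from -1 to +1; 0 corresponds to chance-level prediction."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# hypothetical confusion matrix: 40 TP, 10 FP, 5 FN, 45 TN
print(round(mcc(40, 10, 5, 45), 3))  # 0.704
```

(For the k-category case, scikit-learn's `cohen_kappa_score` covers Cohen's kappa, so you rarely need to implement it yourself.)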
Now you can start comparing the methods. What are you testing? Is it the predictor sets, the model-building process, or the final classifiers? In the first two cases I would generate many bootstrap models from the training set data and test them on bootstrap samples from the testing set data. In the last case I would use the final models to predict bootstrap samples from the testing set data. If you have a reliable way to estimate the noise in the data parameters (predictors), you may also add it to both the training and the testing data. The end result will be two histograms of the measurement values, one for each classifier. You can then test these histograms for mean value, dispersion, etc.
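The final-classifier case can be sketched like this (toy labels and predictions, accuracy as a stand-in for MCC/kappa; everything here is hypothetical): bootstrap the test set and collect one histogram of the metric per classifier.

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_metric(y_true, pred_a, pred_b, n_boot=1000):
    """Resample the test set with replacement and collect the accuracy of each
    classifier's (fixed) predictions on every resample -> one histogram each."""
    n = len(y_true)
    acc_a, acc_b = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        acc_a.append((pred_a[idx] == y_true[idx]).mean())
        acc_b.append((pred_b[idx] == y_true[idx]).mean())
    return np.array(acc_a), np.array(acc_b)

# toy predictions: classifier A is right ~85 % of the time, B ~80 %
y = rng.integers(0, 2, 500)
pa = np.where(rng.random(500) < 0.85, y, 1 - y)
pb = np.where(rng.random(500) < 0.80, y, 1 - y)
hist_a, hist_b = bootstrap_metric(y, pa, pb)
print(hist_a.mean(), hist_a.std())
print(hist_b.mean(), hist_b.std())
```

Comparing the locations and spreads of `hist_a` and `hist_b` is exactly the "two histograms" comparison described above.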
Two last notes: (1) I'm not aware of a way to account for model complexity when dealing with classifiers, so a better apparent performance may be a result of overfitting. (2) Having two separate data sets is a good thing, but as I understand from your question, you used both sets many times, which means that testing-set information "leaks" into your models. Make sure you have another, validation data set that is used only once, after you have made all the decisions.
Clarifications following notes
In your notes you said that "previous papers usually present such kind [i.e. 1%] of improvements". I'm not familiar with this field, but the fact that people publish 1% improvements in papers does not mean these improvements are significant :-)
Regarding the t-test, I think it would be a good choice, provided that the data is normally distributed (or converted to a normal distribution) or that you have enough data samples, which you most probably will.