It's a good question. Personally, my answer would be that it never makes sense to throw data away (unless it is for computational reasons), as the more data you have, the better your model of the world can be. Therefore, I would suggest that modifying the cost function in an appropriate way for your task should be sufficient. For example, if you are interested in one particular rare class, you can make misclassifications of that class alone more expensive; if you are interested in a balanced measure, something like the Balanced Error Rate (the average of the errors on each class) or the Matthews Correlation Coefficient is appropriate; if you are interested only in overall classification error, the traditional 0-1 loss will do.
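To make this concrete, here is a minimal sketch (assuming scikit-learn and a synthetic imbalanced dataset; the class-weight values are purely illustrative) of up-weighting errors on the rare class instead of discarding majority-class examples, then reporting the balanced error rate and MCC:

```python
# Hedged sketch: cost-sensitive classification via class weights (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

# Synthetic data with a 95%/5% class imbalance.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Make misclassifying the rare class (label 1) ten times more costly;
# class_weight="balanced" is another common choice.
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

print("balanced error rate:", 1 - balanced_accuracy_score(y_te, y_hat))
print("MCC:", matthews_corrcoef(y_te, y_hat))
```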
A modern approach to the problem is to use Active Learning; for example, Hospedales et al. (2011), "Finding Rare Classes: Active Learning with Generative and Discriminative Models", IEEE Transactions on Knowledge and Data Engineering (TKDE). However, I believe these approaches are still relatively immature.
First of all, before testing you need to define a couple of things: do all classification errors have the same "cost"? Then choose a single measurement parameter. I usually choose MCC for binary data and Cohen's kappa for k-category classification. Next, it is very important to define the minimal difference that is significant in your domain. When I say "significant" I don't mean statistically significant (i.e. p<1e-9), but practically significant. Most of the time an improvement of 0.01% in classification accuracy means nothing, even if it has a nice p-value.
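For reference, both metrics are available off the shelf; here is a tiny sketch assuming scikit-learn, with placeholder label arrays:

```python
# Hedged sketch: the two summary metrics mentioned above; y_true/y_pred are placeholders.
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print("MCC:", matthews_corrcoef(y_true, y_pred))            # binary or multiclass
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))  # k-category agreement
```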
Now you can start comparing the methods. What are you testing: the predictor sets, the model-building process, or the final classifiers? In the first two cases I would generate many bootstrap models using the training-set data and test them on bootstrap samples from the testing-set data. In the last case I would use the final models to predict bootstrap samples from the testing-set data. If you have a reliable way to estimate the noise in the data parameters (predictors), you may also add it to both the training and testing data. The end result will be two histograms of the measurement values, one for each classifier. You may now compare these histograms in terms of mean value, dispersion, etc.
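As an illustrative sketch of the last case (assuming two already-fitted classifiers `clf_a` and `clf_b`, a held-out test set as NumPy arrays, and MCC as the chosen metric; all names are hypothetical):

```python
# Hedged sketch: score a fitted classifier on bootstrap resamples of the test set.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def bootstrap_scores(clf, X_test, y_test, n_boot=1000, seed=0):
    """Return an array of metric values, one per bootstrap resample of the test set."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_test), size=len(y_test))  # sample with replacement
        scores.append(matthews_corrcoef(y_test[idx], clf.predict(X_test[idx])))
    return np.asarray(scores)

# scores_a = bootstrap_scores(clf_a, X_test, y_test)
# scores_b = bootstrap_scores(clf_b, X_test, y_test)
# These two arrays are the "histograms" to compare for location and spread.
```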
Two last notes: (1) I'm not aware of a way to account for model complexity when dealing with classifiers, so better apparent performance may be a result of overfitting. (2) Having two separate data sets is a good thing, but as I understand from your question, you have used both sets many times, which means that testing-set information "leaks" into your models. Make sure you have another, validation data set that is used only once, after you have made all the decisions.
Clarifications following notes
In your notes you said that "previous papers usually present such kind [i.e. 1%] of improvements". I'm not familiar with this field, but the fact that people publish 1% improvements in papers does not mean these improvements are significant :-)
Regarding the t-test, I think it would be a good choice, provided that the data is normally distributed (or can be transformed to a normal distribution) or that you have enough samples, which you most probably will.
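For instance, a paired t-test on the two bootstrap score distributions could look like the sketch below. The score arrays here are synthetic placeholders; in practice, use the arrays from the bootstrap sketch above, computed on the same resampled indices so that the pairing is meaningful:

```python
# Hedged sketch: paired t-test on two bootstrap score distributions (placeholder data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores_a = rng.normal(loc=0.72, scale=0.03, size=1000)   # e.g. MCC of classifier A
scores_b = rng.normal(loc=0.70, scale=0.03, size=1000)   # e.g. MCC of classifier B

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"mean difference = {np.mean(scores_a - scores_b):.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
# A tiny p-value alone does not make a small mean difference practically significant.
```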
Best Answer
In most situations, more data is usually better. Overfitting is essentially learning spurious correlations that occur in your training data, but not in the real world. For example, if you considered only my colleagues, you might learn to associate "named Matt" with "has a beard." It's 100% valid ($n=4$, even!) when considering only the small group of people working on my floor, but it's obviously not true in general. Increasing the size of your data set (e.g., to the entire building or city) should reduce these spurious correlations and improve the performance of your learner.
That said, one situation where more data does not help---and may even hurt---is if your additional training data is noisy or doesn't match whatever you are trying to predict. I once did an experiment where I plugged different language models[*] into a voice-activated restaurant reservation system. I varied the amount of training data as well as its relevance: at one extreme, I had a small, carefully curated collection of people booking tables, a perfect match for my application. At the other, I had a model estimated from a huge collection of classic literature, a more accurate language model, but a much worse match to the application. To my surprise, the small-but-relevant model vastly outperformed the big-but-less-relevant model.
A surprising situation, called **double descent**, also occurs when the size of the training set is close to the number of model parameters. In these cases, the test risk first decreases as the size of the training set increases, transiently *increases* when a bit more training data is added, and finally begins decreasing again as the training set continues to grow. This phenomenon was reported 25 years ago in the neural network literature (see Opper, 1995), but it occurs in modern networks too ([Advani and Saxe, 2017][1]). Interestingly, this happens even for linear regression, albeit one fit by SGD ([Nakkiran, 2019][2]). This phenomenon is not yet totally understood and is largely of theoretical interest: I certainly wouldn't use it as a reason not to collect more data (though I might fiddle with the training set size if $n \approx p$ and the performance were unexpectedly bad).
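For the curious, the effect is easy to reproduce in a toy setting. The sketch below (my own illustration, not taken from the papers cited above) uses the minimum-norm least-squares solution, which SGD from a zero initialization converges to in the underdetermined regime, and shows the test error spiking as the training-set size passes through $n = p$:

```python
# Hedged sketch: double descent in the training-set size for a linear model.
import numpy as np

rng = np.random.default_rng(0)
p = 50                                          # number of model parameters
w_true = rng.normal(size=p)
X_test = rng.normal(size=(2000, p))
y_test = X_test @ w_true + rng.normal(scale=0.5, size=2000)

for n in [10, 25, 45, 50, 55, 75, 150, 500]:    # training-set sizes around n = p
    X = rng.normal(size=(n, p))
    y = X @ w_true + rng.normal(scale=0.5, size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm solution
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"n = {n:4d}  test MSE = {test_mse:.2f}")
```

The test error is typically worst near $n = p$, where the fitted model is forced to interpolate a nearly singular system, and improves again as $n$ grows well beyond $p$.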
[*]A language model is just the probability of seeing a given sequence of words, e.g. $P(w_n = \textrm{'quick'}, w_{n+1} = \textrm{'brown'}, w_{n+2} = \textrm{'fox'})$. They're vital to building halfway decent speech/character recognizers.
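A toy illustration of that probability, using maximum-likelihood bigram estimates on a made-up corpus (purely for intuition; real language models are far more sophisticated):

```python
# Hedged sketch: a tiny bigram estimate of P('quick brown fox') from a toy corpus.
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the quick brown cat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sequence):
    """P(w1) * prod over i of P(w_i | w_{i-1}), with maximum-likelihood estimates."""
    p = unigrams[sequence[0]] / len(corpus)
    for prev, word in zip(sequence, sequence[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p

print(bigram_prob(["quick", "brown", "fox"]))   # roughly 0.08 on this toy corpus
```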