First of all, before testing you need to define a couple of things: do all classification errors have the same "cost"? Then choose a single measurement parameter; I usually choose MCC for binary data and Cohen's kappa for k-category classification. Next, it is very important to define the minimal difference that is significant in your domain. When I say "significant" I don't mean statistically significant (i.e. p<1e-9), but practically significant. Most of the time an improvement of 0.01% in classification accuracy means nothing, even if it has a nice p-value.
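As a rough sketch, both measures can be computed directly from a confusion matrix; the function names and the pred/truth vectors below are just placeholders:

# MCC for a binary problem; assumes the positive class is the second factor level
mcc_binary <- function(pred, truth) {
  cm <- table(truth, pred)                      # 2 x 2 confusion matrix
  tp <- cm[2, 2]; tn <- cm[1, 1]
  fp <- cm[1, 2]; fn <- cm[2, 1]
  (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}

# Cohen's kappa for k classes
cohens_kappa <- function(pred, truth) {
  cm <- table(truth, pred)                      # k x k confusion matrix
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                       # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
  (po - pe) / (1 - pe)
}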
Now you can start comparing the methods. What are you testing: the predictor sets, the model building process, or the final classifiers? In the first two cases I would generate many bootstrap models using the training set data and test them on bootstrap samples from the testing set data. In the last case I would use the final models to predict bootstrap samples from the testing set data. If you have a reliable way to estimate noise in the data parameters (predictors), you may also add this noise to both training and testing data. The end result will be two histograms of the measurement values, one for each classifier, which you can then compare for mean value, dispersion, etc., as sketched below.
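A minimal sketch of the last case (comparing two final classifiers), assuming fit1 and fit2 are already-trained models, test_dat is the held-out test set with outcome column Y, and metric() is whichever measure you picked above (all names are hypothetical):

set.seed(1)
B <- 2000
scores1 <- numeric(B); scores2 <- numeric(B)
for (b in 1:B) {
  idx  <- sample(nrow(test_dat), replace = TRUE)   # bootstrap sample of the test set
  boot <- test_dat[idx, ]
  scores1[b] <- metric(predict(fit1, boot), boot$Y)
  scores2[b] <- metric(predict(fit2, boot), boot$Y)
}
hist(scores1); hist(scores2)                       # the two histograms to compare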
Two last notes: (1) I'm not aware of a way to account for model complexity when dealing with classifiers, so better apparent performance may simply be the result of overfitting. (2) Having two separate data sets is a good thing, but as I understand from your question, you used both sets many times, which means that testing set information "leaks" into your models. Make sure you have another, validation data set that is used only once, after you have made all the decisions.
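If you do not have a validation set yet, a three-way split can be made up front in base R (the proportions here are arbitrary, and dat is assumed to be your full data frame):

set.seed(1)
grp <- sample(c("train", "test", "validation"), nrow(dat),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_dat      <- dat[grp == "train", ]
test_dat       <- dat[grp == "test", ]
validation_dat <- dat[grp == "validation", ]   # touch this only once, at the very end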
Clarifications following notes
In your notes you said that "previous papers usually present such kind [i.e. 1%] of improvements". I'm not familiar with this field, but the fact that people publish 1% improvements in papers does not mean this improvement is significant :-)
Regarding the t-test, I think it would be a good choice, provided that the data is normally distributed (or can be transformed to be) or that you have enough data samples, which you most probably will.
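With the two bootstrap score vectors from the sketch above (scores1 and scores2 are hypothetical names), that would look like:

# quick check of the normality assumption on the paired differences
shapiro.test(scores1 - scores2)
# paired t-test, since both classifiers were scored on the same bootstrap samples
t.test(scores1, scores2, paired = TRUE)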
I think you are describing nested cross-validation, and you can use it to select your best hyperparameters. R already has some packages implementing this; for example, for support vector machines you could use the package e1071 and do something like this, assuming you have two independent variables:
library(e1071)
# sweep gamma and cost for a radial-kernel SVM, scored by cross-validation
svmTuning <- tune.svm(Y ~ X1 + X2, type = "nu-regression", kernel = "radial",
                      data = dat,
                      gamma = seq(from = 0, to = 3, by = 0.1),
                      cost = seq(from = 2, to = 16, by = 2),
                      tunecontrol = tune.control(sampling = "cross", cross = 1000))
If you had 1000 observations, the previous call would perform leave-one-out cross-validation, sweeping through all combinations of the selected gammas and costs (but only one kernel in this case). You can see the best parameters by doing:
svmTuning$best.parameters
I'm pretty sure the optimal combination is chosen using the mean squared error computed from the cross-validation you chose (in the case of regression) or the average classification error (in the case of classification).
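You can inspect that error directly on the tune object from above:

svmTuning$best.performance    # error of the best parameter combination
head(svmTuning$performances)  # error (and dispersion) for every combination tried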
Here's another example with kernel k-nearest neighbours, using train.kknn from the kknn package:
library(kknn)
# leave-one-out search over k (up to kmax) and the listed kernels,
# using the Euclidean distance (distance = 2)
knnTuning <- train.kknn(Y ~ X1 + X2, data = dat, kmax = 40, distance = 2,
                        kernel = c("rectangular", "triangular", "epanechnikov",
                                   "gaussian", "rank", "optimal"),
                        ykernel = NULL, scale = TRUE)
This sweeps through all combinations of numbers of neighbours up to 40 and the different kernels, but uses only the Euclidean distance (distance = 2). You can plot all these results and again obtain the best parameters:
plot(knnTuning)
knnTuning$best.parameters
You could do the same for a random forest (tune.randomForest is also in e1071):
library(randomForest)
# sweep mtry (candidate variables per split) with cross-validation;
# mtry cannot exceed the number of predictors in the formula (two here)
rfTuning <- tune.randomForest(Y ~ X1 + X2, data = dat, ntree = 1000,
                              mtry = 1:2,
                              tunecontrol = tune.control(sampling = "cross", cross = 1000))
Here you just sweep through the possible values for the number of variables considered as candidates at each split. This is known to overfit if not done carefully.
And so on and so forth. Since you appear to have a small sample size, maybe leave-one-out is the way to go. You may also look into the caret package, which has good capabilities for model building, and its documentation is very solid (theoretical descriptions and all).
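A minimal caret sketch in the same spirit, assuming the same dat with outcome Y; the tuning grid here is purely illustrative:

library(caret)
ctrl <- trainControl(method = "LOOCV")            # leave-one-out, as discussed above
svmFit <- train(Y ~ X1 + X2, data = dat,
                method = "svmRadial",             # radial-kernel SVM via kernlab
                tuneGrid = expand.grid(sigma = c(0.1, 0.5, 1),
                                       C = seq(2, 16, by = 2)),
                trControl = ctrl)
svmFit$bestTune                                   # best sigma/C combination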
Best Answer
I think it depends on what you mean by "training time". If you think about cross-validation, model selection, etc., a classifier that takes a "long time" to train can become incredibly painful, since you will be training it many times and waiting on the results before you can make further progress.
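A quick back-of-envelope illustration of how that multiplies (the fold and grid sizes are hypothetical):

folds <- 10        # 10-fold cross-validation
grid  <- 30 * 8    # 30 values of one hyperparameter x 8 of another
folds * grid       # 2400 separate training runs before a single comparison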