- You have two biases to remove here: the selection of the initial parameter set and the selection of the train/test data.
So I don't think it is good to compare algorithms based on the same initial parameter set; I would run the evaluation over a few different initial sets for each of the algorithms to get a more general approximation.
The next step is something that you are probably doing already, namely using some kind of cross-validation.
- A t-test is the way to go (I assume that you are getting this RMSE as a mean from cross-validation [and from evaluation over a few different starting parameter sets, supposing you decided to use my first suggestion], so you can also calculate the standard deviation); a fancier method is the Mann-Whitney-Wilcoxon test. A minimal sketch of both is below.
The Wikipedia article about cross-validation is quite nice and has some references worth reading.
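To make the suggestion concrete, here is a minimal sketch (not part of the original answer) of an unpaired comparison of per-fold RMSEs using SciPy. The array names and values are hypothetical placeholders for the RMSEs you would collect from repeated cross-validation runs of each algorithm over a few different initial parameter sets:

```python
# Hedged sketch: compare per-fold RMSEs of two algorithms with a two-sample
# t-test and a Mann-Whitney-Wilcoxon test. Values below are made up.
import numpy as np
from scipy import stats

rmse_algo_a = np.array([0.92, 0.88, 0.95, 0.90, 0.87])  # hypothetical CV fold RMSEs
rmse_algo_b = np.array([0.99, 0.97, 1.02, 0.96, 1.00])

# Welch's (unpaired) t-test on the two sets of RMSE values
t_stat, t_p = stats.ttest_ind(rmse_algo_a, rmse_algo_b, equal_var=False)

# Nonparametric alternative: Mann-Whitney-Wilcoxon test
u_stat, u_p = stats.mannwhitneyu(rmse_algo_a, rmse_algo_b, alternative="two-sided")

print(f"t-test:       t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {u_p:.3f}")
```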
UPDATE AFTER UPDATE: I still think that doing a paired test (Dikran's way) looks suspicious.
One of the linked posts above alludes to using a likelihood ratio test, although your models have to be nested in one another for this to work (i.e. all the parameters in one of the models must be present in the model you are testing it against).
RMSE is clearly a measure of how well the model fits the data. However, so is the likelihood. The likelihood for a given person, say Mrs. Chen, is the probability that a person with all her covariates had the outcome she had. The joint likelihood of the dataset is Mrs. Chen's likelihood * Mrs. Gundersen's likelihood * Mrs. Johnson's likelihood * ... etc.
Adding a covariate, or any number of covariates, can't really make the likelihood worse, but it may improve it by only a non-significant amount. Models that fit better will have a higher likelihood. You can formally test whether model B fits the data better than model A. You should have some sort of LR test function available in whatever software you use, but basically, the LR test statistic is -2 times the difference of the log-likelihoods (reduced model minus full model), and it's distributed chi-square with df equal to the difference in the number of parameters.
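As a hedged illustration of the mechanics only (the log-likelihood values, parameter counts, and names below are made up, and in practice you would normally rely on your software's built-in LR test), the statistic and p-value could be computed like this:

```python
# Minimal sketch of the LR test for two *nested* models; values are hypothetical.
from scipy import stats

loglik_restricted = -1234.7   # model A: fewer parameters (say 5)
loglik_full       = -1228.2   # model B: A's parameters plus 2 extra covariates

lr_stat = -2 * (loglik_restricted - loglik_full)  # = 2 * (logL_full - logL_restricted)
df = 7 - 5                                        # difference in number of parameters
p_value = stats.chi2.sf(lr_stat, df)              # upper-tail chi-square probability

print(f"LR statistic = {lr_stat:.2f}, df = {df}, p = {p_value:.4f}")
```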
Comparing the AIC or BIC of the two models and picking the lowest one is also acceptable. AIC and BIC are basically the log-likelihoods penalized for the number of parameters.
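For completeness, a small sketch of how those penalized criteria are computed (AIC = 2k - 2 logL, BIC = k log(n) - 2 logL), again using the same hypothetical log-likelihoods and a made-up sample size:

```python
# Hedged sketch: AIC/BIC from a model's maximized log-likelihood.
# k = number of estimated parameters, n = sample size; values are hypothetical.
import numpy as np

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return k * np.log(n) - 2 * loglik

n = 500  # hypothetical number of observations
print(aic(-1234.7, 5), aic(-1228.2, 7))        # lower AIC is preferred
print(bic(-1234.7, 5, n), bic(-1228.2, 7, n))  # lower BIC is preferred
```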
I'm not sure about using a t-test for the RMSEs, and I would actually lean against it unless you can find some theoretical work that's been done in the area. Basically, do you know how the values of RMSE are asymptotically distributed? I'm not sure. Some further discussion here:
http://www.stata.com/statalist/archive/2012-11/index.html#01017
As I don't have enough reputation to comment, I'll put this here as an 'answer'.
First of all, I am not very sure what you're looking for. I'll try my best to guess what you mean; please correct me if I'm wrong.
You have two different statistical models, and the two models were fit on two different data sets. You want to compare them for statistical significance.
Short answer: no, you can't.
Long answer: Basically, when you compare two models, you want to know which one has greater explanatory / predictive power with reference to the same data set. You can compare the AICs (asymptotically equivalent to cross-validation error) or even the R^2 values (for linear regressions), but when the models are fit to different data sets it is almost impossible to interpret the comparison. In addition, to my knowledge, most common statistical tests of model fit are designed for nested models only, i.e. your model A is a subset of model B.