Solved – How to compare the accuracy of two different models using statistical significance

classification, machine learning, model evaluation, statistical significance, time series

I am working on time series prediction. I have two data sets $D_1=\{x_1, x_2, \ldots, x_n\}$ and $D_2=\{x_{n+1}, x_{n+2}, \ldots, x_{n+k}\}$. I have three prediction models: $M_1, M_2, M_3$. All of these models are trained on the samples in data set $D_1$, and their performance is measured on the samples in data set $D_2$. Let's say the performance metric is MSE (or anything else). The MSEs of those models measured on data set $D_2$ are $MSE_1$, $MSE_2$, and $MSE_3$. How can I test whether the improvement of one model over another is statistically significant?

For example, let's say $MSE_1=200$, $MSE_2=205$, $MSE_3=210$, and the total number of samples in data set $D_2$ on which those MSEs are calculated is 2000. How can I test whether $MSE_1$, $MSE_2$, and $MSE_3$ are significantly different? I would greatly appreciate any help with this problem.
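For concreteness, here is a minimal sketch of the setup in the question, using synthetic data and hypothetical model predictions (none of these numbers come from the original post); it just computes the three holdout MSEs whose comparison is the subject of the answer below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: y_test holds the 2000 samples of D2, and each "model" is
# represented here only by its vector of predictions on D2.
y_test = rng.normal(size=2000)                         # true values in D2
preds = {
    "M1": y_test + rng.normal(scale=1.0, size=2000),   # hypothetical predictions
    "M2": y_test + rng.normal(scale=1.1, size=2000),
    "M3": y_test + rng.normal(scale=1.2, size=2000),
}

# Per-model MSE on the holdout set, corresponding to MSE_1, MSE_2, MSE_3.
mse = {name: np.mean((y_test - p) ** 2) for name, p in preds.items()}
print(mse)
```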

Best Answer

One of the linked posts above alludes to using a likelihood ratio test, although your models have to be nested in one another for this to work (i.e. all the parameters in one of the models must be present in the model you are testing it against).

RMSE is clearly a measure of how well the model fits the data. However, so is the likelihood. The likelihood for a given person, say Mrs. Chen, is the probability that a person with all her parameters had the outcome she had. The joint likelihood of the dataset is the product of Mrs. Chen's likelihood × Mrs. Gundersen's likelihood × Mrs. Johnson's likelihood × ... etc.
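In symbols (generic notation, not from the original post): writing $\theta$ for the fitted parameters and $y_i$ for observation $i$'s outcome, the joint likelihood is the product of the per-observation likelihoods, so the log-likelihood is a sum:

$$
L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i, \theta), \qquad \log L(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta).
$$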

Adding a covariate, or any number of covariates, can't make the maximized likelihood worse; it may, however, improve it by only a non-significant amount. Models that fit better will have a higher likelihood. You can formally test whether model A fits the data better than model B. You should have some sort of LR test function available in whatever software you use, but basically the LR test statistic is $-2$ times the difference of the log-likelihoods (reduced model minus full model), and it is distributed chi-square with degrees of freedom equal to the difference in the number of parameters.
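As a sketch of that calculation (assuming your models are nested and you can extract each model's maximized log-likelihood; the log-likelihood values below are made up):

```python
from scipy import stats

def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """LR test for nested models: the full model contains every parameter
    of the reduced model plus df_diff extra ones."""
    # -2 * (log-likelihood of reduced model minus log-likelihood of full model)
    lr_stat = -2.0 * (loglik_reduced - loglik_full)
    # Under H0 (the extra parameters are zero) the statistic is asymptotically
    # chi-square with df equal to the difference in parameter count.
    p_value = stats.chi2.sf(lr_stat, df=df_diff)
    return lr_stat, p_value

# Example with hypothetical log-likelihoods (higher, i.e. less negative, fits better).
stat, p = likelihood_ratio_test(loglik_reduced=-1520.3, loglik_full=-1514.8, df_diff=2)
print(stat, p)
```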

Comparing the AIC or BIC of the two models and choosing the lower one is also acceptable. AIC and BIC are basically the log-likelihood penalized for the number of parameters.
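For reference, the standard definitions (with $\hat{L}$ the maximized likelihood, $k$ the number of parameters, and $n$ the number of observations) are

$$
\mathrm{AIC} = 2k - 2\log\hat{L}, \qquad \mathrm{BIC} = k\log n - 2\log\hat{L},
$$

so lower values are better for both, and BIC penalizes extra parameters more heavily once $n > e^2 \approx 7$.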

I'm not sure about using a t-test on the RMSEs, and I would actually lean against it unless you can find theoretical work that has been done in the area. Basically: do you know how RMSE values are asymptotically distributed? I don't. Some further discussion here:

http://www.stata.com/statalist/archive/2012-11/index.html#01017
