Solved – Bootstrap and randomization tests to compare paired data sets

bootstrap, permutation-test

I have 50 values set as "true" in a simulation, and I attempt to recover these "true" parameters using two different models. Here are 3 runs (instead of 50) just to show the data layout (the numbers are made up for illustration).

Run ,  "true" parameter ,  estimate model 1  , estimate model 2
1   ,       10          ,      9.5           ,      9.6
2   ,        8          ,      7.5           ,      8.1
3   ,        7          ,      7.1          ,       7.2

For each of the two models, I have calculated the root mean squared error (where the error is the difference between the "true" value and the estimate), the mean absolute error, and the Pearson correlation (between the 50 "true" values and the 50 estimates). For each accuracy measure, I would like to compare the performance of the two models. Normality assumptions are not satisfied, so I would like to use bootstrap methods.
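For concreteness, here is a minimal sketch in Python (NumPy/SciPy) of how I compute these three measures; the truth/est1/est2 arrays are hypothetical stand-ins for my actual 50 simulation runs, and accuracy_measures is just an illustrative helper name.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data standing in for the 50 simulation runs:
# "true" parameter values and each model's estimates, one entry per run.
rng = np.random.default_rng(0)
truth = rng.uniform(5, 15, size=50)
est1 = truth + rng.normal(0, 0.5, size=50)
est2 = truth + rng.normal(0, 0.4, size=50)

def accuracy_measures(truth, est):
    """RMSE, MAE, and Pearson correlation between truth and estimates."""
    err = est - truth
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r, _ = pearsonr(truth, est)
    return rmse, mae, r

print("Model 1:", accuracy_measures(truth, est1))
print("Model 2:", accuracy_measures(truth, est2))
```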

Since the data are paired, I thought I could resample by runs (with replacement, 50 draws per sample, and as many replications as feasible). For each bootstrap replication I would calculate the statistic of interest for each model and save the ratio of the two (RMSE model 1 / RMSE model 2, for instance), and then determine confidence intervals using the percentile (or another) method. A sketch of what I mean is below.
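A rough sketch of this paired bootstrap, on the same hypothetical arrays (regenerated so the snippet runs on its own); paired_bootstrap_ratio is just an illustrative name:

```python
import numpy as np

# Same hypothetical data as above, regenerated so this snippet is self-contained.
rng = np.random.default_rng(0)
truth = rng.uniform(5, 15, size=50)
est1 = truth + rng.normal(0, 0.5, size=50)
est2 = truth + rng.normal(0, 0.4, size=50)

def rmse(t, e):
    return np.sqrt(np.mean((e - t) ** 2))

def paired_bootstrap_ratio(truth, est1, est2, stat, n_boot=10_000, seed=1):
    """Resample runs with replacement, keeping each run's pair of estimates
    together, and return the bootstrap distribution of stat(model 1)/stat(model 2)."""
    rng = np.random.default_rng(seed)
    n = len(truth)
    ratios = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample run indices with replacement
        ratios[b] = stat(truth[idx], est1[idx]) / stat(truth[idx], est2[idx])
    return ratios

ratios = paired_bootstrap_ratio(truth, est1, est2, rmse)
print("Observed RMSE ratio:", rmse(truth, est1) / rmse(truth, est2))
print("95% percentile CI:", np.percentile(ratios, [2.5, 97.5]))
```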

I would also like to use a direct hypothesis-testing approach via resampling. For this, I would randomly shuffle the model outputs within pairs (so a run's estimates from model 1 and model 2 would be swapped, for instance) and then calculate my various ratios as before. This would give the range expected under a null hypothesis of interchangeability, which I could compare to my observed values to obtain a p-value (sketched below).
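A sketch of that within-pair permutation test, again on the same hypothetical arrays; here I use the absolute log-ratio of RMSEs as a symmetric two-sided test statistic (a minor variation on the raw ratio), and within_pair_permutation_test is just an illustrative name:

```python
import numpy as np

# Same hypothetical data again, so this snippet is self-contained.
rng = np.random.default_rng(0)
truth = rng.uniform(5, 15, size=50)
est1 = truth + rng.normal(0, 0.5, size=50)
est2 = truth + rng.normal(0, 0.4, size=50)

def rmse(t, e):
    return np.sqrt(np.mean((e - t) ** 2))

def within_pair_permutation_test(truth, est1, est2, stat, n_perm=10_000, seed=2):
    """Randomly swap the two models' estimates within each run (null hypothesis:
    the models are interchangeable) and compare the observed log-ratio of the
    statistic to its permutation distribution (two-sided)."""
    rng = np.random.default_rng(seed)
    observed = np.log(stat(truth, est1) / stat(truth, est2))
    null = np.empty(n_perm)
    for p in range(n_perm):
        swap = rng.random(len(truth)) < 0.5        # runs whose labels get swapped
        a = np.where(swap, est2, est1)
        b = np.where(swap, est1, est2)
        null[p] = np.log(stat(truth, a) / stat(truth, b))
    # proportion of permuted log-ratios at least as extreme as the observed one
    return (np.sum(np.abs(null) >= np.abs(observed)) + 1) / (n_perm + 1)

print("Two-sided p-value (RMSE):", within_pair_permutation_test(truth, est1, est2, rmse))
```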

I have ordered Efron and Tibshirani's An Introduction to the Bootstrap and will read the parts that apply, but I have not found much literature on exactly what I would like to attempt. Perhaps I am looking in the wrong place, or it is an obvious dead end.

I am open to any advice, and to hearing about any obvious flaws in my approach.

Thank you

Best Answer

Your permutation test is a correct way to test whether there is a difference in the quality of the two models. I don't understand enough about your bootstrap approach to know whether it is correct or not. Another book to consider is Bootstrap Methods and their Application by Davison and Hinkley. It is a bit more recent and, I believe, more applied than Efron and Tibshirani.