I have evaluated two methods on a test set with $n=4252$ samples. Each sample gets a score from the two methods and the scores are in the range $[0,1]$.
The distribution of the scores of method 1 and method 2 are plotted in the following picture, along with the mean and median values:
The sample means of method 1 and 2 are calculated as $\mu_1=0.75$, $\mu_2=0.824$ and the sample standard deviations are $\sigma_1=0.236$, $\sigma_2=0.194$.
I have not done many significance tests before, so I am a bit lost as to how to proceed here and which method to apply, especially because the distributions do not seem to follow a normal distribution.
My null hypothesis $H_0$ is that "Method 1 and Method 2 do not differ significantly"; the alternative hypothesis $H_a$ is that "Method 2 is better than Method 1".
How can I test this?
Edit:
I think I need to be a bit more explicit here. I have the ground truth data for all the $n$ samples. Method 1 and Method 2 do something to every single sample and use a score function $f$ to compare the result with the ground truth. A score of 1 means that the method worked perfectly on this sample; a score of 0 means that the method failed completely. I think the details of how $f$ works, and the fact that it needs ground truth data for the comparison with the method's result, are not relevant here. If it helps, let us just say that some wise entity tells us with absolute certainty for each sample how well method 1 and method 2 worked; that is, this entity gives us two numbers between 0 and 1 for each sample, and this is all we work with.
Now I run the two methods separately on all $n$ samples so that I get $n$ scores for method 1 and $n$ scores for method 2. The distribution of the scores is plotted in the figure above.
One of the two methods should be used in a production environment, so I have to decide which one is better. The ideal method would return a score of 1 for all samples. So if, for example, method 1 always returned a score of 0.1 and method 2 always returned 0.9, it would be easy to say that method 2 is preferable. But how can I go on from here? I honestly don't care whether medians, means, or whole distributions are compared. All I want to find out is which method performs better.
Link to scores of method 1:
https://pastebin.com/jkdBEJM1
Link to scores of method 2:
https://pastebin.com/1S4zYAaa
Edit 2:
Here is the difference plot @NickCox suggested:
For each sample $i\in\{1,\dots,n\}$, the two scores $s_1^{(i)}$ and $s_2^{(i)}$ of method 1 and method 2 are used to calculate the difference $d_i=s_2^{(i)} - s_1^{(i)}$ and the mean $m_i=\left(s_1^{(i)}+s_2^{(i)}\right)/2$. The point $(m_i, d_i)$ is then plotted.
However, this plot doesn't make me much smarter.
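For reference, the quantities behind this difference plot can be computed in a few lines of numpy. This is a minimal sketch; the arrays `s1` and `s2` are placeholders standing in for the real pastebin scores, not the actual data:

```python
# Minimal numpy sketch of the mean-difference plot described above.
# ASSUMPTION: s1 and s2 are synthetic placeholders for the real pastebin
# scores; substitute the actual data before drawing any conclusions.
import numpy as np

rng = np.random.default_rng(42)
n = 4252
s1 = rng.beta(3.0, 1.0, n)                                # placeholder method-1 scores
s2 = np.clip(s1 + rng.normal(0.07, 0.10, n), 0.0, 1.0)    # placeholder method-2 scores

d = s2 - s1          # difference d_i = s2_i - s1_i
m = (s1 + s2) / 2    # per-sample mean m_i

# Plotting m on the x-axis against d on the y-axis (e.g. with matplotlib)
# reproduces the figure; whether the cloud sits above or below zero shows
# whether method 2 is systematically ahead.
print(f"mean difference = {d.mean():.4f}")
```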
Best Answer
By and large, with this kind of sample size marginal distributions hardly matter to comparisons of means (and the fact that you have paired observations helps too, as you are looking at the pattern of differences).
Backing up first with some discursive remarks: It is common in many fields to compare measurements using two or more methods. When there is some gold standard or benchmark measurement taken to be true, we can say which method is best by comparison. Otherwise we can always look at the extent of agreement.
What I think requires some adjustment of thought here is that we have, as I understand it, not the original measurements (declared irrelevant!) but scores of how well each method did. So agreement of those scores doesn't necessarily mean that the methods agree, just that they are equally good or bad. That difference seems to make some methods for comparing methods less useful than they are with original measurements, or at least to imply that they need to be thought about differently.
We are comparing two distributions but values are paired and that structure needs to be respected. Assuming that I have interpreted the data files correctly, we can draw scatter plots too. Here are the original data:
It is hard to see what is going on there, even with some use of transparency, but the results are naturally consistent with the strongly skewed distributions shown by the OP.
I followed up my own suggestion of fourth powers (which arose from some earlier simulations with loosely similar beta distributions). The fourth power respects the interval $[0, 1]$ as $0^4 = 0$, $1^4 = 1$ and the transformation does not fall over if either extreme ($0$ or $1$) is met. More crucially, it does a reasonable job of making the distributions more symmetric and so a little easier to think about. People might say this transformation is ad hoc and they would be right: translated in 21st century terms, that means "fit for purpose".
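A quick numeric illustration of why the fourth power helps. This sketch uses a Beta(5, 1) sample as a stand-in for the real left-skewed scores (an assumption, not the OP's data) and compares moment-based skewness before and after transforming:

```python
# Illustration of the fourth-power transformation: it maps [0, 1] onto
# [0, 1] (0^4 = 0, 1^4 = 1) and pulls a left-skewed score distribution
# toward symmetry. ASSUMPTION: Beta(5, 1) is a synthetic stand-in for
# the real scores.
import numpy as np

def skewness(x):
    """Moment-based sample skewness."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return (z ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
scores = rng.beta(5.0, 1.0, 10_000)   # strongly left-skewed, like the plotted scores
transformed = scores ** 4             # still in [0, 1], noticeably more symmetric

print(f"skewness before: {skewness(scores):.3f}, after: {skewness(transformed):.3f}")
```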
Again, it is hard to see much structure there. Although not shown here, various smoothers don't help either. This is good news in so far as all the useful information is implied to be in distribution plots, to which we now turn. Here I use quantile-box plots in which quantile plots (each value against its rank) have median and quartile boxes superimposed. I also add lines for means.
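For readers who want to reproduce this kind of display, here is a numpy sketch of the ingredients of a quantile-box plot: each sorted value against its plotting position, with the quartiles and mean to be overlaid as horizontal lines. Placeholder Beta scores again stand in for the real data:

```python
# Ingredients of a quantile-box plot: sorted scores against plotting
# positions, plus quartiles and mean for the superimposed box and line.
# ASSUMPTION: the Beta(5, 1) sample is a stand-in for the real scores.
import numpy as np

rng = np.random.default_rng(1)
scores = np.sort(rng.beta(5.0, 1.0, 4252))
n = scores.size
positions = (np.arange(1, n + 1) - 0.5) / n   # plotting positions in (0, 1)

q1, median, q3 = np.quantile(scores, [0.25, 0.50, 0.75])
mean = scores.mean()

# Plotting `positions` (x) against `scores` (y), with horizontal lines at
# q1, median, q3 and the mean, reproduces a quantile-box plot.
print(f"q1={q1:.3f} median={median:.3f} q3={q3:.3f} mean={mean:.3f}")
```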
As before we look at transformed scales too.
In short, the conclusion that method 2 is systematically better than method 1 is confirmed as robust under use of mean or median and under use of original and strongly transformed scales.
On the question of significance tests, I suggest that a general solution is to convert this to a different but related problem: find a confidence interval for the mean difference, that mean being something we can calculate on whatever scale we please.
I did this in Stata. The code should seem not too cryptic to those enamoured of other software.
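The Stata code itself is not reproduced here, but as a rough Python equivalent (an assumption, not a translation of the original), a percentile bootstrap of the mean paired difference looks like this, with synthetic placeholder data in place of the pastebin files:

```python
# Percentile-bootstrap 95% confidence interval for the mean paired
# difference. ASSUMPTION: s1 and s2 are synthetic placeholders; replace
# them with the real method-1 and method-2 scores.
import numpy as np

rng = np.random.default_rng(2024)
n = 4252
s1 = rng.beta(3.0, 1.0, n)                                 # placeholder method-1 scores
s2 = np.clip(s1 + rng.normal(0.074, 0.15, n), 0.0, 1.0)    # placeholder method-2 scores

d = s2 - s1                                # paired differences
B = 2_000                                  # bootstrap resamples
idx = rng.integers(0, n, size=(B, n))      # resample indices with replacement
boot_means = d[idx].mean(axis=1)           # mean difference in each resample

lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"mean difference = {d.mean():.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

The same resampling can be done on transformed scales (e.g. fourth powers) by transforming `s1` and `s2` before taking differences, which is the sense in which the mean can be calculated "on whatever scale we please".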
Firing all possible shotguns, it can be seen that no reasonable confidence interval goes anywhere near 0, so even extreme sceptics should be convinced that means differ.