Let's step back and look at what the data would look like. From what you describe, you have 3 algorithms (i.e. groups or treatments) and 10 datasets (i.e. subjects). In this case, you have a within-subjects design (i.e. repeated measures) with one factor. One way to represent this is:
set.seed(123)
df <- data.frame(dataset   = rep(seq(10), 3),
                 algorithm = rep(c("ML1", "ML2", "ML3"), each = 10),
                 Accuracy  = runif(30))
> df
dataset algorithm Accuracy
1 1 ML1 0.28757752
2 2 ML1 0.78830514
3 3 ML1 0.40897692
4 4 ML1 0.88301740
5 5 ML1 0.94046728
6 6 ML1 0.04555650
7 7 ML1 0.52810549
8 8 ML1 0.89241904
9 9 ML1 0.55143501
10 10 ML1 0.45661474
11 1 ML2 0.95683335
12 2 ML2 0.45333416
13 3 ML2 0.67757064
14 4 ML2 0.57263340
15 5 ML2 0.10292468
16 6 ML2 0.89982497
17 7 ML2 0.24608773
18 8 ML2 0.04205953
19 9 ML2 0.32792072
20 10 ML2 0.95450365
21 1 ML3 0.88953932
22 2 ML3 0.69280341
23 3 ML3 0.64050681
24 4 ML3 0.99426978
25 5 ML3 0.65570580
26 6 ML3 0.70853047
27 7 ML3 0.54406602
28 8 ML3 0.59414202
29 9 ML3 0.28915974
30 10 ML3 0.14711365
You will typically see examples that use 'subject' as a label. In your case, your 'subjects' are your datasets. If you could assume normality, you would do a repeated-measures ANOVA. However, you state that you know the accuracies are not normally distributed, so you naturally want a non-parametric method. Your data are also balanced (10 samples per group), so you can use the Friedman test (which is essentially a non-parametric repeated-measures ANOVA).
If you get a significant p-value from the test, you would do a post-hoc analysis with pairwise paired Wilcoxon signed-rank tests, with some sort of correction (e.g. Bonferroni, Holm, etc.). You would not use Mann-Whitney because you have paired/repeated-measures data.
Lastly, you probably want the effect size of any significant differences. This also uses the Wilcoxon test. I cannot recall an R function for it right now, but the equation is very simple:
$$r=\frac{Z}{\sqrt{N}}$$
where Z is the Z-score and N is the total sample size across the two groups being compared. You can get this Z-score using the wilcoxsign_test function from the coin package.
Using the above data, this can be done in R as follows. Please note that the above data were just randomly generated, so there is no significance; this is only for demonstrating the code:
# Friedman test
friedman.test(Accuracy ~ algorithm | dataset, data = df)

# Post-hoc tests with Bonferroni correction
with(df, pairwise.wilcox.test(Accuracy, algorithm,
                              p.adjust.method = "bonferroni", paired = TRUE))

# Get the Z-score for calculating the effect size (here ML1 vs ML2)
library(coin)
wilcoxsign_test(Accuracy ~ factor(algorithm) | factor(dataset),
                data = subset(df, algorithm %in% c("ML1", "ML2")))

# Calculate the effect size; in this case Z = -0.2548,
# and the two groups together contain N = 20 observations
0.2548 / sqrt(20)
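If you find yourself computing this repeatedly, the formula is easy to wrap in a small helper. Note this function is my own convenience sketch, not part of any package:

```r
# Effect size r = |Z| / sqrt(N), where Z is the Z-score from the
# Wilcoxon signed-rank test and N is the total number of observations
# in the two groups being compared
wilcox_r <- function(z, n) abs(z) / sqrt(n)

wilcox_r(-0.2548, 20)  # ~0.057, well below Cohen's 0.1 threshold for a small effect
```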
From what I understood of the OP's question:
1) He ran an omnibus Kruskal-Wallis test with significant results.
2) He wants to run a pairwise test on all groups and is in doubt whether to use Mann-Whitney or Dunn's test.
3) He wants to run his own multiple-comparison adjustment procedure, so he needs the uncorrected p-values of each pairwise comparison.
The source of confusion is that the Dunn test implemented in GraphPad seems to already include a multiple-comparison adjustment (which looks like a Bonferroni adjustment - see http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_nonparametric_multiple_compari.htm).
Answering:
2) You should use the Dunn test. Both the CV answer by @Alexis to Post-hoc tests after Kruskal-Wallis: Dunn's test or Bonferroni corrected Mann-Whitney tests? and this page from XLSTAT http://www.xlstat.com/en/products-solutions/feature/kruskal-wallis-test.html agree that Dunn (or Conover-Iman or Steel-Dwass-Critchlow-Fligner) is the appropriate post-hoc test after a KW test (disclosure - I did not know that until today; I had been using Mann-Whitney as the post-hoc to KW until now).
3) I did not understand the GraphPad page, but let me point you to the dunn.test package in R, which does what the OP wants. In particular, it distinguishes the Dunn test from the multiple-comparison adjustments, and one can set the adjustment method to "none", which will return the unadjusted p-values.
Also notice that among the adjustment procedures are the Benjamini-Hochberg (1995) and the Benjamini-Yekutieli (2001) adjustments, which are FDR procedures (maybe one of them is the one the OP is thinking of using).
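To illustrate, here is a small sketch using the dunn.test package on made-up data (the group labels and values are hypothetical, just to show the method argument):

```r
library(dunn.test)

# Hypothetical data: 30 observations in 3 groups
set.seed(1)
x <- c(rnorm(10, 0), rnorm(10, 0.5), rnorm(10, 1))
g <- rep(c("A", "B", "C"), each = 10)

dunn.test(x, g, method = "none")  # unadjusted p-values for your own procedure
dunn.test(x, g, method = "bh")    # Benjamini-Hochberg FDR adjustment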
Let me stress what many of the commenters have been saying: there is no good reason to use the unadjusted p-values EXCEPT to implement your own adjustment procedure - no decision should be made based on the unadjusted p-values.
Best Answer
It is certainly fine to do pairwise chi-square tests, but that isn't the only possibility. Another is to fit a generalized linear model and follow it up with pairwise comparisons of its predictions. In R, it goes something like this:
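The original code block does not appear to have survived here, so the following is only a sketch. The counts (successes and failures for three hypothetical groups) are made up; I am assuming the model was fit with glm:

```r
# Hypothetical counts for three groups (the original data were not shown)
counts <- data.frame(group   = factor(c("A", "B", "C")),
                     success = c(34, 41, 52),
                     failure = c(66, 59, 48))

# Logistic regression: models log-odds of success for each group
fit <- glm(cbind(success, failure) ~ group, family = binomial, data = counts)
summary(fit)
```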
This fits a logistic regression model for predicting $\log\{p_i/(1-p_i)\}, i=1,2,3$. A chi-squared test (not the same as the Pearson chi-square, but similar) for $H_0:p_1=p_2=p_3$ is obtained via
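The code for this test is also missing from the answer as preserved; one standard way to obtain such a likelihood-ratio chi-squared test is to compare against the intercept-only model. This sketch reuses the same hypothetical counts, so its statistic will not match the value quoted from the original data:

```r
# Hypothetical counts (the original data were not shown)
counts <- data.frame(group   = factor(c("A", "B", "C")),
                     success = c(34, 41, 52),
                     failure = c(66, 59, 48))
fit  <- glm(cbind(success, failure) ~ group, family = binomial, data = counts)
null <- update(fit, . ~ 1)  # intercept-only model: p1 = p2 = p3

# Likelihood-ratio chi-squared test of H0: p1 = p2 = p3, with 2 d.f.
lrt <- anova(null, fit, test = "Chisq")
lrt
```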
so that the test statistic is $\chi^2 = 4.55$ with 2 d.f.
The post-hoc estimates and comparisons are done in a manner similar to that for ordinary ANOVA models:
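The code for this step was not preserved either. With the emmeans package (the successor to lsmeans, which this answer's terminology suggests was used), a sketch might look like this, again with hypothetical counts:

```r
library(emmeans)

# Hypothetical counts and model (the original data were not shown)
counts <- data.frame(group   = factor(c("A", "B", "C")),
                     success = c(34, 41, 52),
                     failure = c(66, 59, 48))
fit <- glm(cbind(success, failure) ~ group, family = binomial, data = counts)

# Estimated marginal means on the log-odds scale, plus pairwise
# contrasts with the default Tukey adjustment
emmeans(fit, pairwise ~ group)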
The least-squares means (first table) are predictions from the model for $\log\{p_i/(1-p_i)\}$ and the contrasts are pairwise comparisons of these quantities. Alternatively, you could back-transform these results and obtain estimates of the $p_i$ themselves, and of the odds ratios $\frac{p_i}{1-p_i}/\frac{p_j}{1-p_j}$:
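Continuing the hedged emmeans sketch above (same hypothetical model), the back-transformation is a one-argument change:

```r
# type = "response" back-transforms: the means become estimated
# probabilities p_i, and the pairwise contrasts become odds ratios
emmeans(fit, pairwise ~ group, type = "response")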
The advantage of this approach is that you obtain comparisons of meaningful quantities, rather than just chi-squares and $P$ values. The Tukey adjustment on the comparisons is only approximate; but then, so are the results of pairwise chi-squared tests, and the Bonferroni correction is more conservative.