R – How to Use pwr.t.test for Mann-Whitney U Test for Accurate Statistical Power

Tags: r, statistical-power, wilcoxon-mann-whitney-test

I'm working with (un-paired/independent) historic environmental data collected over 2 consecutive months that I compared for each calendar year (CYR). I'm wondering if the high variability between February and March is due to small-ish sample sizes, so as a thought experiment, I'd like to know what sample size I would need per group (Month) in order to compare medians (water temp, salinity, etc.) between months in future fieldwork.

Using the pwr package in R…

> pwr.t.test(d=0.7, sig.level = 0.05, power = 0.80, type = "two.sample", alternative = "two.sided") 

     Two-sample t test power calculation 

              n = 33.02457
              d = 0.7
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
    
round(33.02457*1.15,0) # parametric + 15% approach
# N per group should be 38 for future studies

My questions:

  • What does "d=0.7" mean? "A large/the largest effect size" due to randomness in each sample? The smaller this is, the better the chance is that any difference I see will not be due to random variation? Do I set this to what I want it to be or what it's been in the past (For example: (0.2 + 0.705)/2 = about 0.5 average effect size for Feb. vs March)?

  • Is the "parametric + 15% approach" (As seen here on page 52) valid for the Mann Whitney U / Wilcoxon Rank-Sum Test?

My data:

> dput(dry_high_samplesize)
structure(list(use_for_analysis = structure(c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 
2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 
1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L), levels = c("Pre_SAV", "Standard"
), class = "factor"), CYR = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 3L, 2L, 2L, 2L, 1L, 4L, 4L, 4L, 1L, 4L, 3L, 2L, 2L, 3L, 
4L, 2L, 1L, 4L, 1L, 4L, 4L, 4L, 3L, 1L, 4L, 2L, 2L, 2L, 3L, 2L, 
1L, 4L, 3L, 4L, 4L, 3L, 1L, 1L, 1L, 4L, 1L, 3L, 2L, 4L, 2L, 3L, 
3L, 2L, 2L, 2L, 4L, 2L, 1L, 1L, 4L, 1L, 4L, 2L, 4L, 1L, 3L, 1L, 
3L, 1L, 2L, 4L, 3L, 2L, 2L, 2L, 1L, 4L, 2L, 1L, 4L, 1L, 2L, 3L, 
4L, 3L, 4L, 2L, 1L, 1L, 3L, 3L, 2L, 1L, 4L, 3L, 2L, 1L, 2L, 2L, 
4L, 4L, 1L, 3L, 1L, 3L, 4L, 2L, 1L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 
1L, 2L, 2L, 2L, 3L, 4L, 4L, 1L, 4L, 2L, 1L, 3L, 1L, 3L, 2L, 4L, 
2L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 4L, 3L, 1L, 2L, 4L, 2L, 2L, 4L, 
4L, 1L, 3L, 3L, 4L, 3L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 4L, 4L, 
3L, 2L, 2L, 3L, 1L, 1L, 3L, 1L, 2L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 
3L, 3L, 1L, 3L, 3L, 1L), levels = c("2006", "2015", "2016", "2018"
), class = "factor"), Season = c("DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", 
"DRY", "DRY", "DRY", "DRY"), Month = c(3, 2, 3, 2, 2, 3, 2, 3, 
4, 4, 4, 3, 3, 2, 3, 2, 3, 3, 4, 4, 4, 4, 3, 3, 2, 2, 3, 2, 3, 
3, 4, 3, 3, 4, 4, 3, 4, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 4, 
4, 3, 3, 4, 4, 3, 3, 4, 2, 3, 3, 2, 3, 2, 3, 4, 2, 3, 4, 2, 4, 
3, 3, 2, 3, 3, 4, 3, 3, 3, 3, 2, 2, 2, 4, 4, 3, 3, 3, 3, 3, 2, 
4, 3, 4, 3, 2, 4, 3, 3, 3, 3, 2, 3, 2, 4, 2, 3, 3, 3, 2, 3, 3, 
4, 4, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 2, 4, 2, 3, 4, 2, 
3, 3, 2, 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, 2, 3, 3, 3, 3, 
4, 4, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 4, 2, 2, 3, 3, 3, 2, 
3, 4, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3), Site = c(17, 46, 27, 37, 
45, 16, 47, 26, 23, 17, 9, 47, 16, 44, 15, 36, 17, 25, 6, 8, 
16, 22, 8, 40, 31, 35, 18, 43, 14, 24, 21, 15, 7, 15, 6, 31, 
13, 41, 14, 42, 41, 34, 23, 47, 47, 30, 19, 40, 39, 20, 14, 6, 
21, 5, 12, 39, 46, 7, 33, 30, 13, 29, 13, 38, 22, 13, 41, 20, 
4, 46, 19, 8, 20, 39, 46, 45, 5, 38, 12, 12, 29, 37, 32, 28, 
12, 3, 5, 40, 21, 19, 21, 45, 18, 45, 4, 7, 38, 11, 28, 11, 37, 
44, 31, 4, 27, 2, 36, 27, 20, 18, 44, 39, 22, 3, 10, 34, 11, 
44, 10, 27, 36, 43, 17, 3, 11, 6, 19, 10, 26, 1, 35, 38, 2, 30, 
26, 26, 43, 9, 35, 33, 43, 23, 10, 16, 5, 25, 2, 42, 1, 18, 29, 
9, 37, 42, 9, 15, 8, 25, 25, 24, 34, 42, 34, 32, 1, 28, 24, 23, 
33, 14, 33, 41, 31, 3, 22, 24, 36, 7, 40, 32, 32, 2, 30, 35, 
1, 29, 28, 4), temp = c(24.7, 24.7, 24.3, 24.8, 24.2, 24.6, 24.1, 
24.6, 25.8, 23.2, 23.7, 25.8, 18.66, 25.7, 24.8, 24.6, 21.36, 
24, 24.7, 24, 23.3, 25.7, 22.5, 24.8, 25.03, 24.9, 21.58, 25.6, 
24.7, 24.5, 25.9, 19.24, 23.4, 23.2, 25.3, 26.3, 22.5, 25, 19.32, 
24.5, 23.2, 25.7, 24.8, 26, 23.6, 25.57, 21.95, 27.1, 24.9, 24.6, 
23.8, 24.2, 26.1, 24.7, 22.9, 27.4, 26.3, 25.2, 25.4, 26.4, 19.48, 
25.82, 25, 25.15, 25.2, 24.1, 25.8, 23.04, 24.6, 24.18, 26, 22.85, 
26, 27.3, 26.9, 26.6, 25, 28.4, 19.79, 25.3, 26.3, 25.72, 24.8, 
25.29, 24.1, 25.1, 25.1, 23.1, 25.7, 26.2, 23.82, 23.9, 26.1, 
27, 25.8, 23.37, 28.5, 23.9, 26.2, 20.55, 26.6, 26.2, 25.4, 24.8, 
26.04, 25.3, 25.88, 28.6, 25.5, 26.8, 24.51, 23.7, 24.02, 25.9, 
23.3, 28.2, 25, 26.3, 21, 26.7, 28.6, 28, 25.9, 25.5, 25.8, 23.79, 
26.1, 24.7, 27.16, 25.5, 26.97, 23.7, 26.2, 25.8, 27.2, 29.9, 
24.93, 24.5, 28.6, 28.3, 27.4, 24.17, 25.8, 26.1, 23.66, 26.6, 
24.5, 28.1, 26.6, 25.8, 26.2, 22.13, 24, 27.2, 26.9, 25.3, 24.8, 
29.5, 28.06, 27.1, 27.37, 25.89, 26, 28.7, 26, 26.7, 29.2, 27.7, 
27.9, 27.2, 28.09, 26.83, 28.4, 25.52, 27.4, 28.3, 24.4, 26.1, 
26.58, 28.3, 28.94, 26.3, 29.5, 24.6, 26.48, 29.9, 29.3, 24.46
), sal = c(21.29, 33.36, 15.14, 21.77, 32.4, 22.6, 32.12, 15.49, 
11.92, 27.33, 30.53, 34.62, 32.48, 33.58, 25.2, 20.77, 27.89, 
11.36, 23.64, 31.21, 27.49, 13.21, 29.39, 31.54, 23.99, 20.4, 
25.94, 32.65, 26.36, 11.76, 13.2, 32.46, 29.36, 27.51, 31.35, 
27.92, 20.49, 32.29, 32.41, 29.26, 20.01, 20.07, 11.69, 26.48, 
25.8, 25.88, 24.12, 32.13, 29.3, 12.71, 28.69, 29.94, 25.05, 
25.01, 21.48, 31.62, 33.74, 31.89, 20.16, 27.41, 32.55, 26.18, 
27.94, 27.29, 12.98, 29.49, 25.37, 24.47, 25.29, 26.56, 15.42, 
31.41, 24.39, 28.7, 26.42, 33.79, 30.42, 31.53, 31.66, 28.33, 
25.14, 26.8, 17.55, 26.61, 29.8, 25.43, 30.31, 17.71, 13.05, 
23.33, 19.29, 26.6, 13.54, 28.12, 31.57, 29.08, 27.46, 22.86, 
24.7, 32.59, 29.62, 33.71, 16.24, 30.67, 24.28, 25.54, 26.56, 
15.19, 16.56, 22.54, 26.2, 8.76, 19.63, 31.26, 22.2, 17.99, 30.07, 
26.71, 29.02, 25.31, 29.7, 33.26, 18.74, 30.66, 28.95, 33.7, 
13.48, 30.12, 24.23, 25.18, 25.72, 7.88, 30.94, 15.33, 25.33, 
15.89, 27.07, 22.95, 29.72, 18.55, 28, 19, 29.13, 18.57, 34, 
23.11, 29.77, 32.93, 32.25, 15.67, 15.12, 30.52, 9.62, 28.82, 
29.05, 16.39, 23.45, 10.56, 23.72, 23.66, 25.49, 25.69, 27.77, 
17.2, 30.88, 14.86, 8.06, 22.97, 27.45, 16.97, 24.86, 26.03, 
17.07, 33.34, 23.65, 24.78, 10.25, 24.55, 26.69, 26.26, 25.24, 
31.83, 17.7, 10.51, 32.63, 14.04, 13.7, 32.12), DO = c(5.2, 2.7, 
5.3, 4, 4, 5.4, 5, 6.1, 4.68, 4.2, 3.17, 4.91, 5.99, 4.5, 4.9, 
5, NA, 5.9, 3.56, 3.22, 5.2, 5.25, 5.9, 2.4, 4.47, 5.6, 9.91, 
5.2, 5.9, 6.7, 6.4, NA, 5.5, 7.07, 5.17, 2.16, 4.4, 3.85, NA, 
6.8, 5.57, 5.5, 6.9, 5.05, 7.89, 4.48, 5.73, 5.3, 5.96, 7.16, 
3.92, 4.9, 4.94, 6.7, 4.46, 3.53, 5.45, 5.05, 6.2, 4.09, NA, 
4.61, 5.1, 5.76, 7.2, 4.69, 10.2, 9.87, 6.96, 7.25, 5.8, NA, 
5.64, 5.5, 7.26, 6.83, 3.35, 4.15, NA, 5.4, 3.59, 6.69, 5.3, 
6.22, 4.4, 7.98, 6.1, 8.14, 7.6, 5.03, 6.32, 7.21, 6.88, 8.69, 
10.57, NA, 6.6, 7.05, 5.41, NA, 3.61, 6.42, 6.1, 7.5, 6.06, 8.04, 
6.07, 4.94, 8.1, 5.52, 8.33, 8.82, 9.2, 4.69, 5.14, 7.18, 4.6, 
7.32, NA, 5.33, 5.9, 7.99, 10.5, 7.2, 5.3, NA, 8.4, 3.92, 8.61, 
7.85, 7.28, 8.68, 3.79, 7.2, 6.19, 7.29, 8.29, 7.8, 7.33, 12.55, 
9.88, 10.38, 5.3, 11.45, NA, 4.52, 5.5, 9.1, 7.59, 9.4, 7.7, 
NA, 8.94, 9.74, 7.8, 8.95, 9.32, 7.12, 6.76, 5.75, 7, 9.45, 6.19, 
7.84, 7.7, 8.6, 6.47, 7.6, 6.42, 12.07, 8.38, 8.58, 7.2, NA, 
8.45, 8.76, 9.51, 11.91, 8.1, 5.58, 10.13, NA, 11.72, 9.22, NA, 
7.92, 8.09, NA), water_depth = c(70, 45, 64, 76, 75, 91, 65, 
84, 80, 55, 51, 97, 62, 65, 98, 98, 58, 83, 68, 60, 80, 92, 68, 
95, 72, 101, 63, 80, 106, 103, 85, 49, 85, 72, 70, 90, 117, 95, 
58, 53, 72, 106, 102, 85, 74, 70, 62, 81, 79, 96, 79, 90, 86, 
95, 128, 101, 42, 70, 95, 100, 52, 60, 90, 52, 102, 90, 43, 64, 
96, 62, 80, 110, 105, 90, 52, 83, 70, 91, 40, 110, 105, 59, 96, 
56, 85, 102, 105, 87, 91, 103, 63, 84, 63, 62, 52, 115, 55, 83, 
104, 33, 78, 43, 80, 100, 50, 120, 72, 30, 103, 98, 74, 95, 62, 
62, 89, 57, 35, 53, 55, 85, 76, 45, 75, 79, 74, 65, 76, 50, 50, 
95, 35, 100, 62, 76, 78, 83, 60, 60, 49, 76, 50, 64, 73, 64, 
80, 64, 90, 55, 60, 57, 71, 60, 90, 67, 53, 67, 49, 61, 52, 60, 
68, 68, 70, 75, 71, 57, 63, 70, 63, 60, 70, 39, 77, 52, 75, 62, 
75, 38, 72, 66, 67, 62, 80, 80, 47, 81, 85, 49), sed_depth = c(51, 
4, 52, 47, 36, 39, 25, 54, 18, 10, 25, 78, NA, 105, 60, 35, NA, 
58, 27, 0, 15, 33, 6, 60, NA, 40, NA, 80, 34, 50, 33, NA, 39, 
15, 50, 40, 4, 80, NA, 27, 73, 40, 66, 45, NA, NA, NA, 46, NA, 
27, 50, 47, 34, 21, 7, 49, 7, 60, 28, 36, NA, NA, 30, NA, 15, 
10, 73, NA, 5, NA, 25, NA, 15, 55, 4, 81, 25, 61, NA, 35, 25, 
NA, 7, NA, 15, 63, 25, 73, 32, 27, NA, NA, 0, 3, 5, NA, 61, 52, 
70, NA, 48, 100, 37, 9, NA, 10, NA, 75, 18, 18, NA, 75, NA, 33, 
40, 35, 30, 100, NA, 65, 50, 90, 19, 61, 61, NA, 13, 35, NA, 
94, NA, 57, 50, 26, 75, 27, NA, 24, 61, 9, 68, NA, 29, 43, NA, 
30, 38, 90, 60, 2, 21, NA, 42, 55, 30, 48, 0, 69, NA, 50, NA, 
NA, 35, 13, 74, 33, 43, 35, 26, 35, NA, NA, 56, NA, 30, NA, 45, 
57, NA, 29, NA, NA, 35, 38, NA, 5, 15, NA), Month2 = structure(c(3L, 
2L, 3L, 2L, 2L, 3L, 2L, 3L, 4L, 4L, 4L, 3L, 3L, 2L, 3L, 2L, 3L, 
3L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 4L, 3L, 3L, 
4L, 4L, 3L, 4L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 
4L, 4L, 3L, 3L, 4L, 4L, 3L, 3L, 4L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 
4L, 2L, 3L, 4L, 2L, 4L, 3L, 3L, 2L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 4L, 3L, 4L, 3L, 2L, 
4L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 4L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 
4L, 4L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 4L, 2L, 
4L, 2L, 3L, 4L, 2L, 3L, 3L, 2L, 4L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 
3L, 3L, 3L, 4L, 3L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 3L, 2L, 3L, 2L, 
2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 4L, 2L, 2L, 3L, 3L, 3L, 2L, 3L, 
4L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("Jan", 
"Feb", "Mar", "Apr"), class = "factor")), row.names = c(NA, -188L
), class = c("tbl_df", "tbl", "data.frame"))

Effect sizes:

library(rstatix)  # provides wilcox_effsize()

effect <- by(dry_high_samplesize, dry_high_samplesize$CYR,
             function(z) wilcox_effsize(temp ~ Month, data = z,
                                        mu = 0,
                                        alt = "two.sided",
                                        conf.int = TRUE,
                                        conf.level = 0.8,
                                        paired = FALSE,
                                        exact = TRUE,
                                        correct = TRUE))

> effect
dry_high_samplesize$CYR: 2006
# A tibble: 1 × 7
  .y.   group1 group2 effsize    n1    n2 magnitude
* <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
1 temp  2      3        0.705    24    23 large    
----------------------------------------------------------------------------------------------- 
dry_high_samplesize$CYR: 2015
# A tibble: 1 × 7
  .y.   group1 group2 effsize    n1    n2 magnitude
* <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
1 temp  3      4        0.713    30    17 large    
----------------------------------------------------------------------------------------------- 
dry_high_samplesize$CYR: 2016
# A tibble: 1 × 7
  .y.   group1 group2 effsize    n1    n2 magnitude
* <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
1 temp  3      4        0.407    24    23 moderate 
----------------------------------------------------------------------------------------------- 
dry_high_samplesize$CYR: 2018
# A tibble: 1 × 7
  .y.   group1 group2 effsize    n1    n2 magnitude
* <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
1 temp  2      3        0.234    20    27 small  


Best Answer

  1. The $d$ figure is Cohen's $d$ [1], which is the number of population standard deviations, $\sigma$ (common to both populations), that the population means are apart ($|\mu_2-\mu_1|$), that is, $d = \frac{|\mu_2-\mu_1|}{\sigma}$ (see footnote $\dagger$). When the units of the raw variable are not especially meaningful, this is often a sensible measure of effect size in a comparison of means. This is routine in psychology and very common in the behavioural sciences more generally.

    [Conventionally, being a population parameter, it should be $\delta$ rather than $d$ but psychologists writing for psychologists don't bother much with statistical convention (no disparagement intended; they have other concerns). It wouldn't matter too much if it was merely convention, but this has the unfortunate side effect of leading psychologists to completely confound the distinct concepts of the parameter and a sample estimate of it and that in turn has further consequences, including pretty directly leading to a conflation of power and sample-based estimates of it -- and so the problems that relying on post hoc power can entail. But this is perhaps not the place to pursue it further.]
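To make the distinction between the parameter and a sample estimate of it concrete, here is a minimal sketch in base R. The population values (`mu1`, `mu2`, `sigma`) and the helper `cohens_d_hat` are made up for illustration; they are not taken from the data in the question.

```r
# Population effect size: a value you specify when planning a study.
mu1   <- 24.0   # hypothetical population mean, first month
mu2   <- 25.4   # hypothetical population mean, second month
sigma <- 2.0    # hypothetical common population SD
delta <- abs(mu2 - mu1) / sigma   # Cohen's d as a population quantity

# Sample estimate of the same quantity, using the pooled sample SD.
cohens_d_hat <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  abs(mean(y) - mean(x)) / s_pooled
}

set.seed(1)
x <- rnorm(30, mean = mu1, sd = sigma)
y <- rnorm(30, mean = mu2, sd = sigma)

delta               # fixed by the planner
cohens_d_hat(x, y)  # varies from sample to sample
```

Re-running the last four lines with different seeds shows how much the estimate moves around the fixed population value at this sample size.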

  2. In Cohen's discussion $d=0.8$ is taken as a 'large' effect size and $0.5$ as 'medium' (with $0.2$ as small), so $0.7$ is close to large on that scale. I don't think these conventional sizes are particularly well justified (the labels would have to be application-dependent, for one thing), but they have become a seemingly ironclad convention in psychology, to the extent that an actually externally justified effect size (were that to turn out to be possible in some circumstance) that differed from these conventions might well have trouble being accepted. This is not a particular criticism of Cohen; any attempt at a scale for variables where the units are essentially arbitrary would encounter difficulties; it's very difficult even when the units have a direct meaning.

    If you do follow the Cohen approach, I suggest reading what he has to say on it (including in the prefaces; the second edition that I was able to take a look at includes all the earlier prefaces and there's at least some discussion of those effect sizes in the prefaces). In chapter 1 he calls his suggestions for small, medium and large for each circumstance arbitrary, and limits his scope to behavioral and biological sciences. Each section that discusses new tests / new effect size measures (largely arranged into chapters) addresses the specific values, so beside the discussion in preface and chapter 1 you would consult the relevant chapter on the procedure (at least if it's discussed in the book at all; the t-test is in chapter 2).

    I do think the justification for calling that $0.8$ value large outside of the areas it was intended for is perhaps more likely to be dubious.

  3. As far as the input to the R function goes, $d$ is the (scaled) effect size you want to attain the specified power at. With an unbiased, consistent test (such as the t-test under fairly broad assumptions), the power will increase as you move away from $H_0$ through increasing effect sizes, forming a power curve / power function. Here we're looking at the way power changes with effect size at a fixed sample size:

    [Figure: power curve for the t test at n = 35, 35]

    (image taken from my answer here; in this diagram the effect size is signed, $(\mu_2-\mu_1)/\sigma$ and the power value indicated is the pure rejection rate, so here the rejection rate as you approach effect size $0$ is actually $\alpha$. The power values for the plot were generated by supplying a vector of $\delta$ values to the in-built R function power.t.test.)

    Since - as we can see - the value of power depends on the effect size and the sample size (and the significance level but I'll assume that you have chosen that already), to identify a sample size you must choose a specific effect size at which you want the given power. In the above diagram, we can see that at $35$ observations per sample, a power of $0.8$ is attained at $d=0.68$; if you only needed to guarantee that power when $d$ was $0.7$ you could get away with a slightly smaller $n$ - as you can see from the output in your question, where it's very nearly $33$.

    Since your variables have interpretable raw effect sizes, you might well prefer to work with the units of the original variable in specifying effect sizes -- e.g. to specifying a temperature effect size as a change in Celsius rather than in numbers of standard deviations. You would choose a population effect size that would be meaningful to pick up; e.g. a reasonable possibility for a 'small' effect size would be the smallest effect that would be of practical interest, while a 'large' one might be one that would be expected to have a substantive environmental effect. Be aware that I (along with most statisticians) will be an ignoramus in relation to what effect sizes will make sense in your application area -- that's a consideration for someone with some subject-matter expertise, which presumably you will have.
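As a sketch of working in raw units: the base R function `power.t.test` takes the raw difference (`delta`) and the standard deviation (`sd`) separately. The 1.5 °C difference and 2 °C standard deviation below are placeholders made up for illustration, not recommendations; substitute values that make sense in your application. The second half tabulates power at a fixed $n=35$ per group over a grid of standardized effect sizes, which is the kind of calculation behind a power curve.

```r
# Hypothetical planning values (assumptions, in degrees Celsius)
delta_temp <- 1.5   # smallest difference worth detecting
sd_temp    <- 2.0   # assumed common population SD

res <- power.t.test(delta = delta_temp, sd = sd_temp,
                    sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(res$n)   # round up to a whole observation
n_per_group

# Power at fixed n = 35 per group over a grid of standardized effect sizes
d_grid <- seq(0.2, 1, by = 0.1)
power_at_35 <- sapply(d_grid, function(d) {
  power.t.test(n = 35, delta = d, sd = 1, sig.level = 0.05)$power
})
round(data.frame(d = d_grid, power = power_at_35), 3)

# Power at d = 0.68 with 35 per group; should come out close to 0.8
power.t.test(n = 35, delta = 0.68, sd = 1, sig.level = 0.05)$power
```

Rounding the required $n$ up rather than to the nearest integer keeps the attained power at or above the target.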

  4. On the "add 15%" rule of thumb, it (eventually) occurs to me where that might come from -- under the assumption of a shift alternative, and restricted to symmetric distributions, the worst-case ARE of the Wilcoxon-Mann-Whitney test to the ordinary two-sample t-test is 108/125 = 0.864 (which occurs for the location-scale family of the beta(2,2) density).

    This would suggest for that very specific situation to multiply the sample size by 1/0.8640 (i.e. add 15.74%), but that's a pretty limited circumstance (population symmetry + shift alternative + large n) – and a worst case (whereas for many distributions heavier tailed than normal you'd be able to use smaller, not larger, sample sizes). You'll also want to assume finite population variance (and indeed perhaps even something a bit stronger than that); it will not be useful with very heavy tails.

    A suitable reference for it is [2]. (There's similar justification for the signed rank test compared to the one-sample/paired t-test. Outside those two cases I don't think that it applies.)

    I think it would be better, as far as possible, to put your effect sizes in terms of the population change that the Wilcoxon-Mann-Whitney looks at ($P(A>B)$ in the continuous case, where $A$ is a random member from one population and $B$ from the other), suitably scaled to an effect size. That should work across a wider variety of circumstances. In the case of a sample estimate I believe some people use $r=|Z|/\sqrt{n}$ or $r=|Z|/\sqrt{n-1}$ (which differ only by a typically negligible factor of $\sqrt{1-1/n}\approx 1-\frac{1}{2n}$); these seem to be sample versions of the same sort of thing as I was describing. SPSS uses $\hat{\eta}^2=\frac{Z^2}{n-1}$, which it appears would correspond in the population to the square of the quantity I was referring to, so it seems that's also essentially the same idea. It looks like several other suggested measures of effect size are also monotonically related to that same quantity, or nearly so.

    Sal Mangiafico discusses various measures of effect size for the test here in the context of an example involving Likert scales. While you're not dealing with Likert scales, there's some interesting comparisons there.
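The arithmetic behind the rule of thumb, and a way to check it under your own assumptions, can be sketched in base R. The helper `sim_power` below is hypothetical (written for this illustration); it assumes a pure shift alternative and, by default, normal populations. Pass a different generator as `rdist` to see how the comparison changes for other shapes.

```r
# Worst-case ARE (symmetric distributions, shift alternative)
are_worst <- 108 / 125
1 / are_worst                     # inflation factor, about 1.157
ceiling(33.02457 / are_worst)     # 39 per group when rounding up

# Simulated rejection rates for both tests at a given n per group.
# rdist: function(n) returning n draws from the assumed population shape.
sim_power <- function(n, shift, rdist = rnorm, nsim = 1000, alpha = 0.05) {
  rej <- replicate(nsim, {
    a <- rdist(n)
    b <- rdist(n) + shift
    c(wilcox = wilcox.test(a, b)$p.value < alpha,
      t      = t.test(a, b, var.equal = TRUE)$p.value < alpha)
  })
  rowMeans(rej)   # proportion of rejections for each test
}

set.seed(123)
sim_power(n = 33, shift = 0.7)   # unit-SD normal populations, d = 0.7
sim_power(n = 38, shift = 0.7)   # after adding roughly 15%
```

With normal populations the two tests should have similar power at the same $n$, since the ARE there is about 0.955; the 0.864 figure is a worst case, not a typical one.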

  5. You mention comparing medians, but you're using Wilcoxon-Mann-Whitney, which does not compare medians (in spite of quite a few books claiming otherwise). We sometimes get confused questions from readers of such books when they have identical sample medians and yet get a rejection with the Wilcoxon-Mann-Whitney, or cases where the test's estimate of the effect is in the opposite direction to the difference in medians.
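A small constructed example of the situation in point 5 (base R only). The values are artificial, chosen so that both sample medians are exactly 0 while most pairs have the `x` value above the `y` value; the quantity `p_hat` is the sample estimate of $P(A>B)$ obtained from the test statistic.

```r
# Artificial samples, 20 observations each, no ties, both with median 0
x <- c(-(1:10) / 100, 0.01, 11:19)             # lower half just below 0, upper half far above
y <- c(-(11:19), -0.025, 0.025, (2:10) / 50)   # lower half far below 0, upper half just above

median(x)   # 0
median(y)   # 0

wt <- wilcox.test(x, y)

# W counts the (x, y) pairs with x > y; dividing by the number of pairs
# estimates P(A > B)
p_hat <- unname(wt$statistic) / (length(x) * length(y))
p_hat        # 0.705
wt$p.value   # below 0.05 despite identical sample medians
```

The test responds to how often an observation from one group exceeds one from the other, which is why it can reject here even though the medians do not differ.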


$\dagger$ This is a response to the points in Jeremy Miles' comments below which grew to be many comments long

(a) Cohen's quite clearly talking about population values in his book (page refs are 2nd ed, and hereafter, for convenience, writing $m_A$ for Cohen's $\bf{\textsf{m}}_\bf{\textsf{A}}$ and so forth):

(i) "the investigator wishes to test the hypothesis that their respective population means are equal, $H_0: m_A-m_B=0$" (Sec 2.1 p19); this unambiguously establishes that his $m_A$ and $m_B$ are intended as population values; and

(ii) he then defines $d=\frac{|m_A-m_B|}{\sigma}$ (eq 2.2.2 p20) so that it is purely in terms of population quantities (as indeed he must because power is a long-run property of the test at a given alternative, not of the sample; Cohen did understand what power was).

(b) If he did somehow mean sample values in spite of explicitly saying otherwise, then all his tables are wrong, since the values he gives are definitely based off population quantities. If they were based off sample estimates the sample sizes would need to be larger to account for the uncertainties involved.

(c) I agree completely with your statement about Cohen recognizing the problem and expressing some degree of regret over events. I did not seek to lay more blame at his feet than he did $-$ indeed rather less $-$ but the issue did occur, and continues even in spite of his pointing it out, and so needs at least a passing mention here I think. I could perhaps make it more explicit than I did that this is not really his fault, and he did indeed address many of my points either right from the start or later but still quite a long time ago.


[1]: Cohen, J, Statistical Power Analysis for the Behavioral Sciences

[2]: J. L. Hodges Jr. and E. L. Lehmann. "The Efficiency of Some Nonparametric Competitors of the t-Test." Ann. Math. Statist. 27 (2), 324-335, June 1956. https://doi.org/10.1214/aoms/1177728261 (there's a pdf at the AOMS link)
