Effect Size and Bootstrapping Methods in Paired t-Test

bootstrap, effect-size, paired-data, t-test

I have multiple paired $t$-tests, such as one giving results:

$t_{14} = 2.7,\ p = .017$

Although people seem to compute effect sizes in different ways for repeated-measures designs, I have taken the mean difference divided by the standard deviation of the differences (I'll call this $d$, though maybe I should call it something else?) and get $0.70$. I also have a very strong correlation between the samples; I'm not sure if that is problematic.
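
(In symbols: $d = \bar{D}/s_D$, where $D$ denotes the paired difference scores and $s_D$ their standard deviation.)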

I would like to put confidence limits around my effect size estimate. To do so, I randomly resample (with replacement) from the difference scores, compute $d$ in the same way, and repeat this 1000 times. My question is whether this is a good approach, rather than, say, just giving confidence limits around the unstandardised difference, or resampling from the original samples.

My bootstrap gives me a mean $d$ of $0.79$ with confidence limits of $[0.4, 1.4]$. I've tried this on other random data too. Why am I getting a consistently higher $d$ from bootstrapping, and why are the intervals asymmetric? Is this because of skew in the (difference) scores, and does this make the approach more or less robust?


Edit: here is an example of the data involved. 15 people were each measured twice (A and B).

Mean A = 1742; SD = 435
Mean B = 1820; SD = 426
Mean difference = 78, SD of differences = 111, $d$ = 0.70

    A    B
 1999 2040
 1501 1601
 1552 1623
 2385 2386
 2488 2671
 1257 1218
 1806 1719
 1348 1405
 2048 2079
 1810 2017
 1308 1356
 2310 2324
 1247 1616
 1839 1878
 1235 1370
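
For concreteness, here is a minimal sketch in R of the resampling procedure described above (resample the difference scores with replacement, recompute $d$, repeat 1000 times); the variable names and the seed are illustrative, not from the original post:

A = c(1999, 1501, 1552, 2385, 2488, 1257, 1806, 1348, 2048, 1810, 1308, 2310, 1247, 1839, 1235)
B = c(2040, 1601, 1623, 2386, 2671, 1218, 1719, 1405, 2079, 2017, 1356, 2324, 1616, 1878, 1370)
D = B - A                            # difference scores
mean(D) / sd(D)                      # original d, about 0.70
set.seed(1)                          # illustrative seed for reproducibility
d_boot = replicate(1000, {
  s = sample(D, replace = TRUE)      # resample the differences
  mean(s) / sd(s)                    # recompute d on the resample
})
mean(d_boot)                         # bootstrap mean of d
quantile(d_boot, c(0.025, 0.975))    # percentile 95% limits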

Best Answer

I will attempt to answer, but I am not totally sure of my own knowledge on the subject.

The bootstrap, as far as I know, is always done on the original data. In your case the original data are pairs, so to bootstrap you would resample (with replacement) the pairs of the original data. Because the effect size is computed from the within-pair differences, that is equivalent to bootstrapping the difference scores and performing the effect size calculation on each resample, as you described (a sketch of this appears after the bootstrap output below).

I get a different result from yours (in R):

a=read.table(header=F,text="
1999 2040
1501 1601
1552 1623
2385 2386
2488 2671
1257 1218
1806 1719
1348 1405
2048 2079
1810 2017
1308 1356
2310 2324
1247 1616
1839 1878
1235 1370
")
d=a$V2-a$V1
mean(d)/sd(d)
[1] 0.7006464
aux=function(x,i) mean(x[i])/sd(x[i])
bb=boot::boot(d,aux,R=1000)
mean(bb$t)
[1] 0.7530415
boot::boot.ci(bb)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot::boot.ci(boot.out = bb)

Intervals : 
Level      Normal              Basic         
95%   ( 0.1840,  1.0846 )   ( 0.1454,  1.0570 )      

Level     Percentile            BCa          
95%   ( 0.3443,  1.2559 )   ( 0.1634,  1.0722 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

(code corrected as per the comments)
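
To check the equivalence mentioned above between resampling pairs and resampling difference scores, here is a sketch that bootstraps the rows of a directly (the function name aux_pairs and the object bb2 are my own, not from the original answer):

aux_pairs = function(dat, i) {
  di = dat$V2[i] - dat$V1[i]         # differences within the resampled pairs
  mean(di) / sd(di)
}
bb2 = boot::boot(a, aux_pairs, R = 1000)
mean(bb2$t)                          # close to mean(bb$t) above
boot::boot.ci(bb2, type = "perc")    # percentile interval, comparable to the one above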

Indeed, the direct calculation of the effect size (mean(d)/sd(d)) does not match the bootstrap mean (mean(bb$t)); the bootstrap mean comes out higher, consistent with what you observed. I don't know how to explain it.
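
One way to quantify that shift is the usual bootstrap bias estimate, i.e. the mean of the bootstrap replicates minus the statistic on the original data (print(bb) reports the same quantities):

bb$t0                # d on the original differences, 0.70 here
mean(bb$t) - bb$t0   # bootstrap estimate of the bias of mean(d)/sd(d)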

The only confidence interval that matches yours is the percentile interval. (I don't really know which interval to choose on theoretical grounds; I use the BCa, which I think was suggested somewhere.)
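
If you do settle on the BCa interval, boot.ci can be asked for just that type:

boot::boot.ci(bb, type = "bca")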

A second way to calculate a CI on the effect size is to use analytical formulas. This question on CV discussed the formulas: "How can I calculate the 95% confidence interval of an effect size if I have the mean difference score, CI of that difference score".

Using the MBESS package I get the following CI

MBESS::ci.sm(Mean = mean(d), SD=sd(d),N=length(d))
[1] "The 0.95 confidence limits for the standardized mean are given as:"
$Lower.Conf.Limit.Standardized.Mean
[1] 0.1231584

$Standardized.Mean
[1] 0.7006464

$Upper.Conf.Limit.Standardized.Mean
[1] 1.258396

As for your suggestion of computing the confidence interval for the raw difference score and using it to derive a confidence interval on the effect size, I have never heard of it, and I would suggest not using it.
