Bootstrap Confidence Intervals for Weir & Cockerham’s Fst

bootstrap, genetics, jackknife

I'm working on calculating bootstrap confidence intervals for Weir & Cockerham's Fst.
I want to use the percentile-t method as described in this paper.

I'm calculating the $F_{st}$ value between two sub-populations in the single allele, multiple loci case as:
$$
\hat \theta_W = \frac {\sum \limits_l a_l}{\sum \limits_l (a_l + b_l + c_l)}
$$

as described in Weir & Cockerham's 1984 paper.
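As a minimal sketch of this estimator in R, assuming the per-locus components are already stored in three equal-length numeric vectors (the names `a_l`, `b_l`, `c_l` and all values below are made up purely for illustration):

```r
# Hypothetical per-locus variance components (illustrative values only).
a_l <- c(0.020, 0.005, 0.012, -0.001)  # between-population component
b_l <- c(0.010, 0.020, 0.015, 0.012)   # between individuals within populations
c_l <- c(0.150, 0.180, 0.160, 0.170)   # between gametes within individuals

# Multi-locus estimator: a ratio of sums over loci.
theta_hat <- sum(a_l) / sum(a_l + b_l + c_l)

# Mean of per-locus ratios, computed only for contrast; it is a different quantity.
theta_mean_ratio <- mean(a_l / (a_l + b_l + c_l))
```

The point of the contrast is that the numerator and denominator are summed over loci before dividing, so loci are not weighted equally.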

My current plan is to estimate the variance of $\hat \theta$ by jackknifing over the loci:
$$
var(\hat \theta) \mathrel{\hat{=}} \frac {m - 1} {m} \sum \limits_{L = 1}^{m} \left( \hat \theta_{(L)} - \frac {1} {m} \sum \limits_{L = 1}^{m} \hat \theta_{(L)} \right)^2
$$

where $\hat \theta_{(L)}$ is the estimate of $\hat \theta$ obtained by omitting locus $L$ and $m$ is the number of loci.
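
Because the estimator is a ratio of sums, each leave-one-out estimate only requires subtracting the contribution of locus $L$ from the two totals. Writing $A = \sum \limits_l a_l$ and $D = \sum \limits_l (a_l + b_l + c_l)$ (shorthand introduced here for convenience, not notation from the paper):
$$
\hat \theta_{(L)} = \frac {A - a_L} {D - (a_L + b_L + c_L)}
$$

The standard error used for studentizing is then the square root of the jackknife variance, $\widehat{se}_{\hat \theta} = \sqrt{var(\hat \theta)}$.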

To get the confidence intervals, my steps are:

  1. Perform around 1,000 bootstrap replicates (I haven't worked out how many are appropriate; I was going to start with this and then increase it).

  2. In each round, I would gather a simple random sample with replacement of the loci and calculate the $F_{st}$ value ($\hat \theta^*$).

  3. I would then use jackknifing to estimate the variance of $\hat \theta^*$ (using the same method as above).

  4. Then I would calculate the t-statistic: $t^* = \frac {\hat \theta^* - \hat \theta} {\widehat{se}_{\hat \theta^*}}$

  5. Once the for loop is done, I would obtain the confidence interval with:
    $$
    (\hat \theta - t^*_{(1 - \alpha / 2)} \widehat{se}_{\hat \theta}; \hat \theta - t^*_{(\alpha / 2)} \widehat{se}_{\hat \theta})
    $$

(The confidence interval is from the Wikipedia entry on bootstrapping)

My main questions are:

  1. Is this the correct way to calculate the studentized bootstrap confidence intervals?

  2. Weir and Cockerham suggest that one could also bootstrap over samples instead of loci, but both the first paper I cited and Weir & Cockerham themselves bootstrap over loci. Is this the best approach for comparing two populations? I went with loci because my understanding is that we can assume independence between loci (assuming little dependence from linkage), whereas we might not be able to assume independence between samples within a sub-population, since they might be related.

Let me know if there are any errors, or if more info is needed.

Thank you!

Edit:
Here's some R code I've written to do this.
First I find the a, b, and c arrays and calculate $\hat \theta$, then I do this to find its variance:

findVar <- function(a, b, c) {
  # Jackknife over loci; a, b and c hold one variance component per locus
  m <- length(a)
  theta.L <- numeric(m)
  for (i in seq_len(m)) {
    # remove locus i
    theta.L[i] <- sum(a[-i]) / sum(a[-i] + b[-i] + c[-i])
  }
  mean.theta <- sum(theta.L) / m
  var.theta <- ((m - 1) / m) * sum((theta.L - mean.theta)^2)
  return(var.theta)
}
var.theta <- findVar(a, b, c)

Then I did this to find the bootstrap confidence intervals:

nBoot <- 1000
t.i <- numeric(nBoot)
m <- length(a)
for (i in 1:nBoot) {
  # resample loci with replacement
  idx <- sample(1:m, m, replace = TRUE)
  aTemp <- a[idx]
  bTemp <- b[idx]
  cTemp <- c[idx]
  theta.i <- sum(aTemp) / sum(aTemp + bTemp + cTemp)

  # find the variance
  var.theta.i <- findVar(aTemp, bTemp, cTemp)

  # studentize with the standard error, i.e. the square root of the variance
  t.i[i] <- (theta.i - theta) / sqrt(var.theta.i)
}

alpha <- 0.05
t.i <- sort(t.i)
firstT <- t.i[ceiling(length(t.i) * (1 - alpha / 2))]
secondT <- t.i[ceiling(length(t.i) * (alpha / 2))]

# the interval is built from the standard error, not the variance
se.theta <- sqrt(var.theta)
lower <- theta - (firstT * se.theta)
upper <- theta - (secondT * se.theta)

Here's a sample result:

Theta:  -0.1748
Confidence Interval: ( -0.2244 ,  0.6153 )

Note: I didn't have a very large data set; I just wanted to see whether the code worked.

Edited again to fix a mistake in getting the $\alpha / 2$ and $1 - \alpha / 2$ percentiles, and changed the example results.

Best Answer

  1. The method you outlined for computing the Studentized bootstrap confidence intervals looks fine to me.

  2. As long as you have data from multiple loci and are using that to estimate one value of $F_{st}$, bootstrapping over loci using a Studentized bootstrap confidence interval should be the way to go. The first paper you cited demonstrates, through simulation studies, that the percentile-t method used on Weir and Cockerham's multiple-locus estimator $\hat{\theta}_{loci}$ will give good results. Thus, you should feel confident with using that method.
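
For reference, the whole procedure can be sketched as one self-contained R script. This is only an illustration under stated assumptions: the function names (`theta_ratio`, `jack_se`, `boot_t_ci`) are made up here and do not come from any package, the per-locus components are simulated rather than real data, and resamples with a zero jackknife standard error are simply dropped.

```r
# Percentile-t (studentized) bootstrap CI for the multi-locus estimator,
# resampling loci. All names here are illustrative, not from any package.
theta_ratio <- function(a_l, b_l, c_l) sum(a_l) / sum(a_l + b_l + c_l)

jack_se <- function(a_l, b_l, c_l) {
  m <- length(a_l)
  den <- a_l + b_l + c_l
  loo <- (sum(a_l) - a_l) / (sum(den) - den)  # leave-one-locus-out estimates
  sqrt((m - 1) / m * sum((loo - mean(loo))^2))
}

boot_t_ci <- function(a_l, b_l, c_l, B = 1000, alpha = 0.05) {
  m <- length(a_l)
  theta_hat <- theta_ratio(a_l, b_l, c_l)
  se_hat <- jack_se(a_l, b_l, c_l)
  t_star <- vapply(seq_len(B), function(i) {
    idx <- sample.int(m, m, replace = TRUE)  # resample loci with replacement
    (theta_ratio(a_l[idx], b_l[idx], c_l[idx]) - theta_hat) /
      jack_se(a_l[idx], b_l[idx], c_l[idx])
  }, numeric(1))
  # Assumption: drop resamples whose SE is zero (non-finite t statistics).
  t_star <- t_star[is.finite(t_star)]
  q <- quantile(t_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
  c(lower = theta_hat - q[1] * se_hat, upper = theta_hat - q[2] * se_hat)
}

set.seed(1)
# Simulated components for 50 loci (made-up numbers, only to exercise the code).
a_l <- rnorm(50, mean = 0.01, sd = 0.01)
b_l <- runif(50, 0.005, 0.02)
c_l <- runif(50, 0.10, 0.20)
ci <- boot_t_ci(a_l, b_l, c_l)
```

Note that the upper quantile of $t^*$ sets the lower limit and the lower quantile sets the upper limit, matching the interval in the question.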
