If one wants to perform an A/B test with a small base rate, and not just for funsies, one has to ask what effect size, i.e. which absolute improvement, is considered worth the effort.
For example, if p = 1/10^6 and the number of visitors per month is 10^6, then even a relative improvement of 500 % (the rate becoming five times the baseline) means an absolute improvement of only 4 more conversions per month on average. If such differences cannot be justified with monetary arguments (e.g. the website is selling trips to space), an A/B test is not worth the trouble.
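To make that concrete, here is a quick back-of-the-envelope check in R, using the numbers above and reading the 500 % improvement as the rate becoming five times the baseline:
visitors   <- 1e6            # visitors per month
p_baseline <- 1e-6           # baseline conversion rate
p_improved <- 5 * p_baseline # rate after a 500 % relative improvement
visitors * p_baseline        # about 1 expected conversion per month
visitors * p_improved        # about 5, i.e. only 4 more on average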
However, if such differences are considered worth the effort, I suggest breaking down / decomposing the conversion rate into its participating factors. For example, let's say that one measures conversions as $\frac{\text{boughtSpaceTrips}}{\text{siteVisitors}}$. This rate can be split into ...
$\frac{\text{boughtSpaceTrips}}{\text{siteVisitors}} = \frac{\text{boughtSpaceTrips}}{\text{spaceTripsInBasket}} \cdot \frac{\text{spaceTripsInBasket}}{\text{siteVisitors}}$
This decomposition may allow one to detect differences in one of the decomposed ratios that do not show up in the composed ratio, either because they are countered by the other ratios (negative correlation) or because their contribution weight is so small that detecting them via the composed ratio requires the aforementioned ton of data. Whether there is some sort of negative correlation between the decomposed factors can be decided by applying domain knowledge, for example, how much it "costs" the user to perform a certain action.
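To illustrate the masking effect, here is a small sketch with purely made-up counts (the variants and numbers below are hypothetical, not from the question):
# Hypothetical numbers: variant B improves the basket-to-purchase step,
# but fewer visitors put a trip into the basket, so the composed rate barely moves.
visitors  <- c(A = 1e6, B = 1e6)
in_basket <- c(A = 200, B = 150)   # spaceTripsInBasket
bought    <- c(A = 20,  B = 21)    # boughtSpaceTrips
bought / in_basket     # 0.100 vs 0.140 -- clear difference in this factor
in_basket / visitors   # 2.0e-04 vs 1.5e-04 -- moves in the opposite direction
bought / visitors      # 2.0e-05 vs 2.1e-05 -- composed rate is nearly unchanged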
In the given constructed example, the reasoning
Improve $\frac{\text{boughtSpaceTrips}}{\text{spaceTripsInBasket}}$ => Improve $\frac{\text{boughtSpaceTrips}}{\text{siteVisitors}}$
is valid, but the other way around
Improve $\frac{\text{spaceTripsInBasket}}{\text{siteVisitors}}$ => Improve $\frac{\text{boughtSpaceTrips}}{\text{siteVisitors}}$
is not.
If the decomposition does not lead to more feasible base rates, then take a look at what the statistical literature offers for this kind of problem (keyword: "rare event(s)"). But in that case you go beyond the scope of normal A/B tests, so I would ask again whether this is worth the effort. As an aside, my intuition tells me that one cannot avoid the pillars of the universe, so rare events still require a lot of data (but maybe not a ton), no matter which fancy method is applied (domain knowledge may help a lot, though).
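To put a rough number on "a lot of data", one can ask base R's normal-approximation power calculation how many visitors per group it would take to detect even a fivefold improvement of such a rare event (the rates are the made-up ones from above, and the normal approximation is shaky at these rates, so treat the result as an order of magnitude only):
# Rough order-of-magnitude check: 1 in a million vs. 5 in a million,
# 5 % significance level, 80 % power
power.prop.test(p1 = 1e-6, p2 = 5e-6, sig.level = 0.05, power = 0.8)
This should come out to somewhere around three million visitors per group, which is in line with the intuition above.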
I have re-thought your problem and found Friedman's test, which is a non-parametric version of a one-way ANOVA with repeated measures.
I hope you have some basic skills with R.
# Creating a source data.frame (one row of 11 values per time point)
my.data <- data.frame(
  value = c(2, 7, 7, 3, 6, 3, 2, 4, 4, 3, 14,                      # baseline
            167, 200, 45, 132, NA, 245, 199, 177, 134, 298, 111,   # post1
            75, 43, 23, 98, 87, NA, 300, NA, 118, 202, 156,        # post2
            23, 34, 98, 112, NA, 200, NA, 156, 54, 18, NA),        # post3
  post.no = rep(c("baseline", "post1", "post2", "post3"), each = 11),
  ID = rep(c(1:11), times = 4))
# install the pgirmess package if needed; it provides friedmanmc() for the post-hoc comparisons
library(pgirmess)
Perform Friedman's test ...
friedman.test(my.data$value,my.data$post.no,my.data$ID)
Friedman rank sum test
data: my.data$value, my.data$post.no and my.data$ID
Friedman chi-squared = 14.6, df = 3, p-value = 0.002192
and then find between which groups the differences exist with a non-parametric post-hoc test.
Here you have all possible comparisons.
friedmanmc(my.data$value,my.data$post.no,my.data$ID)
Multiple comparisons between groups after Friedman test
p.value: 0.05
Comparisons
obs.dif critical.dif difference
baseline-post1 25 15.97544 TRUE
baseline-post2 21 15.97544 TRUE
baseline-post3 20 15.97544 TRUE
post1-post2 4 15.97544 FALSE
post1-post3 5 15.97544 FALSE
post2-post3 1 15.97544 FALSE
As you can see, only the baseline (first time point) is statistically different from the others.
I hope this will help you.
Best Answer
Let's take a stab at a first-order approximation assuming simple random sampling and a constant proportion of infection for any treatment. Assume the sample size is large enough that a normal approximation can be used in a hypothesis test on proportions, so we can calculate a z statistic like so:
$z = \frac{p_t - p_0}{\sqrt{p_0(1-p_0)(\frac{1}{n_1}+\frac{1}{n_2})}}$
This is the sample statistic for a two-sample test, new formula vs. bleach, since we expect the effect of bleach to be random as well as the effect of the new formula.
Then let $n = n_1 = n_2$, since balanced experiments have the greatest power, and use your specifications that $|p_t - p_0| \geq 0.1$, $p_0 = 0.2$. To attain a test statistic $|z| \geq 2$ (Type I error of about 5%), this works out to $n \approx 128$. This is a reasonable sample size for the normal approximation to work, but it's definitely a lower bound.
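To reproduce that number, rearrange the formula above with $n_1 = n_2 = n$ and solve for $n$; in R:
# Solve |z| >= 2 for n, with n1 = n2 = n, p0 = 0.2, |p_t - p0| = 0.1
z <- 2; p0 <- 0.2; delta <- 0.1
n <- z^2 * 2 * p0 * (1 - p0) / delta^2
n   # 128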
I'd recommend doing a similar calculation based on the desired power for the test to control Type II error, since an underpowered design has a high probability of missing an actual effect.
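One convenient way to do that, taking $p_t = 0.3$ as the alternative and the conventional 80 % power (both assumptions on my part), is base R's power.prop.test:
# Per-group sample size for 80 % power at the 5 % level (normal approximation)
power.prop.test(p1 = 0.2, p2 = 0.3, sig.level = 0.05, power = 0.8)
This should land somewhere around 290-300 subjects per group, noticeably more than the 128 lower bound above.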
Once you've done all this basic spadework, start looking at the stuff whuber addresses. In particular, it's not clear from your problem statement whether the samples of poultry measured are different groups of subjects, or the same groups of subjects. If they're the same, you're into paired t test or repeated measures territory, and you need someone smarter than me to help out!