From the problem you described, you want to set the death rate to 20% for the reference (control) group and effect.size = -3 (which sets the death rate in the treated group to 80%) in the LRPower() function:
```
LRPower(100, reference.group.incidence = 0.2, effect.size = -3, simulation.n = 5000)
[1] 0.9948
LRPower(40, reference.group.incidence = 0.2, effect.size = -3, simulation.n = 5000)
[1] 0.797
```
Thus, you need 20 control and 20 treated animals to distinguish an 80% death rate in the treated group from a 20% death rate in the control group with 79.7% power while holding the significance level at 0.05. In the LRPower() function, the default type I error is 0.05 and the default group.sample.size.ratio is 1.
For why you need to set effect.size = -3, see the methods document here: https://github.com/RongUTSW/Methods/blob/master/LRPowerSimulation.pdf
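If effect.size is a log odds ratio (an assumption on my part; the linked PDF gives the exact parameterization), you can sanity-check the implied treated-group rate in base R:

```r
# Sanity check, assuming effect.size is a log odds ratio (an assumption;
# confirm the parameterization against the linked methods PDF).
plogis(qlogis(0.2) + log(16))  # the log OR for 0.2 vs 0.8 is log(16) ~ 2.77; returns 0.8
plogis(qlogis(0.2) + 3)        # a magnitude-3 effect implies a treated rate of ~0.83
```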
The `power.t.test` function can only calculate power for the t-test.
If you don't know how to compute power for the other tests, you can use simulation - that is, simulate from some given distribution under the stated conditions.
You don't say what distribution you need to do it for; presumably the normal (but you should check carefully).
So you repeatedly simulate a pair of samples of size 10 with the given effect size and record whether each test rejects or not (or alternatively, record the p-values, which you later compare with the significance level).
You don't need to write functions to conduct each of the tests, since R already has functions that do all of those for you. I'd suggest writing a function that simulates a single pair of samples under the required conditions, calls each of the test functions, and gathers up only the information you need from each test (I would suggest the p-values), and then using `replicate` to call that function repeatedly and save the results.
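Here is a minimal sketch of that approach, assuming normal data and an effect size expressed as a shift in means (in SD units) - both assumptions you should check against your actual problem:

```r
# Simulate one pair of samples and return the p-values from all three tests.
sim_pvals <- function(n = 10, effect = 1) {
  x <- rnorm(n)                  # first sample
  y <- rnorm(n, mean = effect)   # second sample, shifted by the effect size
  c(pooled = t.test(x, y, var.equal = TRUE)$p.value,  # ordinary two-sample t-test
    welch  = t.test(x, y)$p.value,                    # Welch test (R's default)
    mw     = wilcox.test(x, y)$p.value)               # Mann-Whitney
}

set.seed(1)
pvals <- replicate(10000, sim_pvals())
rowMeans(pvals < 0.05)  # estimated power of each test at the nominal 5% level
```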
You may not be required to do so, but it also makes sense to compute the actual type I error rate - the rejection rate at effect size 0 - since neither the Mann-Whitney nor the Welch test will be carried out at exactly the nominal rate, but at some other rate (if you're actually testing at 3.6% instead of 5%, you would expect lower power, because the test is being conducted at a lower type I error rate).
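With the hypothetical sim_pvals() function from the sketch above, setting the effect size to 0 estimates those actual rates:

```r
set.seed(2)
null_pvals <- replicate(10000, sim_pvals(effect = 0))
rowMeans(null_pvals < 0.05)  # actual rejection rates under the null
```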
[For the tests to be genuinely comparable, you should conduct them at the same actual rate. Ideally, you would treat the impact on power and on significance level as separate issues, by finding the different actual significance levels and then carrying all of the tests out at as near to the same significance level as possible. This would involve either (a) carrying out the t-test at the actual level of the Mann-Whitney and then adjusting the Welch nominal level so that it had approximately the same significance level, or (b) using a randomized test to carry out the Mann-Whitney at exactly the 5% level and (again) adjusting the nominal level of the Welch test so its actual significance level is close to 5%. I expect you're not required to do this, though.]
I'd suggest a simulation size of at least 10000. You can calculate the standard error of the rejection rate estimate from the binomial distribution.
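For example, with an estimated rejection rate $\hat p$ from $N$ simulations, the standard error is $\sqrt{\hat p(1-\hat p)/N}$; at $\hat p = 0.8$ with $N = 10000$, that is $\sqrt{0.8 \times 0.2 / 10000} \approx 0.004$.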
First, I would suggest you consider carefully whether you have a really good reason to use different sample sizes. Especially if the smaller sample size is used for the group with the larger population variance, this is not an efficient design.
Second, you can use simulation to get the power for various scenarios. For example, if you use $n_1 = 20$, $\sigma_1 = 15$, $n_2 = 50$, $\sigma_2 = 10$, then you have about 75% power for detecting a difference $\delta = 10$ in population means with a Welch test at level 5%.
Because the P-value is taken directly from the `t.test` procedure in R, results should be accurate to 2 or 3 places, but this style of simulation runs slowly (maybe 2 or 3 minutes) with a million iterations. You might want to use 10,000 iterations if you are doing repeated runs for various sample sizes, and then use a larger number of iterations to verify the power of the final design.
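A sketch of this kind of simulation, with the parameter names and the arbitrary baseline mean being my own choices:

```r
# Welch-test power by simulation for the scenario above.
set.seed(3)
n1 <- 20; s1 <- 15   # Group 1: smaller sample, larger SD
n2 <- 50; s2 <- 10   # Group 2
delta <- 10          # difference in population means
B <- 10000           # iterations; raise toward 10^6 to verify the final design

pv <- replicate(B, t.test(rnorm(n1, 0, s1),
                          rnorm(n2, delta, s2))$p.value)  # Welch is t.test's default
mean(pv < 0.05)      # estimated power; about 0.75 for these settings
```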
Changing to $n_2 = 20$ gives power 67%, so the extra 30 observations in Group 2 are not 'buying' you as much as you might hope. By contrast, a balanced design with $n_1 = n_2 = 35$ gives about 90% power (with everything else the same).
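Rerunning the sketch above with `n2 <- 20`, or with `n1 <- n2 <- 35`, should reproduce these figures to within simulation error.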