You can do this using simulation.
Write a function that does your test and accepts the lambdas and sample size(s) as arguments (you have a good start above).
Now, for a given set of lambdas and sample size(s), run the function many times (the replicate function in R is great for that). The power is then simply the proportion of runs in which you reject the null hypothesis; you can use the mean function to compute that proportion and prop.test to give a confidence interval on the power.
Here is some example code:
tmpfunc1 <- function(l1, l2=l1, n1=10, n2=n1) {
  # simulate the two Poisson samples
  x1 <- rpois(n1, l1)
  x2 <- rpois(n2, l2)
  # MLEs: separate means under the alternative, pooled mean under the null
  m1 <- mean(x1)
  m2 <- mean(x2)
  m  <- mean( c(x1, x2) )
  # log of the likelihood ratio
  ll <- sum( dpois(x1, m1, log=TRUE) ) + sum( dpois(x2, m2, log=TRUE) ) -
        sum( dpois(x1, m, log=TRUE) ) - sum( dpois(x2, m, log=TRUE) )
  # p-value from the asymptotic chi-squared(1) null distribution
  pchisq(2*ll, 1, lower.tail=FALSE)
}
# verify under null n=10
out1 <- replicate(10000, tmpfunc1(3))
mean(out1 <= 0.05)
hist(out1)
prop.test( sum(out1<=0.05), 10000 )$conf.int
# power for l1=3, l2=3.5, n1=n2=10
out2 <- replicate(10000, tmpfunc1(3,3.5))
mean(out2 <= 0.05)
hist(out2)
# power for l1=3, l2=3.5, n1=n2=50
out3 <- replicate(10000, tmpfunc1(3,3.5,n1=50))
mean(out3 <= 0.05)
hist(out3)
My results (yours will differ with a different seed, but should be similar) showed a type I error rate (alpha) of 0.0496 (95% CI 0.0455-0.0541), which is close to 0.05; more precision can be obtained by increasing the 10000 in the replicate calls. The powers I computed were 9.86% and 28.6%. The histograms are not strictly necessary, but I like seeing the patterns.
The likelihood ratio test you're using relies on a chi-squared distribution to approximate the null distribution of the test statistic. This approximation works best with large sample sizes, so its inaccuracy with a small sample size makes some sense.
I see a few options for getting better Type-I error in your situation:
- There are corrected versions of the likelihood ratio test, such as Bartlett's correction. I don't know much about these (beyond the fact that they exist), but I've heard that Ben Bolker knows more.
- You could estimate the null distribution of the likelihood ratio statistic by bootstrapping: simulate data from the model fitted under the null and recompute the statistic each time. If the observed statistic exceeds the 95th percentile of this bootstrap null distribution, it's statistically significant at the 5% level.
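A minimal sketch of that parametric-bootstrap approach for the two-sample Poisson test (the function names, the B=2000 replication count, and the example data are my own choices, not from the original answer):

```r
# statistic 2*log(LR) for two Poisson samples, as in the test above
lrt_stat <- function(x1, x2) {
  m1 <- mean(x1); m2 <- mean(x2); m <- mean(c(x1, x2))
  2 * ( sum(dpois(x1, m1, log=TRUE)) + sum(dpois(x2, m2, log=TRUE)) -
        sum(dpois(x1, m, log=TRUE))  - sum(dpois(x2, m, log=TRUE)) )
}

boot_pvalue <- function(x1, x2, B=2000) {
  obs <- lrt_stat(x1, x2)
  m <- mean(c(x1, x2))              # pooled MLE under the null
  null_stats <- replicate(B, {
    b1 <- rpois(length(x1), m)      # resample both groups from the fitted null
    b2 <- rpois(length(x2), m)
    lrt_stat(b1, b2)
  })
  mean(null_stats >= obs)           # bootstrap p-value
}

# example: small samples, where the chi-squared approximation is shaky
x1 <- rpois(10, 3); x2 <- rpois(10, 3.5)
boot_pvalue(x1, x2)
```

Because the null distribution is estimated from the data rather than taken from the asymptotic chi-squared, this tends to give closer-to-nominal Type I error in small samples, at the cost of extra computation.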
- Finally, the Poisson distribution has one fewer free parameter than the negative binomial, and might be worth trying when the sample size is very small.
Best Answer
Empirical size refers to the possibility that the nominal size the user of the test chooses (say, 5%) may not coincide with the actual rejection frequency of the test. This may, for example, be the case when some assumption underlying the test statistic's null distribution is not met. E.g., many null distributions are derived asymptotically, i.e., under the assumption that $n\to\infty$. In finite samples, the empirical size may (and generally will) then differ from 5%.
A simple example is given by the t-test when sampling from a normal population. Here we are in the exceptional situation that we can actually derive the exact null distribution, $t(n-1)$. If we instead approximate the null distribution by the normal distribution, as justified by the CLT as $n\to\infty$, we would use 1.96 as the (two-sided) critical value, although the 0.975 quantile of the $t(n-1)$ distribution would be more accurate. The fraction of rejections when using 1.96 as the critical value is then the empirical size, and its difference from 0.05 is commonly called the size distortion.
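Since the exact null distribution is known here, the size distortion of the 1.96 rule can be computed directly (a quick illustration; the sample size n = 10 is my own choice):

```r
n <- 10
# actual two-sided rejection probability of the "normal" critical value 1.96
# when the statistic really follows t(n-1)
2 * pt(1.96, df = n - 1, lower.tail = FALSE)
# noticeably above the nominal 0.05 for small n
```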
Unfortunately, we do not know the exact finite-sample distribution in most cases. One alternative is to resort to simulation studies.
For my example (where it is actually superfluous, because analytical results are available), this might look as follows:
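A hedged reconstruction of such a simulation (the original code is not shown in the excerpt; the sample size, replication count, and seed below are my own choices):

```r
set.seed(1)
n    <- 10       # small sample, where the normal approximation is poor
reps <- 10000

# simulate t-statistics under H0: mu = 0 with normal data
tstats <- replicate(reps, {
  x <- rnorm(n)
  mean(x) / (sd(x) / sqrt(n))
})

mean(abs(tstats) > qnorm(0.975))      # empirical size using the normal c.v. 1.96
mean(abs(tstats) > qt(0.975, n - 1))  # empirical size using the exact t c.v.
```

The first proportion should come out above 0.05 (the size distortion), while the second should be close to the nominal 5%.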
The results confirm the analytical prediction: with the normal critical value 1.96, the test rejects more often than 5% in small samples, while the exact $t(n-1)$ critical value holds the nominal level.
Empirical size is therefore at best indirectly related to power, as it deals with rejection rates under the null. They are indirectly related because if a test is liberal (i.e., empirical size > nominal size), it will reject too often when the null is true, and will therefore typically also reject more often when the null is false, i.e., have higher power. That is, however, typically not viewed as a good thing, because the size is not "controlled" and the extra rejections are therefore spurious.