First, note that you are looking for a confidence interval for the mean (not for a proportion) of the $n$ "average scores" $x_1,\ldots,x_n$, so approximately you can use the confidence interval based on the t-distribution ($s$ denotes the estimated standard deviation):
$$\overline{x}\pm t_{1-\alpha/2}(n-1)\cdot \frac{s}{\sqrt{n}}$$
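As a sketch (in Python rather than R, with made-up example scores), this t-interval can be computed as:

```python
import numpy as np
from scipy import stats

def t_interval(x, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the mean,
    based on the t-distribution with n - 1 degrees of freedom."""
    x = np.asarray(x, dtype=float)
    n = x.size
    m = x.mean()
    s = x.std(ddof=1)                         # sample standard deviation
    q = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t quantile
    half = q * s / np.sqrt(n)
    return m - half, m + half

# hypothetical "average scores" in [0, 1]
scores = [0.61, 0.72, 0.55, 0.68, 0.70, 0.64, 0.59, 0.66]
lo, hi = t_interval(scores)
```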
As your observables ("average scores") $x_i$ are limited to the range $x_i\in [0,1]$, there is even a strict analytical formula for a $1-\alpha$ confidence interval, which can be derived from Hoeffding's inequality:
$$\overline{x}\pm \sqrt{-\frac{\ln(\alpha/2)}{2n}}$$
This interval has a strict (!) coverage probability greater than $1-\alpha$. To prove this, start with Hoeffding's inequality with $a_i=0\leq x_i\leq 1=b_i$:
$$P\left(|\overline{x}-\mu|\geq\frac{t}{n}\right)\leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right)$$
The sum in the denominator on the right-hand side equals $n$ (since $b_i-a_i=1$ for all $i$), and substituting $\varepsilon=t/n$ yields the inequality
$$P\left(|\overline{x}-\mu|\geq \varepsilon\right)\leq 2\, e^{-2n\varepsilon^2}$$
Now setting the right-hand side equal to $\alpha$ and solving for the half-width yields the confidence interval above.
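As a sketch (Python, with the same hypothetical scores), note that the Hoeffding interval depends on the data only through $\overline{x}$ and $n$:

```python
import numpy as np

def hoeffding_interval(x, alpha=0.05):
    """Distribution-free (1 - alpha) confidence interval for the mean of
    observations bounded in [0, 1], derived from Hoeffding's inequality.
    Half-width: sqrt(-ln(alpha / 2) / (2 n))."""
    x = np.asarray(x, dtype=float)
    assert np.all((0 <= x) & (x <= 1)), "observations must lie in [0, 1]"
    half = np.sqrt(-np.log(alpha / 2) / (2 * x.size))
    m = x.mean()
    # for small n the interval is very wide and may extend beyond [0, 1];
    # it can be clipped to [0, 1] without losing coverage
    return m - half, m + half

scores = [0.61, 0.72, 0.55, 0.68, 0.70, 0.64, 0.59, 0.66]
lo, hi = hoeffding_interval(scores)
```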
A more general approach that does not require an analytical approximation is repeated sampling with replacement, aka "bootstrap". There are different ways to obtain confidence intervals via bootstrapping, and one of the better methods is the "bias corrected accelerated bootstrap" (BCa). In R, you can compute it as follows:
# function that computes your observable from data
observable <- function(x, indices) {
  x.bootstrap <- x[indices]
  ... # compute your observable from x.bootstrap
  return(obs)
}
# computation of non-parametric bootstrap intervals
library(boot)
boot.out <- boot(data=x, statistic=observable, R=1000)
ci <- boot.ci(boot.out, conf=0.95, type="bca")
I have done some comparative studies and found the BCa interval to have better coverage probabilities than the other bootstrap intervals, but poorer coverage probability than analytic approximations where an analytic approximation is possible (see the reference below). I would thus conjecture that, in your use case, the simple approach via the t-distribution will have better coverage probability than the bootstrap methods.
Dalitz: "Construction of confidence intervals." Technical Report No. 2017-01, pp. 15-28, Hochschule Niederrhein, Fachbereich Elektrotechnik und Informatik, 2017
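The qualitative behavior of the two analytic intervals can be checked with a small Monte Carlo simulation (a Python sketch; Beta(2, 2) is an arbitrary choice of score distribution on $[0,1]$ with known mean 0.5). The t-interval should cover close to the nominal 95%, while the Hoeffding interval, being strict, over-covers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, reps = 30, 0.05, 2000
mu = 0.5                                          # true mean of Beta(2, 2)

cover_t = cover_h = 0
half_h = np.sqrt(-np.log(alpha / 2) / (2 * n))    # Hoeffding half-width
for _ in range(reps):
    x = rng.beta(2, 2, size=n)                    # scores in [0, 1]
    m, s = x.mean(), x.std(ddof=1)
    half_t = stats.t.ppf(1 - alpha / 2, n - 1) * s / np.sqrt(n)
    cover_t += (m - half_t <= mu <= m + half_t)
    cover_h += (m - half_h <= mu <= m + half_h)

print("t-interval coverage:        ", cover_t / reps)
print("Hoeffding interval coverage:", cover_h / reps)
```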
Best Answer
Let $X_{j,1},X_{j,2},...,X_{j,20}\sim F(\theta_j,k_j)$ be the 20 observations in batch $j=1,...,10$ that follow a particular distribution with shape or location parameter $\theta_j$ and scale parameter $k_j$. Pooling all 200 observations when performing inference on these parameters or a function of these parameters assumes the parameters are all the same, i.e. $\theta_j=\theta$ or $g(\theta_j,k_j)=g(\theta,k)$ $\forall j$. There is no batch effect. In contrast, performing inference on these parameters or a function of these parameters by batch assumes the parameters are all different. That you get wider confidence limits for the pooled analysis suggests to me the parameters are not all the same. There is a batch effect.
Treating all of the parameters as fixed quantities, each batch can be viewed as a sample from a subpopulation. Since you have equal sample sizes in each batch, I suppose you could assume the parameters are all different and think of the pooled analysis as investigating something like $\theta\equiv\frac{1}{10}\sum \theta_j$. If, say, we were sampling people at random from a broader target population, then the fixed $\theta$ could be viewed as a weighted average of the fixed subpopulation $\theta_j$'s. Then $\theta$ would be the overall population parameter and may well be a meaningful quantity to consider.
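To illustrate (a Python sketch with simulated data; the batch-effect sizes are made up): when the $\theta_j$ differ, a naive pooled t-interval over all 200 observations understates the uncertainty about $\theta=\frac{1}{10}\sum\theta_j$, compared with treating the 10 batch means as the sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
J, n = 10, 20
theta = rng.normal(0.6, 0.08, size=J)                 # hypothetical batch-level means
data = rng.normal(theta[:, None], 0.05, size=(J, n))  # 20 observations per batch

# Naive pooled analysis: treat all 200 observations as i.i.d.
pooled = data.ravel()
half_pooled = (stats.t.ppf(0.975, pooled.size - 1)
               * pooled.std(ddof=1) / np.sqrt(pooled.size))

# Batch as sampling unit: analyze the 10 batch means
means = data.mean(axis=1)
half_batch = (stats.t.ppf(0.975, J - 1)
              * means.std(ddof=1) / np.sqrt(J))
```

With a real batch effect, the interval from the batch means is the wider, honest one: the pooled interval pretends to have 200 independent observations.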
I'm not usually a fan of treating parameters as random variables, but here it might make sense. If there is truly a different batch effect each time you produce outputs, you could view each batch parameter as the sampling unit, with the 20 observations as repeated measurements. This would be akin to repeated measurements on subjects participating in a clinical trial: if we were to repeat the trial many times we would use different subjects, so the subject is our sampling unit with repeated measures. If you were to repeat your experiment many times you would have different batch parameters, so $\theta_j$ and $k_j$ would be your sampling unit.

You could accomplish this using a mixed model or a covariance pattern model. The mixed model would allow you to perform inference on individual batch parameters as well as the marginal parameters that govern the sampling of the batch effect; the covariance pattern model would allow inference only on those marginal parameters. Going back to the clinical-trial analogy, a mixed model would allow you to perform inference on individual subject parameters as well as the target patient population parameters, whereas a covariance pattern model would be useful only for inference on the target patient population parameters.
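As a sketch of the mixed-model route (Python/statsmodels rather than R; all variance magnitudes are made up), a random-intercept model treats each batch effect as a random draw:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
J, n = 10, 20
b = rng.normal(0.0, 0.1, size=J)            # random batch effects b_j
df = pd.DataFrame({
    "batch": np.repeat(np.arange(J), n),
    "y": 0.6 + np.repeat(b, n) + rng.normal(0.0, 0.05, size=J * n),
})

# y_ij = theta + b_j + e_ij with b_j ~ N(0, tau^2): random intercept per batch
fit = smf.mixedlm("y ~ 1", df, groups=df["batch"]).fit()
theta_hat = fit.params["Intercept"]         # marginal mean across batches
tau2_hat = fit.cov_re.iloc[0, 0]            # estimated between-batch variance
```

`fit.random_effects` then gives the per-batch estimates, i.e. the "inference on individual batch parameters" mentioned above.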
Let me know if there is anything I have misunderstood.