I have no real answer for you, only some thoughts. You are unlucky in that the illness is so rare.
I'll first note that this design would have caused trouble even if the illness were common. For example, the SE formula for the weighted prevalence requires $n_h > 1$ observations per stratum (Cochran, 1977, Chapter 5).
You ask whether it is okay to ignore the stratification and apply a formula for an exact CI. There is no real justification for doing so: the theory assumes simple random sampling (SRS). In that design every observation has the same probability of selection. In your design, a stratified sample with one observation per stratum, the selection probabilities range from 1/2 to 1/15, or, more formally, $1/N_h$, where $N_h$ is the size of stratum $h$. The SRS CI endpoint will be biased if you over-sampled or under-sampled the strata with higher expected prevalences.
You can, however, check the direction of this bias. You have some knowledge of risk predictors for the illness (the characteristics you used to form the strata). As best you can, form $G$ groups of strata with different levels of risk and rank the groups from lowest expected risk to highest. Then plot the individual $N_h$ and the group mean $N_h$ against group number. A positive trend (average $N_h$ increasing with group number) indicates that you under-sampled the higher-risk groups, which might partly account for the failure to see any cases. A negative trend would show that you over-sampled the high-risk groups; in that case the failure to see cases is due partly to bad luck and partly to taking too small a sample.
Theory for Simple Random Sampling without replacement
Let the unknown number of patients with the illness be $D$, assumed $> 0$; with $N$ the population size, the prevalence of the illness $P$ is
$$
P = \frac{D}{N}
$$
Note that $D$ can take only integer values.
Suppose the number of patients in the sample with the condition is $T$. Then $T$ has a hypergeometric distribution, not a binomial distribution, because the population is finite and sampling is without replacement (Cochran, 1977, p. 55). (This accounts for the appearance of the finite population correction in variance formulas for sampling without replacement.)
The parameters of the hypergeometric distribution are $N$, the population size; $D$, the number of patients with the illness in the population; and $n$, the sample size. The probability that $T = d$ is:
$$
\Pr(T = d \mid N, n, D) = \dfrac{\dbinom{D}{d} \dbinom{N-D}{n-d}}{\dbinom{N}{n}}
$$
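The text uses Stata below; as a language-neutral check, the probability above can be computed with only the Python standard library. The numbers here ($N = 250$, $n = 40$, and an illustrative $D = 9$) are just examples:

```python
from math import comb

def hypergeom_pmf(d, N, n, D):
    """Pr(T = d | N, n, D): probability of seeing d cases in a sample of n
    drawn without replacement from a population of N containing D cases."""
    return comb(D, d) * comb(N - D, n - d) / comb(N, n)

# Illustrative values: population 250, sample 40, D = 9 cases in the population.
p0 = hypergeom_pmf(0, 250, 40, 9)   # chance of observing no cases at all
check = sum(hypergeom_pmf(d, 250, 40, 9) for d in range(10))  # support is d = 0..9
```

Note that `math.comb(a, b)` returns 0 when `b > a`, which conveniently zeroes out infeasible terms.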
Confidence interval for SRS without replacement
I'll demonstrate the CI that would have been valid for a simple random sample. Take population size $N$, $D$ events in the population, $d$ events in the sample, and sample size $n$. The one-sided $1-\alpha$ upper endpoint for $D$ is the largest value of $D$ for which
$$
P(T \leq d \> \vert \> N, n, D) > \alpha
$$
where $T$ has a hypergeometric distribution with parameters $(N, n, D)$; any larger $D$ would be rejected by the corresponding level-$\alpha$ test. This CI is based on inverting a hypothesis test about $D$. See, e.g., Blaker, 2000.
With $d = 0$, the condition is
$$
P(T = 0 \> \vert \> N, n, D) > \alpha
$$
In your study, $N = 250$, $n = 40$, and $d = 0$. Suppose these data had been generated by a SRS. I used Stata's hypergeometric function to generate a one-sided 80% CI. I chose 80% because in such a situation my practice is to trade confidence for a smaller interval.
Under SRS, the upper bound of the one-sided 80% (actually 79.8%) hypergeometric CI for $D$ would be $D_u = 9$, which corresponds to a prevalence of $\hat{P} = 9/250 = 3.6\%$. The corresponding one-sided binomial interval, which ignores the finite population, would give $\hat{P} = 3.9\%$. You can see that the hypergeometric interval is shorter. Both intervals are likely to be conservative, with the true coverage probability greater than the nominal 80% (Blaker, 2000).
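The Stata computation can be reproduced by a direct search in plain Python (standard library only), using the study's values $N = 250$, $n = 40$, $d = 0$, $\alpha = 0.2$:

```python
from math import comb

def p_zero(N, n, D):
    # Pr(T = 0 | N, n, D): the sample of n contains none of the D cases.
    return comb(N - D, n) / comb(N, n)

def upper_bound_D(N, n, alpha):
    # Largest D not rejected by the level-alpha test when d = 0 is observed,
    # i.e. the largest D with Pr(T = 0) > alpha (test inversion).
    D = 0
    while D < N - n and p_zero(N, n, D + 1) > alpha:
        D += 1
    return D

D_u = upper_bound_D(250, 40, 0.20)   # hypergeometric upper bound for D
binom_u = 1 - 0.20 ** (1 / 40)       # binomial analogue, ignores finite N
```

The attained confidence level is $1 - \Pr(T = 0 \mid N, n, D_u)$, which is where the "actually 79.8%" figure comes from.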
Actual distribution: weighted sum of Bernoulli variables
Let $h$ index strata. In stratum $h$, let $n_h$ be the sample size ($= 1$ here), $d_h$ be the number with the illness in the sample ($= 0$ or $1$ here), $D_h$ be the unknown number of patients in the population with the illness, and $P_h = D_h/N_h$ be the unknown prevalence in stratum $h$.
The sum of the $D_h$ is $D$, the unknown number of ill patients in the population. The estimated prevalence is
\begin{align}
\hat{P} & = \frac{\hat{D}}{N}
\end{align}
with
\begin{align}
\hat{D} = \sum_h \dfrac{N_h}{n_h} d_h = \sum_h N_h d_h
\end{align}
With $n_h = 1$, the distribution of $d_h$ is that of a Bernoulli 0-1 random variable with probability $p_h = D_h/N_h$. Thus $\hat{D}$ is a weighted sum of these.
I don't know how to do a hypothesis test for $D$ in this situation, so I don't have a test to invert to get a confidence interval. The problem is that there is no single probability distribution for $\hat{D}$ for each possible value $D_0$; there is a different distribution for each compatible set of the $D_h$ for which $\sum_h D_h = D_0$.
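To make the estimator concrete, here is a minimal sketch with hypothetical strata (the sizes and the single observed case are invented for illustration):

```python
# Hypothetical strata: one patient sampled per stratum (n_h = 1), so the
# weight for stratum h is N_h / n_h = N_h.
N_h = [2, 5, 8, 12, 15]        # stratum sizes (invented)
d_h = [0, 0, 0, 1, 0]          # observed case indicator for each sampled patient

# D_hat = sum_h (N_h / n_h) * d_h  --  a weighted sum of Bernoulli variables
D_hat = sum(Nh * dh for Nh, dh in zip(N_h, d_h))
P_hat = D_hat / sum(N_h)       # estimated prevalence D_hat / N
```

A single observed case contributes its full stratum weight $N_h$, which is one way to see why the estimator is so variable when $n_h = 1$.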
Other Designs
When confronted with a population with a rare outcome, you do not have many good design choices. A larger sample would have helped. For a rare outcome such as yours, I would have tried inverse sampling: sample randomly until one case is found, so that the number of trials is the random variable. There are CI formulas for the case of independent samples (see Zou, 2010), but I haven't found one for the case of without-replacement sampling, where the relevant distribution is the "negative hypergeometric", which is the same as the beta-binomial distribution.
There is a theory of optimal design, and I state it for background. According to the theory, the selection probability $\pi$ for an observation should be proportional to the expected "size" of the observation, in this case its risk of disease. For stratified sampling (Cochran, 1977, Chapter 5), you'd form a small number of strata in which the observations have similar (very low) expected risks $P_h$, then make the selection fraction $n_h/N_h \propto P_h(1-P_h)$, which is very close to proportionality to $P_h$ for small risks. It's unlikely that you'd be able to quantify actual risks, but you get the idea: higher-risk patients are selected with higher probabilities.
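As a sketch of this allocation rule (all risks and stratum sizes below are invented for illustration):

```python
# Invented per-stratum risks and sizes; allocate n_h so that the selection
# fraction n_h / N_h is proportional to P_h * (1 - P_h), as described above.
P_h = [0.001, 0.005, 0.02, 0.05]   # hypothetical expected risks, low to high
N_h = [100, 80, 50, 20]            # hypothetical stratum sizes
n_total = 40                       # total sample size to allocate

weights = [Nh * p * (1 - p) for Nh, p in zip(N_h, P_h)]   # n_h proportional to these
scale = n_total / sum(weights)
n_h = [min(Nh, round(scale * w)) for Nh, w in zip(N_h, weights)]

fractions = [nh / Nh for nh, Nh in zip(n_h, N_h)]  # selection fractions rise with risk
```

Rounding means the allocated total may miss `n_total` by one or two; in practice you would adjust by hand.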
A practical tactic is to identify a group of $N_1$ patients with risks so low that you are very sure there are no cases among them, and omit them from the sampling. This leaves $N_2 = N - N_1$ people to sample from. If the upper CI endpoint from inverse or random sampling among them is $\hat{P}_2$, the estimated prevalence in the whole population is $\hat{P} = \dfrac{N_2}{N} \hat{P}_2$.
References
Blaker, H. 2000. Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 28, no. 4: 783-798.
Cochran, William G. 1977. Sampling Techniques. New York: Wiley.
Zou, G.Y. 2010. Confidence interval estimation under inverse sampling. Computational Statistics & Data Analysis 54, no. 1: 55-64.
Best Answer
This answer is based on information in the paper O'Neill (2021), which sets out mathematical properties of the Wilson score interval, including the finite population correction for inference to a finite population or the unsampled part of a finite population. Please see the linked paper for more details on the matter discussed here.
CI for the proportion in a finite population: You can find a derivation of the standard Wilson score interval in this related answer. The present analysis, adjusting for a finite population, is relatively simple to do using sample moment results found in O'Neill (2014). To facilitate this analysis, define the effective sample size for the inference as:
$$n_* \equiv n \cdot \frac{N-1}{N-n}.$$
As in the linked paper, we will derive the confidence interval via standard manipulations, starting from a pivotal quantity for the finite-population inference. Writing $\theta_N$ for the proportion in the finite population and $\hat{\theta}_n$ for the sample proportion, our pivotal quantity is:
$$n_* \cdot \frac{(\hat{\theta}_n-\theta_N)^2}{\theta_N (1-\theta_N)} \overset{\text{Approx}}{\sim} \text{ChiSq}(1).$$
As in the question, let $\chi_{1,\alpha}^2$ denote the critical point of the chi-squared distribution with one degree-of-freedom (with upper tail area $\alpha$). For any confidence level $1-\alpha$ we then have the probability interval:
$$\begin{align} 1-\alpha &\approx \mathbb{P} \Big( n_* (\hat{\theta}_n-\theta_N)^2 \leqslant \chi_{1,\alpha}^2 \theta_N (1-\theta_N) \Big) \\[6pt] &= \mathbb{P} \Big( n_* (\hat{\theta}_n^2 - 2 \hat{\theta}_n \theta_N + \theta_N^2) \leqslant \chi_{1,\alpha}^2 (\theta_N-\theta_N^2) \Big) \\[6pt] &= \mathbb{P} \Big( (n_* + \chi_{1,\alpha}^2) \theta_N^2 - (2 n_* \hat{\theta}_n + \chi_{1,\alpha}^2) \theta_N + n_* \hat{\theta}_n^2 \leqslant 0 \Big) \\[6pt] &= \mathbb{P} \Bigg( \theta_N^2 - 2 \cdot\frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \cdot \theta_N + \frac{n_* \hat{\theta}_n^2}{n_* + \chi_{1,\alpha}^2} \leqslant 0 \Bigg) \\[6pt] &= \mathbb{P} \Bigg( \bigg( \theta_N - \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \bigg)^2 \leqslant \frac{\chi_{1,\alpha}^2 (n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2)}{(n_* + \chi_{1,\alpha}^2)^2} \Bigg) \\[6pt] &= \mathbb{P} \Bigg( \theta_N \in \Bigg[ \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_* + \chi_{1,\alpha}^2} \cdot \sqrt{n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg] \Bigg), \\[6pt] \end{align}$$
and substitution of the observed sample proportion (for simplicity I will use the same notation for this value) then leads to the Wilson score interval:
$$\text{CI}_N(1-\alpha) = \Bigg[ \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_* + \chi_{1,\alpha}^2} \cdot \sqrt{n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg].$$
As can be seen, the finite-population correction in the Wilson score interval consists of replacing the sample size $n$ with the effective sample size $n_*$. This is an extremely simple adjustment, and it gives a confidence interval that is valid for any population size $N \geqslant n$ (finite or infinite). It is trivial to confirm that this reduces down to the standard Wilson score interval when $N = \infty$ (giving $n_* = n$). It is also useful to note that when you take a full census of the population with $N=n$ you get $n_* \rightarrow \infty$ and the confidence interval reduces to the single point $\text{CI}_n(1-\alpha) = [\hat{\theta}_n]$, just as we would expect.
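A direct transcription of this interval into Python (the function name is mine; `chi2_crit` is the $\chi_{1,\alpha}^2$ critical value, e.g. 3.841 for 95% confidence):

```python
from math import sqrt

def wilson_fpc(theta_hat, n, N, chi2_crit):
    """Wilson score interval for the finite-population proportion theta_N,
    using the effective sample size n* = n (N - 1) / (N - n).  Requires
    N > n; the census case N = n collapses to the point [theta_hat]."""
    n_star = n * (N - 1) / (N - n)
    centre = (n_star * theta_hat + chi2_crit / 2) / (n_star + chi2_crit)
    half = (sqrt(chi2_crit) / (n_star + chi2_crit)) * sqrt(
        n_star * theta_hat * (1 - theta_hat) + chi2_crit / 4)
    return centre - half, centre + half

# With N very large, n* is close to n and this is the ordinary Wilson interval.
lo, hi = wilson_fpc(0.30, 100, 10**9, 3.841)
```

Shrinking $N$ toward $n$ inflates $n_*$ and narrows the interval, which is the finite-population correction at work.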
CI for the unsampled proportion in a finite population: A useful variant of this problem occurs when you are interested in making an inference about the proportion of positive outcomes in the unsampled part of a finite population, which is $\theta_{n:N} \equiv \sum_{i=n+1}^N X_i/(N-n)$. In this case, we can use the pivotal quantity:
$$n_{**} \cdot \frac{(\hat{\theta}_n-\theta_{n:N})^2}{\theta_{n:N} (1-\theta_{n:N})} \overset{\text{Approx}}{\sim} \text{ChiSq}(1) \quad \quad \quad n_{**} \equiv n \cdot \frac{N-n}{N-1},$$
which uses the alternative value $n_{**}$ for the effective sample size. The remaining calculations are the same as above, except that $n_{**}$ replaces $n_{*}$ in the formulae, giving the confidence interval:
$$\text{CI}_{n:N}(1-\alpha) = \Bigg[ \frac{n_{**} \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_{**} + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_{**} + \chi_{1,\alpha}^2} \cdot \sqrt{n_{**} \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg].$$
It is again trivial to confirm that this reduces down to the standard Wilson score interval when $N = \infty$ (giving $n_{**} = n$). It is also useful to note that when you take a full census of the population with $N=n$ you get $n_{**} = 0$ and the confidence interval reduces to the vacuous interval $\text{CI}_{n:n}(1-\alpha) = [0,1]$, just as we would expect.
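The same sketch adapts to the unsampled-portion interval by swapping in $n_{**}$ (again, the function name is mine):

```python
from math import sqrt

def wilson_unsampled(theta_hat, n, N, chi2_crit):
    """Wilson score interval for the proportion in the unsampled part of the
    population, using the effective sample size n** = n (N - n) / (N - 1)."""
    n_ss = n * (N - n) / (N - 1)
    centre = (n_ss * theta_hat + chi2_crit / 2) / (n_ss + chi2_crit)
    half = (sqrt(chi2_crit) / (n_ss + chi2_crit)) * sqrt(
        n_ss * theta_hat * (1 - theta_hat) + chi2_crit / 4)
    return centre - half, centre + half

# Census check: N = n gives n** = 0 and the vacuous interval [0, 1].
lo, hi = wilson_unsampled(0.30, 50, 50, 3.841)
```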
Implementation in R: These generalised confidence intervals are implemented in the `CONF.prop` function in the `stat.extend` package. The function implements the Wilson score intervals in the form shown above. By default the function uses the standard interval for the proportion parameter in an infinite population. However, the function also allows specification of a population size `N` and a logical value `unsampled` to give the above interval forms.