I'll give an explanation that is very close to that of Maarten Buis but just a little more elaborate. As always in survival analysis, different time scales can be applied. I think that age is maybe the more intuitive time scale in your setting, so that's where I'll start my answer. Afterwards, I'll try to use that intuition to answer the question.
Let $C_i$ be the time of birth of subject $i$. From your data we can easily calculate the age at entry into the study,
$$
A_i = t_0 - C_i
$$
and the age at exit from the study,
$$
B_i = \min\{T - C_i, D_i\},
$$
where $D_i$ is the age at death. Now note that we have an age interval $(A_i, B_i]$ on which the $i$th subject is under observation. On this time scale, the study subjects do not enter the study at the same time. Let's denote the minimum age at entry into the study by
$$
\alpha = \min_i A_i.
$$
What survival information do we have before time $\alpha$? None. This is why we can't say anything about the probability of surviving the age interval $(0, \alpha]$. Necessarily, our Kaplan-Meier estimate must be conditional on survival until age $\alpha$. To give an example: let's say that $\alpha$ is $1$ year. Would we be able to calculate the survivor function at time $5$ years, $S(5) = P(D > 5)$? Could we calculate how many children would live to see their fifth birthday? No, because we simply don't know how dangerous the first year is. We can only calculate the conditional survivor function $P(D > 5 \mid D > 1)$. Actually, this can again be explained by a change of time scale: there is nothing special about 0. Your Kaplan-Meier estimate doesn't have to start at time zero; it can start at some other time, which corresponds to, e.g., the time scale defined by age minus $\alpha$. In your data, you write that $\alpha$ is very small as some children are included very young; thus, for $s > \alpha$,
$$
P(D > s | D > \alpha) = S(s)/S(\alpha) \simeq S(s)
$$
and actually there is equality in the limit $\alpha \rightarrow 0$ if we assume $S$ to be continuous.
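To make the role of delayed entry concrete, here is a minimal sketch (my own code with invented toy data, not from the original question) of a Kaplan-Meier estimator on the age scale, where subject $i$ only contributes to the risk set on its own observation interval $(A_i, B_i]$:

```python
# Kaplan-Meier with delayed entry (left truncation) on the age scale.
# Subject i is at risk only on (a[i], b[i]]; the resulting estimate is
# conditional on survival to alpha = min(a).

def km_left_truncated(a, b, event):
    """a: entry ages, b: exit ages, event: True if death occurred at b[i]."""
    death_ages = sorted({t for t, e in zip(b, event) if e})
    s, curve = 1.0, []
    for t in death_ages:
        # risk set: entered strictly before age t and not yet exited
        at_risk = sum(1 for ai, bi in zip(a, b) if ai < t <= bi)
        deaths = sum(1 for bi, e in zip(b, event) if e and bi == t)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

# toy example: five subjects with entry/exit ages in years
ages_in  = [0.1, 0.5, 1.0, 0.2, 0.3]
ages_out = [2.0, 3.0, 4.0, 1.5, 5.0]
died     = [True, False, True, True, False]
print(km_left_truncated(ages_in, ages_out, died))
```

Here the estimate starts at $\alpha = 0.1$; values of $S$ at earlier ages are simply not identified from these data.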
Let's change back to your original time scale, plain calendar time. You have no idea how dangerous the time before $t_0$ is, so your estimate must be conditional on surviving until $t_0$. This stems from the fact that no children are observed before time $t_0$. On this time scale, it doesn't make much of a difference how close the times of birth are to $t_0$, since we have assumed the same hazard for all ages (instead of an age-specific hazard as above). To sum up, on this time scale (using calendar time), the interpretation of the Kaplan-Meier estimate would be that of (for $t \in (t_0, T]$)
$$
P(X > t | X > t_0).
$$
This is not as intuitive as on the age time scale; it just means that when doing a study in calendar time, we condition on the subjects having survived from birth until the start of our study.
To answer the last part of the question: you condition neither on $T_{first}$ nor on $L_1$; you condition on survival until $t_0$, as this is the minimum of the entry times. I think part of the confusion is due to the fact that all the children enter your study at the same time, which is not necessarily the case in all applications, as is evident from using age as the time scale above.
Finally, you could easily say that non-truncation corresponds to the truncation time being smaller than or equal to 0 (or some other natural starting point on a time scale).
This answer is based on information in the paper O'Neill (2021), which sets out mathematical properties of the Wilson score interval, including the finite population correction for inference to a finite population or the unsampled part of a finite population. Please see the linked paper for more details on the matter discussed here.
CI for the proportion in a finite population: You can find a derivation of the standard Wilson score interval in this related answer. The present analysis, adjusting for a finite population, is relatively simple to do using sample moment results found in O'Neill (2014).
To facilitate this analysis, define the effective sample size for the inference as:
$$n_* \equiv n \cdot \frac{N-1}{N-n}.$$
As in the linked paper, we will derive the confidence interval via standard manipulations but starting from a pivotal quantity for the finite-population inference. Our pivotal quantity is:
$$n_* \cdot \frac{(\hat{\theta}_n-\theta_N)^2}{\theta_N (1-\theta_N)} \overset{\text{Approx}}{\sim} \text{ChiSq}(1).$$
As in the question, let $\chi_{1,\alpha}^2$ denote the critical point of the chi-squared distribution with one degree-of-freedom (with upper tail area $\alpha$). For any confidence level $1-\alpha$ we then have the probability interval:
$$\begin{align}
1-\alpha
&\approx \mathbb{P} \Big( n_* (\hat{\theta}_n-\theta_N)^2 \leqslant \chi_{1,\alpha}^2 \theta_N (1-\theta_N) \Big) \\[6pt]
&= \mathbb{P} \Big( n_* (\hat{\theta}_n^2 - 2 \hat{\theta}_n \theta_N + \theta_N^2) \leqslant \chi_{1,\alpha}^2 (\theta_N-\theta_N^2) \Big) \\[6pt]
&= \mathbb{P} \Big( (n_* + \chi_{1,\alpha}^2) \theta_N^2 - (2 n_* \hat{\theta}_n + \chi_{1,\alpha}^2) \theta_N + n_* \hat{\theta}_n^2 \leqslant 0 \Big) \\[6pt]
&= \mathbb{P} \Bigg( \theta_N^2 - 2 \cdot\frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \cdot \theta_N + \frac{n_* \hat{\theta}_n^2}{n_* + \chi_{1,\alpha}^2} \leqslant 0 \Bigg) \\[6pt]
&= \mathbb{P} \Bigg( \bigg( \theta_N - \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \bigg)^2 \leqslant \frac{\chi_{1,\alpha}^2 (n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2)}{(n_* + \chi_{1,\alpha}^2)^2} \Bigg) \\[6pt]
&= \mathbb{P} \Bigg( \theta_N \in \Bigg[ \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_* + \chi_{1,\alpha}^2} \cdot \sqrt{n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg] \Bigg), \\[6pt]
\end{align}$$
and substitution of the observed sample proportion (for simplicity I will use the same notation for this value) then leads to the Wilson score interval:
$$\text{CI}_N(1-\alpha) = \Bigg[ \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_* + \chi_{1,\alpha}^2} \cdot \sqrt{n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg].$$
As can be seen, the finite-population correction in the Wilson score interval consists of replacing the sample size $n$ with the effective sample size $n_*$. This is an extremely simple adjustment, and it gives a confidence interval that is valid for any population size $N \geqslant n$ (finite or infinite). It is trivial to confirm that this reduces down to the standard Wilson score interval when $N = \infty$ (giving $n_* = n$). It is also useful to note that when you take a full census of the population with $N=n$ you get $n_* \rightarrow \infty$ and the confidence interval reduces to the single point $\text{CI}_n(1-\alpha) = [\hat{\theta}_n]$, just as we would expect.
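As a quick numerical check of the formula, here is a small sketch in Python (my own code, not from the paper); it uses the identity $\chi_{1,\alpha} = z_{1-\alpha/2}$, so only the standard-library normal quantile is needed:

```python
# Finite-population Wilson score interval for theta_N, as derived above.
from math import sqrt, inf
from statistics import NormalDist

def wilson_finite(theta_hat, n, N=inf, alpha=0.05):
    """Wilson interval for the population proportion, with FPC via n_*."""
    n_star = n if N == inf else n * (N - 1) / (N - n)  # effective sample size
    chi = NormalDist().inv_cdf(1 - alpha / 2)          # chi_{1,alpha}
    c2 = chi ** 2
    centre = (n_star * theta_hat + c2 / 2) / (n_star + c2)
    half = chi / (n_star + c2) * sqrt(n_star * theta_hat * (1 - theta_hat) + c2 / 4)
    return centre - half, centre + half

print(wilson_finite(0.3, 100))         # infinite population (standard Wilson)
print(wilson_finite(0.3, 100, N=500))  # finite population of size 500
```

Since the correction replaces $n$ with $n_* > n$, the finite-$N$ interval is narrower than the standard one, as the comparison above shows.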
CI for the unsampled proportion in a finite population: A useful variant of this problem occurs when you are interested in making an inference about the proportion of positive outcomes in the unsampled part of a finite population, which is $\theta_{n:N} \equiv \sum_{i=n+1}^N X_i/(N-n)$. In this case, we can use the pivotal quantity:
$$n_{**} \cdot \frac{(\hat{\theta}_n-\theta_{n:N})^2}{\theta_{n:N} (1-\theta_{n:N})} \overset{\text{Approx}}{\sim} \text{ChiSq}(1)
\quad \quad \quad n_{**} \equiv n \cdot \frac{N-n}{N-1},$$
which uses the alternative value $n_{**}$ for the effective sample size. The remaining calculations are the same as above, except that $n_{**}$ replaces $n_{*}$ and $\theta_{n:N}$ replaces $\theta_N$ in the formulae, giving the confidence interval:
$$\text{CI}_{n:N}(1-\alpha) = \Bigg[ \frac{n_{**} \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_{**} + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_{**} + \chi_{1,\alpha}^2} \cdot \sqrt{n_{**} \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg].$$
It is again trivial to confirm that this reduces down to the standard Wilson score interval when $N = \infty$ (giving $n_{**} = n$). It is also useful to note that when you take a full census of the population with $N=n$ you get $n_{**} = 0$ and the confidence interval reduces to the vacuous interval $\text{CI}_{n:n}(1-\alpha) = [0,1]$, just as we would expect.
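The census limit can be verified numerically with a short sketch (my own code, hypothetical names): with $N = n$ we get $n_{**} = 0$ and the interval collapses to $[0,1]$:

```python
# Wilson-type interval for the unsampled proportion theta_{n:N},
# using the effective sample size n_** = n(N-n)/(N-1).
from math import sqrt
from statistics import NormalDist

def wilson_unsampled(theta_hat, n, N, alpha=0.05):
    n2 = n * (N - n) / (N - 1)                 # effective sample size n_**
    chi = NormalDist().inv_cdf(1 - alpha / 2)  # chi_{1,alpha}
    c2 = chi ** 2
    centre = (n2 * theta_hat + c2 / 2) / (n2 + c2)
    half = chi / (n2 + c2) * sqrt(n2 * theta_hat * (1 - theta_hat) + c2 / 4)
    return centre - half, centre + half

print(wilson_unsampled(0.3, 100, 100))  # full census: essentially (0, 1)
print(wilson_unsampled(0.3, 100, 500))  # partial sample: informative interval
```

With $n_{**} = 0$ the centre becomes $\tfrac{1}{2}$ and the half-width $\tfrac{1}{2}$, giving the vacuous interval, in agreement with the limiting argument above.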
Implementation in R: These generalised confidence intervals are implemented in the `CONF.prop` function in the `stat.extend` package. The function implements the Wilson score intervals in the form shown above. By default the function uses the standard interval for the proportion parameter for an infinite population. However, the function also allows specification of a population size `N` and a logical value `unsampled` to give the above interval forms.
There are a few things beyond simple algebra that go into the derivation:
The linear confidence interval you list relies on the fact that the sampling distribution of the estimator $\hat{S}(t)$ is asymptotically normal (so your confidence interval really should be produced from a data set of reasonable size). This, in turn, can be shown by establishing that $\hat{S}(t)$ is a maximum likelihood estimator, together with the general fact that MLEs are asymptotically normally distributed.
Then you must know the invariance property of MLEs, which tells you that $\ln(\hat{S}(x))$ is also an MLE. It is therefore asymptotically normally distributed, and its asymptotic mean is $\ln(S(x))$. Thus,
$$ \frac{\ln (\hat{S}(x))}{\ln (S(x))} $$
is approximately normal with mean 1.
Now, as you say, we take another logarithm and use the invariance property again to determine that
$$ \ln \left( \frac{\ln (\hat{S}(x))}{\ln (S(x))} \right) $$
is approximately normal with mean 0.
The construction of the log-transformed confidence interval now follows by simple algebra, provided you can determine the variance of the above estimator. This is probably nigh on impossible to do exactly, but once more we can obtain an approximate value, this time using the delta method.
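For concreteness, here is a sketch of that delta-method step (using Greenwood's formula for the variance of $\hat{S}$, which is the standard choice here; note that $\ln(\ln\hat{S}(x)/\ln S(x)) = \ln(-\ln\hat{S}(x)) - \ln(-\ln S(x))$, and the second term is a constant):
$$
\operatorname{Var}\Big(\ln\big(-\ln \hat{S}(x)\big)\Big) \approx \frac{\operatorname{Var}\big(\hat{S}(x)\big)}{\big(\hat{S}(x) \ln \hat{S}(x)\big)^2},
\qquad
\operatorname{Var}\big(\hat{S}(x)\big) \approx \hat{S}(x)^2 \sum_{t_i \leq x} \frac{d_i}{n_i (n_i - d_i)},
$$
where $d_i$ deaths occur at time $t_i$ with $n_i$ subjects at risk. Plugging the first expression into the usual normal interval for $\ln(-\ln \hat{S}(x))$ and back-transforming then yields the log-transformed confidence interval.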