Solved – confidence interval and a single observation


There is a single observation of a random variable $Z$. We know that $Z$ comes from a normal distribution, and we also have an estimate of the variance of this distribution, $s_Z^2$.

The question is whether, with this information, we can place margins of error, say $\pm3s_Z$, on our single observation and claim that such an interval captures the population mean in roughly 99 percent of repeated observations of $Z$.

This must be a very simple question; however, since I strongly associate confidence intervals with sampling distributions rather than single observations, I am not sure about the answer. I would greatly appreciate any feedback. Thank you.

Some clarifying details to the question:

There is a parameter $Z=\frac{X}{X+Y}$, where $X=\Sigma_{i=1}^{N}x_i$ and $Y=\Sigma_{i=1}^{N}y_i$. Due to time and cost constraints, field measurements of $Z$ are usually limited to a single estimate. The number of recorded $x$ and $y$ values, $N$, may vary from 15 to 200 depending on the field survey. Because we are not able to record the entire population of $x$ and $y$, our estimate $Z$ will differ from the true value, $\mu_Z$ (the population mean). In principle, the true value could be obtained either by sampling the entire population of $x$ and $y$, or by sampling $Z$ an 'infinite' number of times for a fixed number, $N$, of $x$ and $y$ (which would drive the sampling error toward zero).

Given the field-survey constraints, I would like to find a way to determine an interval, based on a single estimate of $Z$, that brackets the true value, and to develop a strategy for field surveys so that a single estimate of $Z$ gives some useful insight into the true value. I am working out the analytical solution first and then verifying it against simulations performed on objects similar to the ones investigated in the field.

If we consider a case where we are able to obtain an 'infinitely' large sample of $Z$s for a fixed $N$ of $x$ and $y$, the variability of the output parameter $Z$ depends on the variability of the input quantities $X$ and $Y$, where $X=\Sigma_{i=1}^{N}x_i$ and $Y=\Sigma_{i=1}^{N}y_i$. By first-order error propagation, with $X$ and $Y$ uncorrelated, we would expect the variability of $X$ and $Y$ to contribute to the variability of $Z$ as
$$\sigma_Z=\sqrt{\left(\frac{\partial{Z}}{\partial{X}}\right)^2\sigma_X^2+\left(\frac{\partial{Z}}{\partial{Y}}\right)^2\sigma_Y^2},$$
and, since $\frac{\partial{Z}}{\partial{X}}=\frac{Y}{(X+Y)^2}$ and $\frac{\partial{Z}}{\partial{Y}}=-\frac{X}{(X+Y)^2}$,
$$\sigma_Z=\sqrt{\frac{\mu_{\Sigma{y}}^2}{(\mu_{\Sigma{x}}+\mu_{\Sigma{y}})^4}\sigma_{\Sigma{x}}^2+\frac{\mu_{\Sigma{x}}^2}{(\mu_{\Sigma{x}}+\mu_{\Sigma{y}})^4}\sigma_{\Sigma{y}}^2}.$$

In accordance with the Central Limit Theorem, the sum of $N$ independent, identically distributed random variables with mean $\mu$ and variance $\sigma^2$ converges to a normal distribution with mean $N\mu$ and variance $N\sigma^2$ as $N$ grows. Thus, if a sufficient number of $x$ and $y$ values are recorded, $\mu_{\Sigma{x}}=N\mu_{x}$, $\mu_{\Sigma{y}}=N\mu_{y}$ and $\sigma_{\Sigma{x}}^2=N\sigma_{x}^2$, $\sigma_{\Sigma{y}}^2=N\sigma_{y}^2$, which reduces the expression for the standard deviation of $Z$ to:
$$\sigma_Z=\frac{1}{(\overline{x}+\overline{y})^2}\sqrt{\frac{\overline{y}^{2}s_{x}^2+\overline{x}^{2}s_{y}^2}{N}}, \tag{1}$$

where $\overline{x}$, $\overline{y}$, $s_{x}$, $s_{y}$ are the best estimates of $\mu_{x}$, $\mu_{y}$, $\sigma_{x}$, $\sigma_{y}$.
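For concreteness, here is a minimal NumPy sketch of Eq. (1); the gamma-distributed $x$ and $y$ are made-up stand-ins for real field data:

```python
import numpy as np

def sigma_z_eq1(x, y):
    """Plug-in estimate of sigma_Z from Eq. (1), using the sample
    means and standard deviations of the recorded x and y values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)
    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    return np.sqrt((ybar**2 * sx**2 + xbar**2 * sy**2) / N) / (xbar + ybar)**2

# Hypothetical field sample: N paired measurements of x and y
# (distributions and parameters are purely illustrative).
rng = np.random.default_rng(0)
x = rng.gamma(shape=4.0, scale=2.0, size=100)
y = rng.gamma(shape=6.0, scale=1.5, size=100)
Z = x.sum() / (x.sum() + y.sum())
print(f"Z = {Z:.4f}, sigma_Z = {sigma_z_eq1(x, y):.4f}")
```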

However, even if the above assumptions hold, $\sigma_Z$ alone is not enough to determine a confidence interval; we also need the probability distribution of $Z$. We assume that the distribution of $Z$ is approximately normal. This assumption is not obviously justified; however, some studies have shown that the ratio of two correlated, normally distributed variables ($X$ and $X+Y$ are indeed correlated) is approximately normal when the coefficient of variation of the denominator is negligible. For $Z$, the coefficient of variation of the denominator is $CV_d=\frac{\sqrt{\sigma_x^2+\sigma_y^2}}{\sqrt{N}(\mu_x+\mu_y)}$, which shrinks as $N$ grows. To sum up, we assume that, when $N$ is large enough, $\sigma_Z$ can be calculated by Eq. (1) even for a single estimate of $Z$, and margins of error can be placed on this estimate by treating the distribution of $Z$ as normal.
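One quick way to probe this assumption is a Monte Carlo check: for assumed values of $\mu_x$, $\sigma_x$, $\mu_y$, $\sigma_y$ (made up here for illustration), compute $CV_d$ and draw many realizations of $Z$ to see how the skewness and excess kurtosis shrink as $N$ grows. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_x, sd_x = 8.0, 4.0   # assumed population parameters (illustrative only)
mu_y, sd_y = 9.0, 3.5

for N in (10, 100, 500):
    # CV of the denominator X + Y; shrinks like 1/sqrt(N)
    cv_d = np.sqrt(sd_x**2 + sd_y**2) / (np.sqrt(N) * (mu_x + mu_y))
    # Monte Carlo: many realizations of Z = X / (X + Y) at this N
    X = rng.normal(mu_x, sd_x, size=(10_000, N)).sum(axis=1)
    Y = rng.normal(mu_y, sd_y, size=(10_000, N)).sum(axis=1)
    Zs = X / (X + Y)
    print(f"N={N}: CV_d={cv_d:.4f}, skew={stats.skew(Zs):.3f}, "
          f"excess kurtosis={stats.kurtosis(Zs):.3f}")
```

Both moments should drift toward zero (the normal values) as $CV_d$ falls, which is the behavior the cited ratio-of-normals result predicts.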

Now we perform simulations. First, 12 objects are generated with properties similar to the objects investigated in the field. For each object we measure the population parameter $\mu_Z$ based on $\sim$150000 values of $x$ and $y$. We then obtain 200 estimates of $Z$ for each fixed number of $x$ and $y$: $N$ = 10, 20, 50, 100, 300 and 500. The figure below displays boxplots of the error of each estimate, $Z$, relative to the population mean, $\mu_Z$. An obvious conclusion from this figure is that the variation of the $Z$ estimates around the true value decreases as $N$ increases.
[Figure: boxplots of the estimation error $Z-\mu_Z$ for each object at $N$ = 10, 20, 50, 100, 300, 500]
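The experiment behind the figure can be sketched for a single hypothetical object as follows (the lognormal populations are an arbitrary stand-in, since the real objects are not described in detail):

```python
import numpy as np

rng = np.random.default_rng(2)

# One hypothetical "object": a large finite population of x and y values.
pop_x = rng.lognormal(mean=1.0, sigma=0.6, size=150_000)
pop_y = rng.lognormal(mean=1.2, sigma=0.5, size=150_000)
mu_Z = pop_x.sum() / (pop_x.sum() + pop_y.sum())   # "population" value of Z

for N in (10, 20, 50, 100, 300, 500):
    errors = []
    for _ in range(200):                            # 200 estimates per N
        idx = rng.choice(pop_x.size, size=N, replace=False)
        X, Y = pop_x[idx].sum(), pop_y[idx].sum()
        errors.append(X / (X + Y) - mu_Z)
    print(f"N={N}: spread of errors = {np.std(errors):.5f}")
```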

What about the confidence interval?
QQ-plots of the 200 estimates of $Z$ for each case (each object and each $N$) look roughly normal, sometimes with slight skewness or heavier tails. Interestingly, when margins of error $\pm1.96\sigma_{Z}^{200}$ are added to each estimate of $Z$, where $\sigma_{Z}^{200}$ is the standard deviation of the 200 estimates for that case, the mean of the 200 estimates and $\mu_Z$ are covered in 95% of the cases for 8 of the 12 objects when $N=500$, and in 94% of the cases for the remaining 4 objects. In my opinion, this can be read as a sign that $Z$ is well approximated by a normal distribution when $N$ = 500.

Adding margins of error equal to $\pm1.96\sigma_Z$, with $\sigma_Z$ calculated by Eq. (1), to each of the 200 estimates of $Z$ gives the following result: when $N$ = 500, the confidence interval brackets the mean of the 200 estimates and the population mean in $\sim$93% of the cases. At first I thought this undercoverage was related to using $\overline{x}$, $\overline{y}$, $s_{x}$, $s_{y}$ instead of $\mu_{x}$, $\mu_{y}$, $\sigma_{x}$, $\sigma_{y}$ in Eq. (1). However, when I plug the population parameters $\mu_x$, $\mu_y$, $\sigma_x$ and $\sigma_y$ (obtained from the $\sim$150000 values of $x$ and $y$) into Eq. (1) and give each estimate of $Z$ margins of error $\pm1.96\sigma_{Z}^{pop}$, the result is not much different from the $N$ = 500 case above: the interval includes the mean of the 200 estimates and the population mean, $\mu_Z$, in $\sim$92.5% of the cases.
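The coverage calculation can be reproduced along these lines (again with a hypothetical lognormal object; the point is the mechanics of pairing each estimate with its own Eq. (1) interval):

```python
import numpy as np

rng = np.random.default_rng(3)
pop_x = rng.lognormal(1.0, 0.6, size=150_000)   # hypothetical object, as above
pop_y = rng.lognormal(1.2, 0.5, size=150_000)
mu_Z = pop_x.sum() / (pop_x.sum() + pop_y.sum())

N, hits = 500, 0
for _ in range(200):
    idx = rng.choice(pop_x.size, size=N, replace=False)
    x, y = pop_x[idx], pop_y[idx]
    Z = x.sum() / (x.sum() + y.sum())
    # Eq. (1) with this sample's own means and standard deviations
    s = (np.sqrt((y.mean()**2 * x.var(ddof=1) + x.mean()**2 * y.var(ddof=1)) / N)
         / (x.mean() + y.mean())**2)
    hits += abs(Z - mu_Z) <= 1.96 * s
print("empirical coverage of the population value:", hits / 200)
```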

Based on the above, I think the assumption that $Z$ is normally distributed may be too strong for my data. But I don't know where to go from this dead end: what should I use instead of z-scores if the $Z$s are only approximately normal? There are no degrees of freedom here, so a $t$-interval does not obviously apply either. Any help would be extremely appreciated (as would any comments on the analysis in general). Thank you in advance.

Best Answer

The formula for a confidence interval for the mean, $\mu$, of a normally distributed population is:

$$\bar{x} \pm z_{\frac{\alpha}{2}}\frac{s_Z}{\sqrt{n}}$$

where $s_Z$ is the estimate of your population standard deviation $\left(\sqrt{s_Z^2}\right)$.

In your case, your sample is just one data point ($n= 1$) and $\alpha = 0.01$ for a $99\%$ confidence interval.

So, the $99\%$ confidence interval becomes

$$\bar{x} \pm 2.575829\,s_Z,$$

where $2.575829 = z_{0.005}$ and, with $n = 1$, $\bar{x}$ is just your single observation of $Z$.
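As a sanity check of the critical value, a small SciPy sketch (the observation and $s_Z$ values are placeholders):

```python
from scipy import stats

z_obs, s_Z = 0.47, 0.02   # placeholder single observation and its estimated sd
alpha = 0.01
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value, ~2.5758
print(f"99% CI: ({z_obs - z_crit * s_Z:.4f}, {z_obs + z_crit * s_Z:.4f})")
```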

This is theoretical. In practice, your estimated variance, $s_Z^2$, has to come from somewhere. If it had to come from the sample itself, a single observation would leave it undefined; in your setup it instead comes from an external calculation such as Eq. (1).