Solved – confidence interval and a single observation


There is a single observation of a random variable $Z$. We know that $Z$ comes from a normal distribution, and we also have an estimate of the variance of this distribution, $s_Z^2$.

The question is whether, with this information, we can place margins of error, say $\pm3s_Z$, on our single observation and claim that such an interval captures the population mean in roughly 99 percent of repeated observations of $Z$.

This must be a very simple question; however, since I strongly associate confidence intervals with sampling distributions rather than single observations, I am not sure about the answer. I would greatly appreciate any feedback. Thank you.

Some clarifying details to the question:

There is a parameter $Z=\frac{X}{X+Y}$, where $X=\Sigma_{i=1}^{N}x_i$ and $Y=\Sigma_{i=1}^{N}y_i$. Due to time and cost constraints, field measurements of $Z$ are usually limited to a single estimate. The number of recorded $x$ and $y$ values, $N$, may vary from 15 to 200 depending on the field survey. Because we are not able to record the entire population of $x$ and $y$, our estimate $Z$ will differ from the true value, $\mu_Z$ (the population mean). In principle, the true value could be obtained either by sampling the entire population of $x$ and $y$, or by sampling $Z$ an 'infinite' number of times for a fixed number, $N$, of $x$ and $y$ (which would drive the sampling error toward zero).

Given the field-survey constraints, I would like to find a way to determine an interval, based on a single estimate of $Z$, that brackets the true value, and to develop a strategy for field surveys so that a single estimate of $Z$ gives some useful insight into the true value. I am working out the analytical solution first and then verifying it against simulations performed on objects similar to the ones investigated in the field.

If we consider a case where we are able to obtain an 'infinitely' large sample of $Z$s for a fixed $N$ of $x$ and $y$, the variability of the output parameter $Z$ depends on the variability of the input quantities $X$ and $Y$, where $X=\Sigma_{i=1}^{N}x_i$ and $Y=\Sigma_{i=1}^{N}y_i$. By first-order error propagation, with $X$ and $Y$ uncorrelated, we would expect the variability of $X$ and $Y$ to contribute to the variability of $Z$ as
$$\sigma_Z=\sqrt{\left(\frac{\partial{Z}}{\partial{X}}\right)^2\sigma_X^2+\left(\frac{\partial{Z}}{\partial{Y}}\right)^2\sigma_Y^2},$$
and, since $\frac{\partial{Z}}{\partial{X}}=\frac{Y}{(X+Y)^2}$ and $\frac{\partial{Z}}{\partial{Y}}=-\frac{X}{(X+Y)^2}$,
$$\sigma_Z=\sqrt{\frac{\mu_{\Sigma{y}}^2}{(\mu_{\Sigma{x}}+\mu_{\Sigma{y}})^4}\sigma_{\Sigma{x}}^2+\frac{\mu_{\Sigma{x}}^2}{(\mu_{\Sigma{x}}+\mu_{\Sigma{y}})^4}\sigma_{\Sigma{y}}^2}.$$

In accordance with the Central Limit Theorem, the sum of $N$ independent, identically distributed random variables with mean $\mu$ and variance $\sigma^2$ converges to a normal distribution with mean $N\mu$ and variance $N\sigma^2$ as $N$ grows. Thus, if a sufficient number of $x$ and $y$ values are recorded, $\mu_{\Sigma{x}}=N\mu_{x}$, $\mu_{\Sigma{y}}=N\mu_{y}$ and $\sigma_{\Sigma{x}}^2=N\sigma_{x}^2$, $\sigma_{\Sigma{y}}^2=N\sigma_{y}^2$, which reduces the expression for the standard deviation of $Z$ to:
$$\sigma_Z=\frac{1}{(\overline{x}+\overline{y})^2}\sqrt{\frac{\overline{y}^{2}s_{x}^2+\overline{x}^{2}s_{y}^2}{N}}, \tag{1}$$

where $\overline{x}$, $\overline{y}$, $s_{x}$, $s_{y}$ are the best estimates of $\mu_{x}$, $\mu_{y}$, $\sigma_{x}$, $\sigma_{y}$.
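For concreteness, here is a minimal NumPy sketch of Eq. (1); the gamma-distributed $x$ and $y$ are made-up stand-ins for real field data:

```python
import numpy as np

def sigma_z_eq1(x, y):
    """Plug-in estimate of sigma_Z from Eq. (1), using the sample
    means and standard deviations of the recorded x and y values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)
    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    return np.sqrt((ybar**2 * sx**2 + xbar**2 * sy**2) / N) / (xbar + ybar)**2

# Hypothetical field sample: N paired measurements of x and y
# (distributions and parameters are purely illustrative).
rng = np.random.default_rng(0)
x = rng.gamma(shape=4.0, scale=2.0, size=100)
y = rng.gamma(shape=6.0, scale=1.5, size=100)
Z = x.sum() / (x.sum() + y.sum())
print(f"Z = {Z:.4f}, sigma_Z = {sigma_z_eq1(x, y):.4f}")
```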

However, even if the above assumptions hold, $\sigma_Z$ alone is not enough to determine a confidence interval; we also need the probability distribution of $Z$. We assume that the distribution of $Z$ is approximately normal. This assumption is not obviously justified; however, some studies have shown that the ratio of two correlated, normally distributed variables ($X$ and $X+Y$ are indeed correlated) is approximately normal when the coefficient of variation of the denominator is negligible. For $Z$, the coefficient of variation of the denominator is $CV_d=\frac{\sqrt{\sigma_x^2+\sigma_y^2}}{\sqrt{N}(\mu_x+\mu_y)}$, which shrinks as $N$ grows. To sum up, we assume that, when $N$ is large enough, $\sigma_Z$ can be calculated by Eq. (1) even for a single estimate of $Z$, and margins of error can be placed on this estimate by treating the distribution of $Z$ as normal.
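One quick way to probe this assumption is a Monte Carlo check: for assumed values of $\mu_x$, $\sigma_x$, $\mu_y$, $\sigma_y$ (made up here for illustration), compute $CV_d$ and draw many realizations of $Z$ to see how the skewness and excess kurtosis shrink as $N$ grows. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_x, sd_x = 8.0, 4.0   # assumed population parameters (illustrative only)
mu_y, sd_y = 9.0, 3.5

for N in (10, 100, 500):
    # CV of the denominator X + Y; shrinks like 1/sqrt(N)
    cv_d = np.sqrt(sd_x**2 + sd_y**2) / (np.sqrt(N) * (mu_x + mu_y))
    # Monte Carlo: many realizations of Z = X / (X + Y) at this N
    X = rng.normal(mu_x, sd_x, size=(10_000, N)).sum(axis=1)
    Y = rng.normal(mu_y, sd_y, size=(10_000, N)).sum(axis=1)
    Zs = X / (X + Y)
    print(f"N={N}: CV_d={cv_d:.4f}, skew={stats.skew(Zs):.3f}, "
          f"excess kurtosis={stats.kurtosis(Zs):.3f}")
```

Both moments should drift toward zero (the normal values) as $CV_d$ falls, which is the behavior the cited ratio-of-normals result predicts.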

Now we perform simulations. First, 12 objects are generated with properties similar to the objects investigated in the field. For each object we measure the population parameter $\mu_Z$ based on $\sim$150000 values of $x$ and $y$. We then obtain 200 estimates of $Z$ for each fixed number of $x$ and $y$: $N$ = 10, 20, 50, 100, 300 and 500. The figure below displays boxplots of the error of each estimate, $Z$, relative to the population mean, $\mu_Z$. An obvious conclusion from this figure is that the variation of the $Z$ estimates around the true value decreases as $N$ increases.
[Figure: boxplots of the estimation error $Z-\mu_Z$ for each object at $N$ = 10, 20, 50, 100, 300, 500]
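The experiment behind the figure can be sketched for a single hypothetical object as follows (the lognormal populations are an arbitrary stand-in, since the real objects are not described in detail):

```python
import numpy as np

rng = np.random.default_rng(2)

# One hypothetical "object": a large finite population of x and y values.
pop_x = rng.lognormal(mean=1.0, sigma=0.6, size=150_000)
pop_y = rng.lognormal(mean=1.2, sigma=0.5, size=150_000)
mu_Z = pop_x.sum() / (pop_x.sum() + pop_y.sum())   # "population" value of Z

for N in (10, 20, 50, 100, 300, 500):
    errors = []
    for _ in range(200):                            # 200 estimates per N
        idx = rng.choice(pop_x.size, size=N, replace=False)
        X, Y = pop_x[idx].sum(), pop_y[idx].sum()
        errors.append(X / (X + Y) - mu_Z)
    print(f"N={N}: spread of errors = {np.std(errors):.5f}")
```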

What about the confidence interval?
QQ-plots of the 200 estimates of $Z$ for each case (each object and each $N$) look roughly normal, sometimes with slight skewness or heavier tails. Interestingly, when margins of error $\pm1.96\sigma_{Z}^{200}$ are added to each estimate of $Z$, where $\sigma_{Z}^{200}$ is the standard deviation of the 200 estimates for that case, the mean of the 200 estimates and $\mu_Z$ are covered in 95% of the cases for 8 of the 12 objects when $N=500$, and in 94% of the cases for the remaining 4 objects. In my opinion, this can be read as a sign that $Z$ is well approximated by a normal distribution when $N$ = 500.

Adding margins of error equal to $\pm1.96\sigma_Z$, with $\sigma_Z$ calculated by Eq. (1), to each of the 200 estimates of $Z$ gives the following result: when $N$ = 500, the confidence interval brackets the mean of the 200 estimates and the population mean in $\sim$93% of the cases. At first I thought this undercoverage was related to using $\overline{x}$, $\overline{y}$, $s_{x}$, $s_{y}$ instead of $\mu_{x}$, $\mu_{y}$, $\sigma_{x}$, $\sigma_{y}$ in Eq. (1). However, when I plug the population parameters $\mu_x$, $\mu_y$, $\sigma_x$ and $\sigma_y$ (obtained from the $\sim$150000 values of $x$ and $y$) into Eq. (1) and give each estimate of $Z$ margins of error $\pm1.96\sigma_{Z}^{pop}$, the result is not much different from the $N$ = 500 case above: the interval includes the mean of the 200 estimates and the population mean, $\mu_Z$, in $\sim$92.5% of the cases.
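The coverage calculation can be reproduced along these lines (again with a hypothetical lognormal object; the point is the mechanics of pairing each estimate with its own Eq. (1) interval):

```python
import numpy as np

rng = np.random.default_rng(3)
pop_x = rng.lognormal(1.0, 0.6, size=150_000)   # hypothetical object, as above
pop_y = rng.lognormal(1.2, 0.5, size=150_000)
mu_Z = pop_x.sum() / (pop_x.sum() + pop_y.sum())

N, hits = 500, 0
for _ in range(200):
    idx = rng.choice(pop_x.size, size=N, replace=False)
    x, y = pop_x[idx], pop_y[idx]
    Z = x.sum() / (x.sum() + y.sum())
    # Eq. (1) with this sample's own means and standard deviations
    s = (np.sqrt((y.mean()**2 * x.var(ddof=1) + x.mean()**2 * y.var(ddof=1)) / N)
         / (x.mean() + y.mean())**2)
    hits += abs(Z - mu_Z) <= 1.96 * s
print("empirical coverage of the population value:", hits / 200)
```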

Based on the above, I think the assumption that $Z$ is normally distributed may be too strong for my data. But I don't know where to go from this dead end: what should I use instead of z-scores if the $Z$s are only approximately normal? There are no degrees of freedom here, so a $t$-interval does not obviously apply either. Any help would be extremely appreciated (as would any comments on the analysis in general). Thank you in advance.

Best Answer

The formula for a confidence interval for the mean, $\mu$, of a normally distributed population is:

$$\bar{x} \pm z_{\frac{\alpha}{2}}\frac{s_Z}{\sqrt{n}}$$

where $s_Z$ is the estimate of your population standard deviation $\left(\sqrt{s_Z^2}\right)$.

In your case, your sample is just one data point ($n= 1$) and $\alpha = 0.01$ for a $99\%$ confidence interval.

So, the $99\%$ confidence interval becomes

$$\bar{x} \pm 2.575829\,s_Z,$$

where $2.575829 = z_{0.005}$ and, with $n = 1$, $\bar{x}$ is just your single observation of $Z$.
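As a sanity check of the critical value, a small SciPy sketch (the observation and $s_Z$ values are placeholders):

```python
from scipy import stats

z_obs, s_Z = 0.47, 0.02   # placeholder single observation and its estimated sd
alpha = 0.01
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value, ~2.5758
print(f"99% CI: ({z_obs - z_crit * s_Z:.4f}, {z_obs + z_crit * s_Z:.4f})")
```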

This is theoretical. In practice, your estimated variance, $s_Z^2$, has to come from somewhere. If it had to come from the sample itself, a single observation would leave it undefined; in your setup it instead comes from an external calculation such as Eq. (1).