Confidence Interval – Calculating Confidence Interval for Distance from Center in Bivariate Data

bivariate, confidence interval, distance, normal distribution

We have a bivariate normal process where $X \sim N(\mu_x, \sigma), \, Y \sim N(\mu_y, \sigma)$, with no covariance.

$(\mu_x, \mu_y)$ are unknown.

(For convenience we can assert that $\sigma = 1$, or that we have a good estimate for its value.)

We are trying to characterize the distance between our sample center and the true center $(\mu_x, \mu_y)$ as a function of the number of shots sampled, $n$.

Because we don't care about the location of the true center, only our distance from it, we assert that $\mu_x = \mu_y = 0$ and look at the random variable $R(n) = \sqrt{\bar{x}^2 + \bar{y}^2}$, the distance between the sample center $(\bar{x}, \bar{y})$ and the true center.

Question: How can we characterize the confidence interval of R(n)?

Note that $R(n) \ge 0$ and $E[R(n)] \to 0$ as $n \to \infty$

I have Monte Carlo estimates of both the mean and standard deviation of R(n) for small n.

I want to calculate confidence levels and intervals for $R(n)$. That is, given $n$ and a confidence level of 90%, what is the confidence interval of a sample $R(n)$ about its population mean?

I don't believe this is amenable to CLT analysis because the values are bounded at 0.

I suppose I could Monte Carlo the empirical distribution function (EDF), since I'm only interested in $n \in [2, 30]$ and the EDF must scale with $\sigma$ or $\sigma^2$. But first I want to make sure I'm not missing something obvious or a known closed-form expression.
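
A minimal sketch of that Monte Carlo approach in Matlab/Octave, assuming $\sigma = 1$ and a true center at the origin. The replicate count `m` and the value `n = 10` are illustrative choices, not taken from the question. Quantiles are read directly off the sorted sample, so no toolbox function is needed.

```octave
% Monte Carlo sketch of the empirical distribution of R(n).
% Assumptions (not from the question): m replicates, n = 10, sigma = 1.
m = 1e5;                               % number of simulated samples
n = 10;                                % shots per sample
sigma = 1;                             % common standard deviation of X and Y
xbar = mean(sigma * randn(m, n), 2);   % x coordinate of each sample center
ybar = mean(sigma * randn(m, n), 2);   % y coordinate of each sample center
R = sort(sqrt(xbar.^2 + ybar.^2));     % sorted distances to the true center (0,0)
lo = R(ceil(0.05 * m));                % empirical 5% quantile
hi = R(ceil(0.95 * m));                % empirical 95% quantile
fprintf('n = %d: 90%% of simulated R(n) lies in [%.4f, %.4f]\n', n, lo, hi);
```

Because $R(n)$ scales linearly with $\sigma$, bounds for another $\sigma$ follow by multiplying `lo` and `hi` by it.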

Best Answer

Look at the $\chi$ distribution: it is the square root of a $\chi^2$ distribution, which is in turn the distribution of a sum of squared standard normals.

CORRECTION: $R(n)^2=\sum_{i=1}^n x_i^2+\sum_{i=1}^n y_i^2 = \sum_{i=1}^{2n}z_i^2$, where $z_i=x_i$ for $i=1,\dots,n$ and $z_i=y_{i-n}$ for $i=n+1,\dots,2n$.

Hence, $R(n) = \sigma r(2n)$, where $r(k)\sim \chi(k)$.

$E[r(2n)]=\mu$, where $\mu=\sqrt{2}\,\Gamma((2n+1)/2)/\Gamma(n)$. The variance is $Var[r(2n)]=2n-\mu^2$; see the $\chi$ distribution. Consequently, $E[R(n)]=\sigma E[r(2n)]$, and so on.
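
As a quick numerical check of these moment formulas, here is a small Matlab/Octave sketch. The list of $n$ values is an illustrative assumption, and `gammaln` is used only to avoid overflow in the gamma ratio for larger $n$.

```octave
% Mean and variance of r(2n) ~ chi(2n), in units of sigma.
n = [1 2 5 10 30];                                      % illustrative values of n
mu = sqrt(2) * exp(gammaln((2*n + 1)/2) - gammaln(n));  % E[r(2n)]
v = 2*n - mu.^2;                                        % Var[r(2n)]
disp([n' mu' v']);                                      % columns: n, mean, variance
```

Multiply `mu` by $\sigma$ and `v` by $\sigma^2$ to get the moments of $R(n) = \sigma r(2n)$.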

UPDATE: the CDF of $r(2n)$ is given by the regularized lower incomplete gamma function, $P(n, r(2n)^2/2)$. To compute a confidence bound CB, solve $P(n, CB^2/2)=\alpha$ for CB, where $\alpha$ is the cumulative probability level, such as 5% or 95%. CB will be in units of $\sigma$. Your math library should have the regularized gamma function; if it doesn't have its inverse, use a root solver to find the CBs.
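
A sketch of the solver route in Matlab/Octave, under the assumption that a $\chi(k)$ variable has CDF $P(k/2, x^2/2)$, so that $k = 2n$ gives $P(n, x^2/2)$. The helper names `chi_cdf` and `cb`, the bracket width, and `n = 10` are illustrative choices. Note that `gammainc(x, a)` takes the integration limit first and the shape parameter second.

```octave
% Solve P(n, CB^2/2) = alpha for CB, the alpha-quantile of chi(2n),
% expressed in units of sigma.
% gammainc(x, a) returns the regularized lower incomplete gamma P(a, x).
chi_cdf = @(c, n) gammainc(c.^2 / 2, n);
% Bracket [0, sqrt(2n) + 10]: the CDF is 0 at the left end and
% numerically 1 at the right end for the n of interest here.
cb = @(n, alpha) fzero(@(c) chi_cdf(c, n) - alpha, [0, sqrt(2*n) + 10]);

n = 10;                    % illustrative value
cb_lo = cb(n, 0.05);       % 5% bound
cb_hi = cb(n, 0.95);       % 95% bound
fprintf('n = %d: 5%% bound %.4f, 95%% bound %.4f (units of sigma)\n', n, cb_lo, cb_hi);
```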

RESTATED problem

I think it's best to redefine $R(n)$ as $R(n)=\frac{1}{n}\sum_{i=1}^n r_i=\frac{1}{n}\sum_{i=1}^n\sqrt{X_i^2+Y_i^2}$. This means that you compute the distance $r_i$ for each pair of $(X_i,Y_i)$ coordinates, then average it across the $n$ observations to get $R(n)$. Now it's clear that $r_i^2\sim\chi^2_2$, assuming that $X$ and $Y$ are standardized normals, while $r_i\sim\chi_2$, i.e. the chi distribution with 2 degrees of freedom. I gave the links to this distribution; you should be able to work out the math for non-standard normals.

Next, $R(n)$ is the average of $n$ i.i.d. $\chi_2$-distributed numbers, so the CLT should be applicable. For $n=30$ the CLT should work well. For smaller $n$, I would run a Monte Carlo simulation and test the result with Jarque-Bera or a similar normality test. If it's normal enough, apply the CLT to $R(n)$, using the closed forms for the moments of $r_i$.
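
A rough sketch of what that CLT interval could look like in Matlab/Octave, assuming standardized $X$ and $Y$ and using the $\chi_2$ mean $\sqrt{2}\,\Gamma(3/2)/\Gamma(1)$ and variance $2-\mu^2$. The variable names and the choice `n = 30` are illustrative. The normal critical value is computed with `erfinv` so that no statistics toolbox is required.

```octave
% CLT-based two-sided interval for R(n) = mean of n chi(2) distances.
n = 30;                                 % illustrative sample size
sigma = 1;                              % common standard deviation of X and Y
conf = 0.90;                            % two-sided confidence level
mu1 = sqrt(2) * gamma(1.5) / gamma(1);  % E[r_i] for chi(2), units of sigma
v1 = 2 - mu1^2;                         % Var[r_i] for chi(2), units of sigma^2
z = sqrt(2) * erfinv(conf);             % normal critical value: P(|Z| < z) = conf
half = z * sqrt(v1 / n);                % half-width from Var[R(n)] = Var[r_i]/n
ci = sigma * [mu1 - half, mu1 + half];  % approximate interval for R(n)
fprintf('n = %d: approx. %.0f%% interval [%.4f, %.4f]\n', n, 100*conf, ci(1), ci(2));
```

This is only as good as the normal approximation, which is why the normality check for small $n$ matters.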

Example: $(\mu_X,\mu_Y)=(0,0)$, $\sigma_X=\sigma_Y=1$, $\sigma_{X,Y}=0$.

$E[R(1)]=E[r_1]=\mu=\sqrt{2}\frac{\Gamma(\frac{2+1}{2})}{\Gamma(1)}=1.2533$

$Var[R(1)]=Var[r_1]=2\cdot 1-\mu^2= 0.4292$

You can test this with the following Matlab/Octave code:

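    m = 1e6;            % number of simulated values of R
    n = 1;              % number of (X,Y) pairs averaged to compute each R
    mu = zeros(1,2);    % zero means (row vector, as mvnrnd expects)
    Sig = eye(2);       % unit variances, zero covariance
    x = mvnrnd(mu,Sig,m*n)';                 % generate X,Y pairs, 2-by-(m*n)
    x = permute(reshape(x,2,n,m),[3,2,1]);   % rearrange into an m-by-n-by-2 array

    r = sqrt(sum(x.^2,3));   % distance r_i of each pair from the origin
    R = mean(r,2);           % R(n): average distance per sample
    hist(R)                  % show the histogram
    jbtest(R)                % Jarque-Bera test of normality

    [mean(R) std(R) var(R)]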

Which outputs:

    ans =

        1.2519    0.6551    0.4291

[Image: histogram of R for n = 1]

Now you can run the same for higher n=30, and get the output:

    ans =

        1.2534    0.1197    0.0143

Applying the CLT approximation gives $Var[R(30)]_{CLT}=Var[R(1)]/30=0.0143$, a very good match. Here is the histogram:

[Image: histogram of R for n = 30]
