Solved – How to calculate 2D standard deviation, with 0 mean, bounded by limits

Tags: normal distribution, standard deviation

My problem is as follows:
I drop 40 balls at once from a certain point, a few meters above the floor.
The balls roll and come to rest.
Using computer vision, I calculate the center of mass in the X-Y plane.
I am only interested in the distance from the center of mass to each ball, which is calculated using simple geometry.
Now, I want to know the one-sided standard deviation from the center.
That way I would know that a certain fraction of the balls lies within one std radius, a larger fraction within a 2*std radius, and so on.
How do I calculate this one-sided standard deviation? A standard approach would put half of the balls on the "negative side" of the zero mean, which of course makes no sense in this experiment.
Do I have to make sure that the ball positions follow a normal distribution?
Thank you for any help.

Best Answer

To characterize the amount of 2D dispersion around the centroid, you just want the (root) mean squared distance,

$$\hat\sigma=\text{RMS} = \sqrt{\frac{1}{n}\sum_i\left((x_i - \bar{x})^2 + (y_i - \bar{y})^2\right)}.$$

In this formula, $(x_i, y_i), i=1, 2, \ldots, n$ are the point coordinates and their centroid (point of averages) is $(\bar{x}, \bar{y}).$
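In R this is a one-liner; here is a minimal sketch, where `x` and `y` are hypothetical vectors of the observed coordinates:

x <- c(0.3, -1.2, 0.8, 0.1, -0.5)  # hypothetical x coordinates of the balls
y <- c(1.1, 0.4, -0.9, -0.2, 0.6)  # hypothetical y coordinates
sigma.hat <- sqrt(mean((x - mean(x))^2 + (y - mean(y))^2))  # RMS distance to the centroid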


The question asks for the distribution of the distances. When the balls have an isotropic bivariate Normal distribution around their centroid--which is a standard and physically reasonable assumption--the squared distance is proportional to a chi-squared distribution with two degrees of freedom (one for each coordinate). This is a direct consequence of one definition of the chi-squared distribution as a sum of squares of independent standard normal variables, because

$$x_i - \bar{x} = \frac{n-1}{n}x_i - \sum_{j\ne i}\frac{1}{n}x_j$$

is a linear combination of independent normal variates with expectation

$$\mathbb{E}[x_i - \bar{x}] = \frac{n-1}{n}\mathbb{E}[x_i] - \sum_{j\ne i}\frac{1}{n}\mathbb{E}[x_j] = 0.$$

Writing the common variance of the $x_i$ as $\sigma^2$,

$$\mathbb{E}\left[\left(x_i -\bar{x}\right)^2\right] = \text{Var}(x_i - \bar{x}) = \left(\frac{n-1}{n}\right)^2\text{Var}(x_i) + \sum_{j\ne i}\left(\frac{1}{n}\right)^2\text{Var}(x_j) = \frac{n-1}{n}\sigma^2.$$

The assumption of isotropy is that the $y_j$ have the same distribution as the $x_i$ and are independent of them, so an identical result holds for the distribution of $(y_j - \bar{y})^2$. This establishes the constant of proportionality: the squares of the distances have a chi-squared distribution with two degrees of freedom, scaled by $\frac{n-1}{n}\sigma^2$.
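The variance calculation above is easy to confirm numerically. A quick sketch, with arbitrary choices of $n$ and $\sigma$:

n <- 5; sigma <- 2; n.sim <- 1e5
x <- matrix(rnorm(n * n.sim, sd=sigma), n.sim, n)  # each row is one sample of size n
var(x[, 1] - rowMeans(x))                          # close to (n-1)/n * sigma^2
(n - 1)/n * sigma^2                                # = 3.2 here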

The most severe test of these equations is the case $n=2$, for then the fraction $\frac{n-1}{n}$ differs the most from $1$. By simulating the experiment, both for $n=2$ and $n=40$, and overplotting the histograms of squared distances with the scaled chi-squared distributions (in red), we can verify this theory.

[Figure: histograms of the simulated squared distances for $n=2$ (top row) and $n=40$ (bottom row), each overplotted with the scaled chi-squared density in red.]

Each row shows the same data: on the left the x-axis is logarithmic; on the right it shows the actual squared distance. The true value of $\sigma$ for these simulations was set to $1$.

These results are for 100,000 iterations with $n=2$ and 50,000 iterations with $n=40$. The agreement between the histograms and the chi-squared densities is excellent.


Although $\sigma^2$ is unknown, it can be estimated in various ways. For instance, the mean squared distance should be $\frac{n-1}{n}\sigma^2$ times the mean of $\chi^2_2$, which is $2$. With $n=40$, for example, estimate $\sigma^2$ as $\frac{40}{39}/2$ times the mean squared distance. Thus an estimate of $\sigma$ would be $\sqrt{40/78}$ times the RMS distance.
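As a minimal sketch in R (simulated positions stand in for the measured ones; `d2` is the vector of squared distances):

n <- 40
x <- rnorm(n); y <- rnorm(n)                     # stand-ins for measured ball coordinates
d2 <- (x - mean(x))^2 + (y - mean(y))^2          # squared distances to the centroid
sigma.hat <- sqrt(n / (2 * (n - 1)) * mean(d2))  # = sqrt(40/78) * RMS distance

Using values of the $\chi^2_2$ distribution we can then say that: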

  • Approximately 39% of the distances will be less than $\sqrt{39/40}\hat\sigma$, because 39% of a $\chi^2_2$ distribution is less than $1$.

  • Approximately 78% of the distances will be less than $\sqrt{3}$ times $\sqrt{39/40}\hat\sigma$, because 78% of a $\chi^2_2$ distribution is less than $3$. (Both percentages can be verified with the short check below.)
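A quick way to confirm these $\chi^2_2$ tail values in R:

round(pchisq(c(1, 3), df=2), 4)  # 0.3935 0.7769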

And so on, for any multiple you care to use in place of $1$ or $3$. As a check, in the simulations for $n=40$ plotted previously, here are the actual proportions of squared distances less than $1, 2, \ldots, 10$ times $\frac{n-1}{n}\hat\sigma^2$, together with the theoretical proportions:

Simulated:   0.3932 0.6320 0.7767 0.8647 0.9178 0.9504 0.9700 0.9818 0.9890 0.9933
Theoretical: 0.3935 0.6321 0.7769 0.8647 0.9179 0.9502 0.9698 0.9817 0.9889 0.9933

The agreement is excellent.


Here is R code to conduct and analyze the simulations.

f <- function(n, n.iter, x.min=0, x.max=Inf, plot=TRUE) {
  #
  # Generate `n.iter` experiments in which `n` locations are generated using
  # standard normal variates for their coordinates.
  #
  xy <- array(rnorm(n*2*n.iter), c(n.iter,2,n))
  #
  # Compute the squared distances to the centers for each experiment.
  #
  xy.center <- apply(xy, c(1,2), mean)
  xy.distances2 <- apply(xy-array(xy.center, c(n.iter,2,n)), c(1,3), 
                         function(z) sum(z^2))
  #
  # Optionally plot histograms.
  #
  if(plot) {
    xy.plot <- xy.distances2[xy.distances2 >= x.min & xy.distances2 <= x.max]

    hist(log(xy.plot), prob=TRUE, breaks=30,
         main=paste("Histogram of log squared distance, n=", n),
         xlab="Log squared distance")
    curve(dchisq(n/(n-1) * exp(x), df=2) * exp(x) * n/(n-1), 
          from=log(min(xy.plot)), to=log(max(xy.plot)), 
          n=513, add=TRUE, col="Red", lwd=2)

    hist(xy.plot, prob=TRUE, breaks=30,
         main=paste("Histogram of squared distance, n=", n),
         xlab="Squared distance")
    curve(n/(n-1) * dchisq(n/(n-1) * x, df=2), 
          from=min(xy.plot), to=max(xy.plot), 
          n=513, add=TRUE, col="Red", lwd=2)  
  }
  return(xy.distances2)
}
#
# Plot the histograms and compare to scaled chi-squared distributions.
#
par(mfrow=c(2,2))
set.seed(17)
xy.distances2 <- f(2, 10^5, exp(-6), 6)
xy.distances2 <- f(n <- 40, n.iter <- 50000, exp(-6), 12)
#
# Compare the last simulation to cumulative chi-squared distributions.
#
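# sigma.hat inverts E[squared distance] = 2 * (n-1)/n * sigma^2.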
sigma.hat <- sqrt((n / (2*(n-1)) * mean(xy.distances2)))
print(cumsum(tabulate(cut(xy.distances2, 
                    (0:10) * (n-1)/n * sigma.hat^2))) / (n*n.iter), digits=4)
print(pchisq(1:10, df=2), digits=4)