To characterize the amount of 2D dispersion around the centroid, you just want the (root) mean squared distance,
$$\hat\sigma=\text{RMS} = \sqrt{\frac{1}{n}\sum_i\left((x_i - \bar{x})^2 + (y_i - \bar{y})^2\right)}.$$
In this formula, $(x_i, y_i), i=1, 2, \ldots, n$ are the point coordinates and their centroid (point of averages) is $(\bar{x}, \bar{y}).$
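As a quick sketch, this statistic can be computed in R from coordinate vectors `x` and `y` (the function name `rms.distance` is my own, not standard):

```r
# Root mean squared distance of the points (x, y) from their centroid.
rms.distance <- function(x, y) {
  sqrt(mean((x - mean(x))^2 + (y - mean(y))^2))
}
```

For instance, the two points $(0,0)$ and $(2,0)$ have centroid $(1,0)$ and each lies at distance $1$ from it, so `rms.distance(c(0, 2), c(0, 0))` returns `1`.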
The question asks for the distribution of the distances. When the balls have an isotropic bivariate Normal distribution around their centroid--which is a standard and physically reasonable assumption--the squared distance is proportional to a chi-squared distribution with two degrees of freedom (one for each coordinate).

This is a direct consequence of one definition of the chi-squared distribution as a sum of squares of independent standard normal variables, because $$x_i - \bar{x} = \frac{n-1}{n}x_i - \sum_{j\ne i}\frac{1}{n}x_j$$ is a linear combination of independent normal variates with expectation $$\mathbb{E}[x_i - \bar{x}] = \frac{n-1}{n}\mathbb{E}[x_i] -\sum_{j\ne i}\frac{1}{n}\mathbb{E}[x_j] = 0.$$ Writing the common variance of the $x_i$ as $\sigma^2$, $$\mathbb{E}\left[\left(x_i -\bar{x}\right)^2\right]=\text{Var}(x_i - \bar{x}) = \left(\frac{n-1}{n}\right)^2\text{Var}(x_i) + \sum_{j\ne i}\left(\frac{1}{n}\right)^2\text{Var}(x_j) = \frac{n-1}{n}\sigma^2.$$ The assumption of isotropy is that the $y_j$ have the same distribution as the $x_i$ and are independent of them, so an identical result holds for the distribution of $(y_j - \bar{y})^2$. This establishes the constant of proportionality: the squared distances have a chi-squared distribution with two degrees of freedom, scaled by $\frac{n-1}{n}\sigma^2$.
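The variance formula $\text{Var}(x_i - \bar{x}) = \frac{n-1}{n}\sigma^2$ is easy to check by simulation; here is a quick sketch (the constants $n=5$, $\sigma=3$ are arbitrary choices for illustration):

```r
set.seed(1)
n <- 5; sigma <- 3; n.iter <- 1e5
# Each row is one sample of n independent Normal(0, sigma) draws.
x <- matrix(rnorm(n * n.iter, sd = sigma), n.iter, n)
# Deviation of the first coordinate from the sample mean, per iteration.
dev <- x[, 1] - rowMeans(x)
c(simulated = var(dev), theoretical = (n - 1)/n * sigma^2)
```

With these values the theoretical variance is $(4/5)\cdot 9 = 7.2$, and the simulated value agrees closely.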
The most severe test of these equations is the case $n=2$, for then the fraction $\frac{n-1}{n}$ differs the most from $1$. By simulating the experiment, both for $n=2$ and $n=40$, and overplotting the histograms of squared distances with the scaled chi-squared distributions (in red), we can verify this theory.
Each row shows the same data: on the left the x-axis is logarithmic; on the right it shows the actual squared distance. The true value of $\sigma$ for these simulations was set to $1$.
These results are for 100,000 iterations with $n=2$ and 50,000 iterations with $n=40$. The agreement between the histograms and the chi-squared densities is excellent.
Although $\sigma^2$ is unknown, it can be estimated in various ways. For instance, the mean squared distance should be $\frac{n-1}{n}\sigma^2$ times the mean of $\chi^2_2$, which is $2$. With $n=40$, for example, estimate $\sigma^2$ as $\frac{1}{2}\cdot\frac{40}{39} = \frac{40}{78}$ times the mean squared distance; an estimate of $\sigma$ is then $\sqrt{40/78}$ times the RMS distance. Using values of the $\chi^2_2$ distribution we can then say that:
Approximately 39% of the distances will be less than $\sqrt{39/40}\hat\sigma$, because 39% of a $\chi^2_2$ distribution is less than $1$.
Approximately 78% of the distances will be less than $\sqrt{3}$ times $\sqrt{39/40}\hat\sigma$, because 78% of a $\chi^2_2$ distribution is less than $3$.
And so on, for any multiple you care to use in place of $1$ or $3$. As a check, in the simulations for $n=40$ plotted previously, the actual proportions of squared distances less than $1, 2, \ldots, 10$ times $\frac{n-1}{n}\hat\sigma^2$ were
0.3932 0.6320 0.7767 0.8647 0.9178 0.9504 0.9700 0.9818 0.9890 0.9933
The theoretical proportions are
0.3935 0.6321 0.7769 0.8647 0.9179 0.9502 0.9698 0.9817 0.9889 0.9933
The agreement is excellent.
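These theoretical proportions come straight from the $\chi^2_2$ CDF, which has the closed form $1 - e^{-x/2}$ (a $\chi^2_2$ variable is exponential with mean $2$). A one-line check in R:

```r
# The chi-squared distribution with 2 df is exponential with mean 2,
# so its CDF has the closed form 1 - exp(-x/2).
k <- 1:10
round(pchisq(k, df = 2), 4)  # the theoretical proportions listed above
```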
Here is R code to conduct and analyze the simulations.
f <- function(n, n.iter, x.min=0, x.max=Inf, plot=TRUE) {
  #
  # Generate `n.iter` experiments in which `n` locations are generated using
  # standard normal variates for their coordinates.
  #
  xy <- array(rnorm(n*2*n.iter), c(n.iter, 2, n))
  #
  # Compute the squared distances to the centroids for each experiment.
  #
  xy.center <- apply(xy, c(1,2), mean)
  xy.distances2 <- apply(xy - array(xy.center, c(n.iter, 2, n)), c(1,3),
                         function(z) sum(z^2))
  #
  # Optionally plot histograms.
  #
  if (plot) {
    xy.plot <- xy.distances2[xy.distances2 >= x.min & xy.distances2 <= x.max]
    hist(log(xy.plot), prob=TRUE, breaks=30,
         main=paste("Histogram of log squared distance, n=", n),
         xlab="Log squared distance")
    # Density of the log squared distance: the scaled chi-squared density
    # times the Jacobian exp(x) of the log transformation.
    curve(dchisq(n/(n-1) * exp(x), df=2) * exp(x) * n/(n-1),
          from=log(min(xy.plot)), to=log(max(xy.plot)),
          n=513, add=TRUE, col="Red", lwd=2)
    hist(xy.plot, prob=TRUE, breaks=30,
         main=paste("Histogram of squared distance, n=", n),
         xlab="Squared distance")
    curve(n/(n-1) * dchisq(n/(n-1) * x, df=2),
          from=min(xy.plot), to=max(xy.plot),
          n=513, add=TRUE, col="Red", lwd=2)
  }
  return(xy.distances2)
}
#
# Plot the histograms and compare to scaled chi-squared distributions.
#
par(mfrow=c(2,2))
set.seed(17)
xy.distances2 <- f(2, 10^5, exp(-6), 6)
xy.distances2 <- f(n <- 40, n.iter <- 50000, exp(-6), 12)
#
# Compare the last simulation to cumulative chi-squared distributions.
#
sigma.hat <- sqrt(n / (2*(n-1)) * mean(xy.distances2))
print(cumsum(tabulate(cut(xy.distances2,
                          (0:10) * (n-1)/n * sigma.hat^2))) / (n*n.iter),
      digits=4)
print(pchisq(1:10, df=2), digits=4)
Best Answer
If your numbers can only be positive, then modeling them with a normal distribution may not be desirable, depending on your use case, because the normal distribution is supported on the entire real line. Perhaps you would want to model height as an exponential distribution, or maybe as a truncated normal distribution?
EDIT: After seeing your data, it really looks like it might fit an exponential distribution well! You could estimate the $\lambda$ parameter by taking, for example, a maximum likelihood approach.
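For an exponential model, the maximum likelihood estimate has a simple closed form, $\hat\lambda = 1/\bar{x}$, since the log-likelihood $n\log\lambda - \lambda\sum_i x_i$ is maximized there. A minimal sketch in R, with made-up positive observations standing in for your data:

```r
# Hypothetical positive observations (replace with your own data).
x <- c(0.8, 1.9, 0.3, 2.6, 1.1)
# MLE of the exponential rate: the reciprocal of the sample mean.
lambda.hat <- 1 / mean(x)
lambda.hat  # approximately 0.746 for these values
```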