To characterize the amount of 2D dispersion around the centroid, you just want the (root) mean squared distance,
$$\hat\sigma=\text{RMS} = \sqrt{\frac{1}{n}\sum_i\left((x_i - \bar{x})^2 + (y_i - \bar{y})^2\right)}.$$
In this formula, $(x_i, y_i), i=1, 2, \ldots, n$ are the point coordinates and their centroid (point of averages) is $(\bar{x}, \bar{y}).$
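As a quick sketch, this statistic can be computed in R from coordinate vectors `x` and `y` (the function name `rms.distance` is my own, not standard):

```r
# Root mean squared distance of the points (x, y) from their centroid.
rms.distance <- function(x, y) {
  sqrt(mean((x - mean(x))^2 + (y - mean(y))^2))
}
```

For instance, the two points $(0,0)$ and $(2,0)$ have centroid $(1,0)$ and each lies at distance $1$ from it, so `rms.distance(c(0, 2), c(0, 0))` returns `1`.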
The question asks for the distribution of the distances. When the balls have an isotropic bivariate Normal distribution around their centroid--which is a standard and physically reasonable assumption--the squared distance is proportional to a chi-squared distribution with two degrees of freedom (one for each coordinate).

This is a direct consequence of one definition of the chi-squared distribution as a sum of squares of independent standard normal variables, because $$x_i - \bar{x} = \frac{n-1}{n}x_i - \sum_{j\ne i}\frac{1}{n}x_j$$ is a linear combination of independent normal variates with expectation $$\mathbb{E}[x_i - \bar{x}] = \frac{n-1}{n}\mathbb{E}[x_i] -\sum_{j\ne i}\frac{1}{n}\mathbb{E}[x_j] = 0.$$ Writing the common variance of the $x_i$ as $\sigma^2$, $$\mathbb{E}\left[\left(x_i -\bar{x}\right)^2\right]=\text{Var}(x_i - \bar{x}) = \left(\frac{n-1}{n}\right)^2\text{Var}(x_i) + \sum_{j\ne i}\left(\frac{1}{n}\right)^2\text{Var}(x_j) = \frac{n-1}{n}\sigma^2.$$ The assumption of isotropy is that the $y_j$ have the same distribution as the $x_i$ and are independent of them, so an identical result holds for the distribution of $(y_j - \bar{y})^2$. This establishes the constant of proportionality: the squared distances have a chi-squared distribution with two degrees of freedom, scaled by $\frac{n-1}{n}\sigma^2$.
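The variance formula $\text{Var}(x_i - \bar{x}) = \frac{n-1}{n}\sigma^2$ is easy to check by simulation; here is a quick sketch (the constants $n=5$, $\sigma=3$ are arbitrary choices for illustration):

```r
set.seed(1)
n <- 5; sigma <- 3; n.iter <- 1e5
# Each row is one sample of n independent Normal(0, sigma) draws.
x <- matrix(rnorm(n * n.iter, sd = sigma), n.iter, n)
# Deviation of the first coordinate from the sample mean, per iteration.
dev <- x[, 1] - rowMeans(x)
c(simulated = var(dev), theoretical = (n - 1)/n * sigma^2)
```

With these values the theoretical variance is $(4/5)\cdot 9 = 7.2$, and the simulated value agrees closely.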
The most severe test of these equations is the case $n=2$, for then the fraction $\frac{n-1}{n}$ differs the most from $1$. By simulating the experiment, both for $n=2$ and $n=40$, and overplotting the histograms of squared distances with the scaled chi-squared distributions (in red), we can verify this theory.
Each row shows the same data: on the left the x-axis is logarithmic; on the right it shows the actual squared distance. The true value of $\sigma$ for these simulations was set to $1$.
These results are for 100,000 iterations with $n=2$ and 50,000 iterations with $n=40$. The agreement between the histograms and the chi-squared densities is excellent.
Although $\sigma^2$ is unknown, it can be estimated in various ways. For instance, the mean squared distance should be $\frac{n-1}{n}\sigma^2$ times the mean of $\chi^2_2$, which is $2$. With $n=40$, for example, estimate $\sigma^2$ as $\frac{1}{2}\cdot\frac{40}{39} = \frac{40}{78}$ times the mean squared distance; an estimate of $\sigma$ is then $\sqrt{40/78}$ times the RMS distance. Using values of the $\chi^2_2$ distribution we can then say that:
Approximately 39% of the distances will be less than $\sqrt{39/40}\hat\sigma$, because 39% of a $\chi^2_2$ distribution is less than $1$.
Approximately 78% of the distances will be less than $\sqrt{3}$ times $\sqrt{39/40}\hat\sigma$, because 78% of a $\chi^2_2$ distribution is less than $3$.
And so on, for any multiple you care to use in place of $1$ or $3$. As a check, in the simulations for $n=40$ plotted previously, the actual proportions of squared distances less than $1, 2, \ldots, 10$ times $\frac{n-1}{n}\hat\sigma^2$ were
0.3932 0.6320 0.7767 0.8647 0.9178 0.9504 0.9700 0.9818 0.9890 0.9933
The theoretical proportions are
0.3935 0.6321 0.7769 0.8647 0.9179 0.9502 0.9698 0.9817 0.9889 0.9933
The agreement is excellent.
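These theoretical proportions come straight from the $\chi^2_2$ CDF, which has the closed form $1 - e^{-x/2}$ (a $\chi^2_2$ variable is exponential with mean $2$). A one-line check in R:

```r
# The chi-squared distribution with 2 df is exponential with mean 2,
# so its CDF has the closed form 1 - exp(-x/2).
k <- 1:10
round(pchisq(k, df = 2), 4)  # the theoretical proportions listed above
```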
Here is R code to conduct and analyze the simulations.
f <- function(n, n.iter, x.min=0, x.max=Inf, plot=TRUE) {
  #
  # Generate `n.iter` experiments in which `n` locations are generated using
  # standard normal variates for their coordinates.
  #
  xy <- array(rnorm(n*2*n.iter), c(n.iter, 2, n))
  #
  # Compute the squared distances to the centroids for each experiment.
  #
  xy.center <- apply(xy, c(1,2), mean)
  xy.distances2 <- apply(xy - array(xy.center, c(n.iter, 2, n)), c(1,3),
                         function(z) sum(z^2))
  #
  # Optionally plot histograms.
  #
  if (plot) {
    xy.plot <- xy.distances2[xy.distances2 >= x.min & xy.distances2 <= x.max]
    hist(log(xy.plot), prob=TRUE, breaks=30,
         main=paste("Histogram of log squared distance, n=", n),
         xlab="Log squared distance")
    # Density of the log squared distance: the scaled chi-squared density
    # times the Jacobian exp(x) of the log transformation.
    curve(dchisq(n/(n-1) * exp(x), df=2) * exp(x) * n/(n-1),
          from=log(min(xy.plot)), to=log(max(xy.plot)),
          n=513, add=TRUE, col="Red", lwd=2)
    hist(xy.plot, prob=TRUE, breaks=30,
         main=paste("Histogram of squared distance, n=", n),
         xlab="Squared distance")
    curve(n/(n-1) * dchisq(n/(n-1) * x, df=2),
          from=min(xy.plot), to=max(xy.plot),
          n=513, add=TRUE, col="Red", lwd=2)
  }
  return(xy.distances2)
}
#
# Plot the histograms and compare to scaled chi-squared distributions.
#
par(mfrow=c(2,2))
set.seed(17)
xy.distances2 <- f(2, 10^5, exp(-6), 6)
xy.distances2 <- f(n <- 40, n.iter <- 50000, exp(-6), 12)
#
# Compare the last simulation to cumulative chi-squared distributions.
#
sigma.hat <- sqrt(n / (2*(n-1)) * mean(xy.distances2))
print(cumsum(tabulate(cut(xy.distances2,
                          (0:10) * (n-1)/n * sigma.hat^2))) / (n*n.iter),
      digits=4)
print(pchisq(1:10, df=2), digits=4)
Best Answer
If your numbers can only be positive, then modeling them with a normal distribution may not be desirable, depending on your use case, because the normal distribution is supported on the entire real line. Perhaps you would want to model height as an exponential distribution, or maybe as a truncated normal distribution?
EDIT: After seeing your data, it really looks like it might fit an exponential distribution well! You could estimate the $\lambda$ parameter by taking, for example, a maximum likelihood approach.
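For an exponential model, the maximum likelihood estimate has a simple closed form, $\hat\lambda = 1/\bar{x}$, since the log-likelihood $n\log\lambda - \lambda\sum_i x_i$ is maximized there. A minimal sketch in R, with made-up positive observations standing in for your data:

```r
# Hypothetical positive observations (replace with your own data).
x <- c(0.8, 1.9, 0.3, 2.6, 1.1)
# MLE of the exponential rate: the reciprocal of the sample mean.
lambda.hat <- 1 / mean(x)
lambda.hat  # approximately 0.746 for these values
```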