Statistics – How to Find Sample Standard Deviation from Histogram?

statistics

[Image: the question, including its histogram]

a) Hi, I need help with the question in the image. I know that the sample median is 37.7, since the data are heavily skewed to the right. But I don't know which numbers to assign to the sample mean and the sample standard deviation. How can I figure this out?

b) The distribution is bimodal and skewed to the right, but I don't know how to use this information to decide which values to assign to the sample standard deviation and the sample mean.

Thank You

Best Answer

Estimating Sample Mean and SD from Histogram

Method:

A frequently used approach is to treat every observation in a histogram bin as if it fell at the center of the bin. Here the bin centers are $m_1 = 30, m_2 = 50, \dots, m_8 = 170.$

Then try to read bin frequencies $f_1, \dots, f_8$ from the vertical scale of the histogram. Also, check that $n = \sum_{i=1}^8 f_i = 90,$ as claimed in the header.

A reasonable estimate of the sample mean is $$\bar X \approx \frac 1 n\sum_{i=1}^8 f_im_i.$$

And the sample variance is estimated as

$$S_x^2 \approx \frac{1}{n-1}\sum_{i=1}^8 f_i(m_i - \bar X)^2.$$

Take the square root to estimate the sample SD.
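
The two formulas above can be wrapped in a small R function. This is only a minimal sketch: the name `grouped_mean_sd` and the three-bin toy histogram are made up for illustration and are not part of the original problem.

```r
# Hypothetical helper (not from the original answer): estimate the sample
# mean and SD from bin centers m and bin frequencies f.
grouped_mean_sd <- function(m, f) {
  stopifnot(length(m) == length(f), all(f >= 0))
  n    <- sum(f)                           # total number of observations
  xbar <- sum(f * m) / n                   # each observation placed at its bin center
  s2   <- sum(f * (m - xbar)^2) / (n - 1)  # sample variance with divisor n - 1
  list(n = n, mean = xbar, sd = sqrt(s2))
}

# Made-up toy histogram: three bins centered at 10, 30, 50
res <- grouped_mean_sd(m = c(10, 30, 50), f = c(2, 5, 3))
res$mean   # 32
res$sd     # sqrt(1960/9), about 14.76
```

Using `sum(f)` for $n$ inside the function, instead of typing the sample size by hand, also catches a misread bar: if the frequencies do not add up to the stated sample size, `res$n` shows it.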

Example: Suppose we have a sample of $n = 90$ observations from $\mathsf{Exp}(\mathrm{rate}=0.02),$ an exponential distribution with mean $\mu = 50$ and standard deviation $\sigma = 50,$ as sampled in R below:

set.seed(919)         # for reproducibility
x = rexp(90, 0.02)    # n = 90 observations from Exp(rate = 0.02)
summary(x);  sd(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.4624  14.2365  25.9412  44.1609  54.6213 173.0243 
[1] 43.02516  # sample SD

cp = seq(0,180, by=20);  cp  # cut points for histogram
 [1]   0  20  40  60  80 100 120 140 160 180
hist(x, breaks = cp, ylim = c(0,40), labels = TRUE)  # print counts above bars

[Image: histogram of x with cut points 0, 20, ..., 180 and the count printed above each bar]

Estimate the sample mean (a) and the sample standard deviation (s) from the histogram:

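m = seq(10, 170, by=20)      # bin centers 10, 30, ..., 170
f = c(32,23,14,4,3,6,4,2,2)  # bin counts copied from histogram
n = sum(f);  n               # should match the sample size
[1] 90
a = sum(f*m)/n;  a
[1] 45.33333             # approximates the exact mean 44.16
v = sum(f*(m-a)^2)/(n-1); v
[1] 1807.191
s = sqrt(v);  s
[1] 42.51107             # approximates the exact SD 43.03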

Notice that, to match your problem roughly, my histogram needed nine bins rather than eight. The approximations from the histogram (mean 45.33, SD 42.51) are not far from the exact values computed from the data (mean 44.16, SD 43.03).
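
As a cross-check on counts copied by eye, and only when the raw data are at hand (which they are not in your problem), `hist()` can return the bin midpoints and counts without drawing anything. The sketch below reuses the seed, sample, and cut points from the example; the variable names are just for illustration.

```r
set.seed(919)
x  <- rexp(90, 0.02)                      # same sample as in the example
cp <- seq(0, 180, by = 20)                # same cut points

h <- hist(x, breaks = cp, plot = FALSE)   # compute the bins without plotting
m <- h$mids                               # bin centers: 10, 30, ..., 170
f <- h$counts                             # bin frequencies

n <- sum(f)
a <- sum(f * m) / n                       # grouped estimate of the mean
s <- sqrt(sum(f * (m - a)^2) / (n - 1))   # grouped estimate of the SD
c(mean = a, sd = s)
```

If the counts were read correctly from the plot, `f` here should agree with the vector typed in by hand above.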

Addendum on "modes" in histograms. According to your view on modes, I guess the histogram below must have several modes. How many would you say? Three? Four?

[Image: histogram for the addendum on modes]