Statistics – How to Find Sample Standard Deviation from Histogram?

statistics

a) Hi, I need help with this question in the image. I know that the sample median is 37.7 as the data is heavily skewed to the right. But I don't know what numbers to assign to sample mean and sample standard deviation. How can I figure this out?

b) The distribution is bimodal and skewed to the right. But I don't know how to use this information to figure out what to assign to the sample standard deviation and sample mean.

Thank You

Best Answer

Estimating Sample Mean and SD from Histogram

Method:

A frequently used approach is to pretend that all observations in a histogram bin fall at the center of the bin. Centers are: $(m_1 = 30, m_2 = 50, \dots, m_8 = 170).$

Then try to read bin frequencies $f_1, \dots, f_8$ from the vertical scale of the histogram. Also, check that $n = \sum_{i=1}^8 f_i = 90,$ as claimed in the header.

A reasonable estimate of the sample mean is $$\bar X \approx \frac 1 n\sum_{i=1}^8 f_im_i.$$

And the sample variance is estimated as

$$S_x^2 \approx \frac{1}{n-1}\sum_{i=1}^8 f_i(m_i - \bar X)^2.$$

Take the square root to estimate the sample SD.

Example: Suppose we have a sample of $n = 90$ observations from $\mathsf{Exp}(\mathrm{rate}=0.02),$ an exponential distribution with mean $\mu = 50$ and standard deviation $\sigma = 50,$ as sampled in R below:

set.seed(919)
x = rexp(90, 0.02)
summary(x);  sd(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.4624  14.2365  25.9412  44.1609  54.6213 173.0243 
[1] 43.02516  # sample SD

cp = seq(0,180, by=20);  cp  # cut points for histogram
 [1]   0  20  40  60  80 100 120 140 160 180
hist(x, breaks = cp, ylim = c(0,40), labels = TRUE)

Estimate sample mean a and standard deviation s from the histogram:

m = seq(10, 170, by=20)
f = c(32,23,14,4,3,6,4,2,2)  # copied from histogram
sum(f)
[1] 90
a = sum(f*m)/90;  a
[1] 45.33333             # exact sample mean: 44.16
v = sum(f*(m-a)^2)/89; v
[1] 1807.191
s = sqrt(v);  s
[1] 42.51107             # exact sample SD: 43.03

Notice that in order to match roughly with your problem, my histogram required nine bins. The approximations from the histogram are not far from the exact computations for the data.
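If it helps, the same midpoint calculation can be wrapped in a small function. This is only a sketch: `grouped_stats` is a hypothetical helper name (not part of base R or of the answer above), and it assumes you supply the cut points and the bin counts you read off the histogram.

```r
# Hypothetical helper: grouped mean and SD from histogram cut points and
# bin counts, treating every observation as if it sat at its bin midpoint.
grouped_stats <- function(breaks, counts) {
  stopifnot(length(breaks) == length(counts) + 1, all(counts >= 0))
  mid <- (head(breaks, -1) + tail(breaks, -1)) / 2       # bin midpoints
  n <- sum(counts)                                       # sample size
  mean_est <- sum(counts * mid) / n                      # grouped mean
  var_est <- sum(counts * (mid - mean_est)^2) / (n - 1)  # grouped variance
  c(n = n, mean = mean_est, sd = sqrt(var_est))
}

# Counts copied from the simulated histogram above
res <- grouped_stats(breaks = seq(0, 180, by = 20),
                     counts = c(32, 23, 14, 4, 3, 6, 4, 2, 2))
print(res)
```

For the counts above this should reproduce the estimates 45.33 and 42.51. Deriving the midpoints from the cut points means the function also accepts bins of unequal width.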

Addendum on "modes" in histograms. According to your view on modes, I guess the histogram below must have several modes. How many would you say? Three? Four?

Related Solutions

[Math] How to calculate median and standard deviation from histogram

You can't calculate any of them exactly, because all you have is the interval each value belongs to, not the values themselves. Note that it is the mode, not the median, that is in the tallest bin. You can determine which bin the median is in, and so know that it falls between the two endpoints of that bin. To find the bin, total the counts bin by bin from the left until the running total reaches the rank of the middle observation: (n+1)/2 when n is odd, or n/2 and n/2+1 when n is even. If both of those ranks fall in the same bin, the median is in that bin. If rank n/2+1 falls in a higher bin than rank n/2, you can't be sure which bin the median is in, but you know it is near the boundary separating the two adjacent bins.
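As a rough illustration of the counting procedure, here is a short R sketch. It reuses the cut points and bin counts from the simulated example earlier on this page. The variable names are mine, and this only locates the bin, not the median itself.

```r
# Locate the bin(s) that must contain the median, using counts alone.
breaks <- seq(0, 180, by = 20)
counts <- c(32, 23, 14, 4, 3, 6, 4, 2, 2)
n <- sum(counts)
cum <- cumsum(counts)   # running total of observations from the left

# Rank(s) of the middle observation(s): one rank if n is odd, two if even.
ranks <- if (n %% 2 == 1) (n + 1) / 2 else c(n / 2, n / 2 + 1)

# For each rank, the first bin whose running total reaches it.
bins <- sapply(ranks, function(r) which(cum >= r)[1])
median_bins <- unique(bins)

# Endpoints of the candidate bin(s)
print(data.frame(lower = breaks[median_bins], upper = breaks[median_bins + 1]))
```

Here n = 90, so the middle ranks are 45 and 46. Both fall in the second bin, from 20 to 40, which agrees with the exact sample median of about 25.94 reported for that data. If `median_bins` had two entries, the median would lie near the boundary between those two bins.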

You can calculate grouped mean and grouped variances which may be rough approximations to the actual sample means and variances but not exact.

[Math] Is it possible to calculate the mean and standard deviation from a median and quartiles

It's mathematically impossible to deduce mean or standard deviation from median/quartiles, because medians and quartiles discard most of the data on which the mean and standard deviation are based.

Example:

data   frequency  
   0       50      
 1.4        4     
   2       50

That has a mean of 1.0 and a standard deviation of about 1.0. (I'm rounding coarsely, to roughly 2 significant figures, so I don't have to go into population versus sample standard deviation.)

data     frequency    
   0       30        
 1.4       44        
   2       30

That data also has the median and quartiles the same as in your example, but now the mean is 1.2 and the standard deviation is 0.8.
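You can check the first two tables yourself with a few lines of R. This is only a sketch: it expands each frequency table into raw data with `rep()`, and relies on R's default `quantile()` method (type 7) and on `sd()` using the n - 1 denominator.

```r
# Expand the two frequency tables into raw observations
x1 <- rep(c(0, 1.4, 2), times = c(50, 4, 50))
x2 <- rep(c(0, 1.4, 2), times = c(30, 44, 30))

probs <- c(0.25, 0.5, 0.75)
q1 <- quantile(x1, probs)   # quartiles and median of the first data set
q2 <- quantile(x2, probs)   # quartiles and median of the second data set
print(rbind(q1, q2))

# Means and sample standard deviations of the two data sets
print(c(mean1 = mean(x1), mean2 = mean(x2)))
print(c(sd1 = sd(x1), sd2 = sd(x2)))
```

Both data sets give quartiles 0 and 2 with median 1.4, while the means and standard deviations differ, which is the point of the example.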

data     frequency        
   0       50        
 1.4        3        
   2       50        
10000000    1

Now I've changed my maximum without changing the median or quartiles. You can see even more clearly how the median and quartiles exclude extreme data, because the mean is now 96000 and the standard deviation is 980000 (still 2 sig. fig.).
