The number of boys in 500 families with 5 children each is investigated. There were 20 families with no boys, 75 with 1, 145 with 2, 140 with 3, 85 with 4, and 35 with 5 boys. Decide (at significance level α = 0.05) whether the number of boys in a five-child family follows a binomial distribution.

I want to test whether the data have a binomial distribution using Pearson's chi-squared test. Let X be the number of boys, where $p$ is the probability that a child is a boy. I want to test whether $B(5,p)$ is a reasonable model for the distribution of X. The sample mean is 2.6. After this I get stuck.

Perhaps it is time for a more complete answer.

You have values $x = (0,1,2,3,4,5)$ with corresponding observed frequencies $f = (20,75,145,140,85,35),$ totaling $m = \sum_x f_x = 500.$ Relative frequencies are $r_x = f_x/m.$

Thus the sample mean is $\bar X = \sum_x xr_x = 2.6,$ which is the estimate of the binomial mean $\mu = n\theta,$ where $n = 5$ is the number of independent trials and $\theta$ is the success probability. Because the estimated mean is $\bar X = \hat \mu = n\hat \theta = 2.6,$ we have $\hat \theta = 2.6/5 = 0.52.$

x = 0:5; f = c(20,75,145,140,85,35);  sum(f)
## 500 
r = f/sum(f);  sum(x*f)/500 
## 2.6 
p = 2.6/5;  p 
## 0.52 

Our null hypothesis is that the distribution $Binom(5, 0.52)$ is an appropriate model for the number of boys in a family. Under this distribution, the probabilities $p_x$ of the values $x$ are listed in the column pdf of the R output below. The corresponding expected frequencies $E_x = 500p_x$ are shown in the column E.

pdf = dbinom(x, 5, .52);  E = 500*pdf
cbind(x, f, r, pdf, E)
##     x   f    r       pdf         E
##     0  20 0.04 0.0254804  12.74020
##     1  75 0.15 0.1380188  69.00941
##     2 145 0.29 0.2990408 149.52038
##     3 140 0.28 0.3239608 161.98042
##     4  85 0.17 0.1754788  87.73939
##     5  35 0.07 0.0380204  19.01020

The chi-squared goodness-of-fit (GOF) statistic is $$Q = \sum_x \frac{(f_x - E_x)^2}{E_x},$$ which is approximately distributed as $Chisq(df = 6-2).$ The approximation is valid because all of the $E_x > 5.$ If we had been given the specific binomial parameters $n$ and $\theta,$ the degrees of freedom would have been $6 - 1,$ but we have estimated one parameter $\theta$ so $df = 6-2 = 4.$
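The two conditions just mentioned (expected counts above 5, and one degree of freedom lost for the estimated $\theta$) can be checked directly. This is only a small sketch; the names theta_hat and df_gof are mine, introduced for illustration.

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)

# Estimate theta from the sample mean: theta_hat = mean / n, with n = 5
theta_hat <- sum(x * f) / (5 * sum(f))

# Expected frequencies under Binom(5, theta_hat)
E <- sum(f) * dbinom(x, 5, theta_hat)

# Rule of thumb for the chi-squared approximation: every E_x exceeds 5
all(E > 5)
## TRUE

# Degrees of freedom: 6 categories, minus 1, minus 1 estimated parameter
df_gof <- length(x) - 1 - 1
df_gof
## 4
```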

Good agreement ('fit') between the observed frequencies $f_x$ and the expected frequencies $E_x$ gives small values of $Q.$ (Very unlikely perfect agreement would give $Q = 0.$) We reject the model $Binom(5, 0.52)$ for large values of $Q$.

The 95th percentile of $Chisq(df = 4)$ is 9.49, so we reject at the 5% level of significance if $Q > 9.49.$ For our data, the value of the test statistic is $Q = 21.31$, so we reject the model as unreasonable at the 5% level. The P-value 0.0003 is the probability $P(Q > 21.31),$ computed under the null distribution $Q \sim Chisq(df = 4).$ The chances of seeing the observed frequencies $f_x$ if the true model were $Binom(5, 0.52)$ are very small. [Looking at the earlier table, we see one major discrepancy is that we have many fewer (140) than the expected number (161.98) of families with three boys.]

Q = sum((f-E)^2/E); Q 
## 21.31109 
qchisq(.95, 6-2) 
## 9.487729
1 - pchisq(21.3, 4)
## 0.0002761148
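As a cross-check, the same statistic can be obtained from R's built-in chisq.test. One caution: chisq.test does not know that $\theta$ was estimated from the data, so it reports $df = 5$ and a P-value based on that. The sketch below therefore takes only the statistic and the expected counts from it and recomputes the P-value with $df = 4.$ The names gof, Q_chk, pval, and contrib are mine.

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)
p0 <- dbinom(x, 5, 0.52)            # hypothesized cell probabilities

gof <- chisq.test(f, p = p0)        # built-in GOF test (uses df = 5)
Q_chk <- unname(gof$statistic)      # same Q as computed by hand
Q_chk

# P-value with the correct df = 6 - 2 = 4 for one estimated parameter
pval <- pchisq(Q_chk, df = length(x) - 2, lower.tail = FALSE)
pval

# Contribution of each cell to Q, to see where the fit fails
contrib <- (f - gof$expected)^2 / gof$expected
round(cbind(x, f, E = gof$expected, contrib), 2)
```

The per-cell contributions show that the largest terms come from the extremes: families with five boys (roughly 13.4 of the total) and families with no boys (roughly 4.1), followed by families with three boys (roughly 3.0).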

One way to visualize the poor fit is to compare the probabilities $p_x$ (bars) from the hypothetical model with the relative frequencies ($\times$'s) $r_x$ actually observed. But such graphical comparisons do not reveal the total number of cases observed, and so they should only be viewed along with results from formal GOF tests.
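A plot of this kind could be drawn in base R roughly as follows. This is a sketch, not necessarily how the original figure was produced; the colors, labels, and the name mids are my own choices.

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)
r <- f / sum(f)                     # observed relative frequencies
p0 <- dbinom(x, 5, 0.52)            # model probabilities under Binom(5, 0.52)

# barplot returns the bar midpoints, reused here to position the crosses
mids <- barplot(p0, names.arg = x, col = "skyblue2",
                ylim = c(0, 1.1 * max(p0, r)),
                xlab = "Number of boys", ylab = "Probability",
                main = "Binom(5, 0.52) model vs. observed proportions")

# Overlay the observed relative frequencies as crosses
points(mids, r, pch = 4, lwd = 2, col = "red")
```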

