[Math] How to test if data follows a distribution

binomial distribution, biology, hypothesis testing, poisson distribution

I have been given some data and asked to determine whether it follows any distribution. The question says to compare the observed and expected data graphically and then to test further. The distributions we've studied so far are normal, binomial and Poisson, so I assume it will be one of these. We've used these techniques so far: one- and two-sample t-tests, chi-squared tests of association and goodness of fit, one-way and two-way ANOVA, confidence intervals, techniques for estimating abundance, and linear regression, amongst a few others.

The data are as follows: the hooves of 100 cows are swabbed and checked for a bacterium. The bacterium survives on the hooves for weeks, and on the grass for days.

 No. of hooves that test positive per animal    Frequency
 0                                              25
 1                                               5
 2                                              15
 3                                              30
 4                                              25

What I have done so far: I've calculated the normal, Poisson and binomial probabilities and the expected frequencies from them. I believe the binomial distribution is the right one, as the count is bounded (a maximum of 4 hooves). I plotted the observed and expected frequencies in a bar chart to compare them graphically, and they do not appear to match.

I think my next step should be a chi-squared goodness-of-fit test, but after that I don't know what more to do, or whether I need to do anything more. Do I need to calculate a confidence interval, and what would I use it for? Is there anything else you would recommend?

Thanks in advance!

Best Answer

A Mixture Model for Exposed and Unexposed Animals

Suppose 1/4 of the cows are unexposed to bacteria, and among the 3/4 of cows that are exposed the number of hooves with bacteria is Binomial with n = 4 and p = 3/4. This model gives the probability table of hooves with bacteria shown in the table below.

This binomial distribution was deduced from the fact that there are 225 hooves with bacteria out of 75 exposed animals for an average of 3 hooves per animal. So the binomial mean must be $\mu = 3 = 4p$, whence $p = 3/4.$

Each of the probabilities for 1 through 4 hooves is 3/4 of the probability assigned by $Bin(4, 3/4).$ The probability for 0 is .25 plus 3/4 of the binomial probability of 0. (Probabilities are rounded to four places and slightly 'fudged' in the fourth place so that they add to 1. This method works without complication only because the binomial part of the model contributes extremely little probability for 0 hooves.)

Expected counts are probabilities multiplied by 100 cows. Observed counts are the data reported in the problem.

 Hooves       0      1      2      3      4
          ---------------------------------
 Prob     .2528  .0351  .1586  .3163  .2372  
 Exp      25.28   3.51  15.86  31.63  23.72
 Obs      25      5     15     30     25
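The table can be approximately reproduced with a short R sketch. It assumes, as the answer does, that all 25 cows with no positive hooves are treated as unexposed when estimating $p$; the variable names (`mix_prob`, `p_hat`, `w`) are illustrative only. Because the table's probabilities were rounded and slightly adjusted by hand, the values computed here agree with it only to about three decimal places.

```r
# Observed counts of cows with 0..4 positive hooves (from the question)
hooves <- 0:4
obs <- c(25, 5, 15, 30, 25)

# Assumption: every zero-hoof cow is counted as unexposed when estimating p
exposed <- sum(obs[hooves > 0])              # 75 animals
p_hat <- sum(hooves * obs) / (4 * exposed)   # 225 positive hooves / 300 = 0.75

# Mixture: weight w on "unexposed" (always 0 hooves),
# weight 1 - w on "exposed" with Bin(4, p_hat)
w <- 1 / 4
mix_prob <- (1 - w) * dbinom(hooves, 4, p_hat)
mix_prob[1] <- mix_prob[1] + w

expected <- 100 * mix_prob
round(cbind(hooves, mix_prob, expected, obs), 4)
```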

The standard chi-squared goodness-of-fit test (as implemented in R) gives the output shown below.

 prob=c(.2528, .0351, .1586, .3163, .2372)
 obs = c(25, 5, 15, 30, 25) 
 chisq.test(obs, p=prob)

 ##   Chi-squared test for given probabilities
 ##
 ## data:  obs 
 ## X-squared = 0.8353, df = 4, p-value = 0.9337

There is a warning message because the expected count for 1 hoof (3.51) is less than 5, which puts the chi-squared approximation to the distribution of the test statistic in some doubt. However, an exact test (simulated permutation test) gave a P-value of 0.9371.

So there is no question that the observed counts are consistent with the proposed probability model. (Other distributions might fit as well, but the question implied we should look for an answer based on a binomial or Poisson distribution. The data fit the model almost 'too well', suggesting that the data might have been contrived to make the solution to the problem easier to find.)
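The answer does not say how the simulated P-value was obtained. One way to get a comparable figure, offered here only as a sketch, is the Monte Carlo option built into `chisq.test` (`simulate.p.value = TRUE`, with `B` replicates). The seed and the value of `B` are arbitrary choices, so the result will be close to, but not necessarily equal to, the value quoted above.

```r
prob <- c(.2528, .0351, .1586, .3163, .2372)
obs <- c(25, 5, 15, 30, 25)

set.seed(2024)   # arbitrary seed, used only for reproducibility
# Monte Carlo P-value: does not rely on the chi-squared approximation
sim <- chisq.test(obs, p = prob, simulate.p.value = TRUE, B = 10000)
sim$statistic    # same X-squared as the asymptotic test
sim$p.value
```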

Related Solutions

[Math] Chi-squared goodness-of-fit test whether the data follows binomial distribution

Perhaps it is time for a more complete answer.

You have values $x = (0,1,2,3,4,5)$ with corresponding observed frequencies $f = (20,75,145,140,85,35),$ totaling $m = \sum_x f_x = 500.$ Relative frequencies are $r_x = f_x/m.$

Thus the sample mean is $\bar X = \sum_x xr_x = 2.6,$ which is the estimate of the binomial mean $\mu = n\theta,$ where $n = 5$ is the number of independent trials and $\theta$ is the success probability. Because the estimated mean is $\bar X = \hat \mu = n\hat \theta = 2.6,$ we have $\hat \theta = 2.6/5 = 0.52.$

x = 0:5; f = c(20,75,145,140,85,35) 
sum(f) 
## 500 
r = f/sum(f);  sum(x*f)/500 
## 2.6 
p = 2.6/5;  p 
## 0.52

Our null hypothesis is that the distribution $Binom(5, 0.52)$ is an appropriate model for the number of successes. Under this distribution, the probabilities $p_x$ of the values $x$ are shown as pdf in the R output below. The expected frequencies under this binomial model, $E_x = 500p_x,$ are shown in the column E.

pdf = dbinom(x, 5, .52);  E = 500*pdf
cbind(x, f, r, pdf, E)
##     x   f    r       pdf         E
##     0  20 0.04 0.0254804  12.74020
##     1  75 0.15 0.1380188  69.00941
##     2 145 0.29 0.2990408 149.52038
##     3 140 0.28 0.3239608 161.98042
##     4  85 0.17 0.1754788  87.73939
##     5  35 0.07 0.0380204  19.01020

The chi-squared goodness-of-fit (GOF) statistic is $$Q = \sum_x \frac{(f_x - E_x)^2}{E_x},$$ which is approximately distributed as $Chisq(df = 6-2).$ The approximation is valid because all of the $E_x > 5.$ If we had been given the specific binomial parameters $n$ and $\theta,$ the degrees of freedom would have been $6 - 1,$ but we have estimated one parameter $\theta$ so $df = 6-2 = 4.$

Good agreement ('fit') between the observed frequencies $f_x$ and the expected frequencies $E_x$ gives small values of $Q.$ (Very unlikely perfect agreement would give $Q = 0.$) We reject the model $Binom(5, 0.52)$ for large values of $Q$.

The 95th percentile of $Chisq(df = 4)$ is 9.49, so we reject at the 5% level of significance if $Q > 9.49.$ For our data, the value of the test statistic is $Q = 21.31$, so we reject the model as unreasonable at the 5% level. The P-value 0.0003 is the probability $P(Q > 21.31),$ computed under $Q \sim Chisq(df = 4).$ The chances of seeing the observed frequencies $f_x$ if the true model were $Binom(5, 0.52)$ are very small. [Looking at the earlier table, we see one major discrepancy is that we have many fewer (140) than the expected number (161.98) of families with three boys.]

Q = sum((f-E)^2/E); Q 
## 21.31109 
qchisq(.95, 6-2) 
## 9.487729
1 - pchisq(21.3, 4)
## 0.0002761148
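As a cross-check on the hand calculation, the same statistic can be obtained from `chisq.test`. This is only a sketch: `chisq.test` has no way of knowing that $\theta$ was estimated from the data, so it reports $df = 6 - 1 = 5$, and the P-value has to be recomputed with $df = 4$ as explained above. The names `theta_hat`, `E_prob` and `p_adj` are illustrative.

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)
theta_hat <- sum(x * f) / (5 * sum(f))   # 0.52, estimated from the data
E_prob <- dbinom(x, 5, theta_hat)

res <- chisq.test(f, p = E_prob)
Q <- unname(res$statistic)            # same Q as the hand calculation
df_default <- unname(res$parameter)   # 5: not adjusted for the estimated theta

# P-value with one further degree of freedom removed for the estimated theta
p_adj <- pchisq(Q, df = length(x) - 2, lower.tail = FALSE)
c(Q = Q, df_default = df_default, p_adj = p_adj)
```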

One way to visualize the poor fit is to compare the probabilities $p_x$ (bars) from the hypothetical model with the relative frequencies ($\times$'s) $r_x$ actually observed. But such graphical comparisons do not reveal the total number of cases observed, and so they should only be viewed along with results from formal GOF tests.
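A plot of this kind could be drawn along the following lines. This is a minimal base-graphics sketch, not the original figure: the colors, line widths and title are arbitrary choices, and `pdf_binom` is an illustrative name (chosen to avoid masking R's `pdf()` graphics device function).

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)
r <- f / sum(f)                   # observed relative frequencies
pdf_binom <- dbinom(x, 5, 0.52)   # model probabilities p_x

# Vertical bars for the model probabilities
plot(x, pdf_binom, type = "h", lwd = 3,
     ylim = c(0, max(pdf_binom, r)),
     xlab = "x", ylab = "Probability",
     main = "Binom(5, 0.52) vs. observed relative frequencies")
# Crosses for the observed relative frequencies
points(x, r, pch = 4, col = "red", cex = 1.5)
```

The largest vertical gap between a bar and its cross is at $x = 3,$ matching the discrepancy noted in the table of expected frequencies.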
