Solved – Bootstrapping and comparing multiple proportions

Tags: bootstrap, p-value, probability, proportion, r

EDIT 1: Thanks to @gung for pointing out that if the bag initially held 324 sweets, then as Mother hands them out there are progressively fewer left for her to give. To keep things simple (I'm interested in understanding the basics rather than anything too advanced), he offered a rewording.

EDIT 2: This is not homework. I'm just trying to learn some statistics in a fun way

COMMENT: From all of the kind comments/answers thus far, it occurs to me that I really need to work on defining my questions in the future much better because I didn't realise how much the answer could vary depending on how it's posed (very useful lesson!). I initially thought this was well posed but it's good to learn these things and hopefully improve, thank you.

Let's say a mother has a bag containing an infinite number of sweets.

We keep track of how many sweets she gives to each of her children named "A", "B", …, "G" from this bag, and how many she gives out to an unknown number of adults. When the mother comes up to each person, she opens a new, identical bag of sweets, reaches in & grabs some for that individual. At the end of the day, she has given out a total of 324 sweets!

> (DF <- data.frame(A=15, B=4, C=1, D=4, E=44, F=4, G=1, Adults=251, Total=324))

   A B C D  E F G Adults Total
  15 4 1 4 44 4 1    251   324

Question 1: Her children were given a total of 73 sweets out of 324. The adults were given 251 sweets out of 324. The children want to know whether the adults were given a statistically significantly larger proportion of the sweets, even though their mother claims that everyone had an equal chance of getting the same number of sweets and it was just luck that some got more than others. So basically we want to compare the entire group of children against the entire group of adults, not individuals.

Question 2: Did one or more of the children get statistically significantly more sweets than their siblings (i.e. we ignore the Adults for this question and only consider the children "A", …, "G")? And if so, which one(s) got a statistically significant proportion? Mother again claims everyone had an equal chance of getting the same number of sweets, and it was just luck that some got more than others.

Additional Information: I want to use a bootstrapping approach because I seem to get a better feel for probability when using simulation (NB: I am not a statistician, just doing this for fun). Below I give my approach to the first question, which I think is OK, but I would appreciate some advice on whether I'm doing it correctly. I have no idea how to approach the second question because there are multiple proportions involved.


My approach to Question One:

Hypotheses: My null hypothesis is that children and adults had an equal chance of getting the same number of sweets. My alternative hypothesis is that Mother is more generous with giving out sweets to adults than to her children.

Research Question: How often might we observe a proportion of sweets as small as the children got, if the mother really was handing out sweets fairly?

Methodology: We can test this directly by creating a hypothetical population that embodies the null hypothesis, with 50% "children" and 50% "adults", and repeatedly drawing random samples of 324 from it, with replacement, recording the results. We then look at what proportion of these samples contain 73 or fewer "children" draws (this is our p-value).

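replicates <- 999
size <- 324
runs <- vector("list", replicates)  # pre-allocate one slot per replicate
reference <- 73                     # observed count for the children (a count, not a proportion)

for(i in seq_len(replicates)){
  # Under the null, each of the 324 sweets goes to "children" or "adult" with probability 0.5
  draws <- sample(x=c("children", "adult"), size=size, replace=TRUE, prob=c(0.5, 0.5))
  # Fixing the factor levels keeps both columns present in every table
  runs[[i]] <- table(factor(draws, levels=c("adult", "children")))
}

bootstraps <- as.data.frame(do.call(rbind, runs))
#   adult children
# 1   166      158
# 2   171      153
# 3   158      166
# 4   151      173
# 5   156      168
# etc.

# Proportion of replicates in which the children got 73 or fewer sweets
sum(bootstraps$children <= reference) / replicates
#[1] 0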

Conclusion: My p-value is effectively zero, which means there is strong evidence that Mother has been handing out sweets unfairly because getting 73 or fewer sweets is very unlikely to have happened by luck alone.
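
As a cross-check on the simulation, the null model used here (each of the 324 sweets independently goes to "children" with probability 0.5) is a binomial distribution, so the same tail probability can be computed exactly. This is only a sketch under that null model; the variable names `observed_children`, `p_exact` and `bt` are introduced for illustration.

```r
# Exact probability of 73 or fewer "children" draws out of 324
# when each draw is "children" with probability 0.5 (the null model above).
size <- 324
observed_children <- 73
p_exact <- pbinom(observed_children, size = size, prob = 0.5)
p_exact

# The same one-sided test via binom.test(); its p-value should agree with p_exact.
bt <- binom.test(observed_children, n = size, p = 0.5, alternative = "less")
bt$p.value
```

The exact value is far smaller than 1/999, which is why a simulation with 999 replicates reports 0: no replicate comes anywhere near 73.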

As for question 2, I'm drawing a complete blank 🙁

Best Answer

The questions, especially the second one, are meaningless as they stand right now. The problem is that the concept of "taking a random number of candies from the bag" is not defined. Even with a finite number of candies in the bag there could be multiple definitions. For example, the following two both sound reasonable, but give different results:

  1. If there are $n$ candies in the bag, the probabilities of taking exactly $0, 1, \ldots , n$ candies are all the same: $1/(n+1)$.
  2. We go through each candy, and decide whether to take it out with probability $p$. This means that the probability of exactly $x$ candies is ${n\choose x} p^x (1-p)^{n-x}$.
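
To see that the two definitions really do give different results, here is a small simulation sketch. The bag size `n = 20`, the probability `p = 0.5`, the number of draws and the seed are all hypothetical values chosen only for illustration.

```r
set.seed(1)      # arbitrary seed, only for reproducibility
n <- 20          # hypothetical number of candies in the bag
p <- 0.5         # hypothetical per-candy probability for definition 2
draws <- 10000   # number of simulated handfuls

# Definition 1: every count 0, 1, ..., n is equally likely.
uniform_counts <- sample(0:n, size = draws, replace = TRUE)

# Definition 2: each candy is taken independently with probability p.
binomial_counts <- rbinom(draws, size = n, prob = p)

# With p = 0.5 both definitions have mean n/2, but the spread is very different.
c(mean = mean(uniform_counts), var = var(uniform_counts))
c(mean = mean(binomial_counts), var = var(binomial_counts))
```

Under definition 1 very small and very large handfuls are as likely as average ones, whereas under definition 2 handfuls cluster around $np$, so an observation like 44 candies for one child would be judged very differently under the two models.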

Once you go to an infinite bag, neither of those options applies. So all we can say is that there is some unknown distribution that gives the probability of $x$ candies. Using the bootstrap idea you would estimate this as the observed distribution: $P(1 \text{ candy})=2/7$, $P(4 \text{ candies})=3/7$, $P(15 \text{ candies})=1/7$, and $P(44 \text{ candies})=1/7$. Note that this implies that the question of comparing the number of candies received by different children is almost meaningless. You could potentially calculate the probability of receiving the actual number or fewer candies for each child, and some would be less lucky than others, but somebody would have to be less lucky by definition.
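
A minimal sketch of what that looks like in R, using only the seven children's counts from the question. The names `obs_dist`, `resample` and `at_most` are introduced for illustration, and the seed is arbitrary.

```r
children <- c(A = 15, B = 4, C = 1, D = 4, E = 44, F = 4, G = 1)

# Observed (empirical) distribution of handful sizes:
# 1 candy -> 2/7, 4 candies -> 3/7, 15 candies -> 1/7, 44 candies -> 1/7
obs_dist <- prop.table(table(children))
obs_dist

# One bootstrap resample: seven handfuls drawn with replacement from the observed ones.
set.seed(1)  # arbitrary seed, only for reproducibility
resample <- sample(children, size = length(children), replace = TRUE)
resample

# For each child, the probability (under the observed distribution)
# of receiving that child's actual number of candies or fewer.
at_most <- setNames(ecdf(children)(children), names(children))
at_most
```

By construction the child with the largest handful gets a value of 1 and the children with the smallest handful get the smallest value, which is the sense in which somebody has to be the least lucky.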

As for the first question, you would need to bootstrap from the observed distribution after making some assumption about the adults, or about the process that sends children/adults to Mom. I can think of several options, but none is totally satisfactory, as you would want to keep the number of children fixed at 7 and the total number of candies fixed at 324 while keeping the observed distribution of candies per handful and varying the number of adults appropriately. Perhaps letting go of some of these conditions (e.g. the total number of candies) is reasonable.
