Solved – Bootstrapping and comparing multiple proportions

Tags: bootstrap, p-value, probability, proportion, r

EDIT 1: Thanks to @gung for pointing out that if the bag initially had 324 sweets, then as Mother hands them out, there are progressively fewer sweets left for her to hand out. To simplify things (because I'm interested in understanding the basics rather than something too advanced), he offered a rewording.

EDIT 2: This is not homework. I'm just trying to learn some statistics in a fun way.

COMMENT: From all of the kind comments/answers thus far, it occurs to me that I really need to work on defining my questions much better in the future, because I didn't realise how much the answer could vary depending on how a question is posed (very useful lesson!). I initially thought this was well posed, but it's good to learn these things and hopefully improve. Thank you.

Let's say a mother has a bag containing an infinite number of sweets.

We keep track of how many sweets she gives to each of her children named "A", "B", …, "G" from this bag, and how many she gives out to an unknown number of adults. When the mother comes up to each person, she opens a new, identical bag of sweets, reaches in and grabs some for that individual. At the end of the day, she has given out a total of 324 sweets!

> (DF <- data.frame(A=15, B=4, C=1, D=4, E=44, F=4, G=1, Adults=251, Total=324))

   A B C D  E F G Adults Total
  15 4 1 4 44 4 1    251   324

Question 1: Her children were given a total of 73 sweets out of 324. The adults were given 251 sweets out of 324. The children want to know if the adults were given a statistically significantly larger share of the sweets, even though their mother claims that everyone had an equal chance of getting the same number of sweets and it was just luck that some got more than others. So basically we want to compare the entire group of children against the entire group of adults, not individuals.

Question 2: Did one or more of the children get statistically significantly more sweets than their siblings (i.e. we ignore the Adults for this question and only consider the children "A", …, "G")? And if so, which one(s)? Mother again claims everyone had an equal chance of getting the same number of sweets, and it was just luck that some got more than others.

Additional Information: I want to use a bootstrapping approach because I seem to get a better feel for probability when using simulation (NB: I am not a statistician, just doing this for fun). Below I give my approach to the first question, which I think is OK, but I would appreciate some advice on whether I'm doing it correctly. I have no idea how to approach the second question because there are multiple proportions involved.

My approach to Question One:

Hypotheses: My null hypothesis is that children and adults had an equal chance of getting the same number of sweets. My alternative hypothesis is that Mother is more generous with giving out sweets to adults than to her children.

Research Question: How often might we observe a proportion of sweets as small as the children got, if the mother really was handing out sweets fairly?

Methodology: We can test this directly by creating a hypothetical population that embodies the null hypothesis, with 50% "children" and 50% "adults", and repeatedly drawing random samples of 324 from it, with replacement, recording the results. We then look at what fraction of these samples have 73 or fewer "children" in them (this is our p-value).

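replicates <- 999
size <- 324
runs <- vector("list", replicates)  # preallocate one slot per replicate
reference <- 73/size                # observed proportion given to the children

for(i in seq_len(replicates)){
  # under the null, each of the 324 sweets goes to a child or an adult with probability 0.5
  runs[[i]] <- table(sample(x=c("children", "adult"), size=size, replace=TRUE, prob=c(0.5, 0.5)))
}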

bootstraps <- as.data.frame(do.call(rbind, runs))
#   adult children
# 1   166      158
# 2   171      153
# 3   158      166
# 4   151      173
# 5   156      168
# etc.

# reference is a proportion (73/324), so convert the simulated counts to proportions first
sum(bootstraps$children / size <= reference) / replicates
#[1] 0

Conclusion: My p-value is effectively zero (none of the 999 simulated samples had 73 or fewer "children", so p < 0.001), which means there is strong evidence that Mother has been handing out sweets unfairly, because getting 73 or fewer sweets is very unlikely to have happened by luck alone.
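As a cross-check on the simulation, here is a minimal sketch under the same null model as above (each of the 324 sweets independently goes to a child with probability 0.5). The exact tail probability is available from the binomial distribution via `pbinom`, and `binom.test` runs the same one-sided test.

```r
size <- 324      # total sweets handed out
observed <- 73   # sweets received by the children

# Exact one-sided p-value: P(X <= 73) for X ~ Binomial(324, 0.5)
p_exact <- pbinom(observed, size = size, prob = 0.5)
p_exact

# The same test through the built-in exact binomial test
bt <- binom.test(observed, size, p = 0.5, alternative = "less")
bt$p.value
```

With 999 replicates the smallest non-zero simulated p-value is 1/999, so a simulated value of 0 is consistent with a very small exact probability.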

As for question 2, I'm drawing a complete blank 🙁

Best Answer

The questions, especially the second one, are meaningless as they stand right now. The problem is that the concept of "taking a random number of candies from the bag" is not defined. Even with a finite number of candies in the bag there could be multiple definitions. For example, the following two both sound reasonable, but give different results:

1. If there are $n$ candies in the bag, the probabilities of taking exactly $0, 1, \ldots , n$ candies are all the same: $1/(n+1)$.
2. We go through each candy, and decide whether to take it out with probability $p$. This means that the probability of exactly $x$ candies is ${n\choose x} p^x (1-p)^{n-x}$.
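The difference between these two definitions can be seen in a small simulation. This is only a sketch: the bag size `n = 20` and the probability `p = 0.5` are illustrative assumptions, not values from the question.

```r
set.seed(1)
n <- 20         # assumed bag size (illustrative)
p <- 0.5        # assumed per-candy probability (illustrative)
draws <- 10000  # number of simulated handfuls per definition

# Definition 1: every handful size 0..n is equally likely
uniform_handfuls <- sample(0:n, size = draws, replace = TRUE)

# Definition 2: each candy is taken independently with probability p
binomial_handfuls <- rbinom(draws, size = n, prob = p)

# Similar means (both near n/2), but very different spread
c(uniform = mean(uniform_handfuls), binomial = mean(binomial_handfuls))
c(uniform = var(uniform_handfuls), binomial = var(binomial_handfuls))
```

Both definitions give about the same average handful here, yet the uniform version produces far more very small and very large handfuls, so a handful of 44 versus 1 would be judged quite differently under each.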

Once you go to an infinite bag, neither of those options applies. So all we can say is that there is some unknown distribution that gives the probability of $x$ candies. Using the bootstrap idea you would estimate this as the observed distribution: $P(1 \text{ candy})=2/7$, $P(4 \text{ candies})=3/7$, $P(15 \text{ candies})=1/7$, and $P(44 \text{ candies})=1/7$. Note that this implies that the question of comparing the number of candies received by different children is almost meaningless. You could potentially calculate the probability of receiving the actual number or fewer candies for each child, and some would be less lucky than others, but somebody would have to be less lucky by definition.
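A minimal sketch of that bootstrap, assuming each child's handful is an independent draw from the empirical distribution of the seven observed handfuls. The statistic computed here (how often at least one of seven children receives 44 or more candies) is an illustrative choice, not something specified in the question.

```r
set.seed(1)
handfuls <- c(A = 15, B = 4, C = 1, D = 4, E = 44, F = 4, G = 1)

# Empirical distribution that the bootstrap resamples from
prop.table(table(handfuls))

# Resample seven handfuls with replacement, many times;
# each column of `boot` is one simulated set of seven children
replicates <- 10000
boot <- replicate(replicates,
                  sample(handfuls, size = length(handfuls), replace = TRUE))

# How often does at least one child get 44 or more candies?
p_any_44 <- mean(apply(boot, 2, max) >= 44)
p_any_44
```

Under this estimated distribution a single child gets 44 candies with probability 1/7, so some child among seven does so with probability 1 - (6/7)^7, roughly two thirds. Seeing one large handful among the children is therefore unremarkable under the bootstrap null.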

As for the first question, you would need to bootstrap from the observed distribution after making some assumption about the adults, or the process that sends children/adults to Mom. I can think of several options, but nothing is totally satisfactory, as you would want to keep the number of children fixed at 7 and the total number of candies fixed at 324 while keeping the observed distribution of candies per handful and varying the number of adults appropriately. Perhaps letting go of some of these conditions (e.g. the total number of candies) is reasonable.

Related Solutions

Solved – Nested ANOVA: Unequal sample sizes? Variance components

Hopefully your friend has graduated by now, but if not, the following might help.

You were on the right track in your original post Partitioning variance from logistic regression, using glmer() for mixed-effects logistic regression.

I would recommend against the advisor's "solution", against using lm() for logistic regression, and against weighting rows equally (you should weight by N_indiv).

Generalized linear mixed models are tough. http://glmm.wikidot.com/faq has some good information - including the fact that you need many levels of a random factor in order to estimate its variance component.

My code below requires the lme4 package and the data from your link.

library(lme4)  # provides glmer(), fixef() and getME()

# Seroprevalence has been rounded, and that's not OK:
# to do logistic regression, (proportion * weight) must equal an integer
prev$seroexact <- round(prev$Seroprevalence * prev$N_indiv)/prev$N_indiv

# Host.Species is nested within Social.system, but you didn't reuse 
# species letters between Social.systems, so you can specify 
# Host.Species as a random effect without explicitly nesting it

# First random effect model
prev1.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species),
                  family=binomial(link="logit"), weights=N_indiv, data = prev)
summary(prev1.glmer)

## Fixed effects:
# Intercept is pathogen A and social.system A.  
# The z-test of the intercept is testing if the logit=0
# I.e. it's testing whether the combination of
# pathogen A and social.system A has prob=0.5.
# The other z-tests are testing whether other levels of the factors
# yield different probabilities than pathogen A and social.system A

## Random effects:
# This doesn't give you separate Host.Species and residual variances,
# Host.Species is treated as a random effect, so this model is the same as if
# you had summed the results of all studies with identical values of
# Host.Species, Pathogen, and Social.System. I.e. sum the results of the
# first 8 rows and create a single proportion and N_indiv, like so:

prevsum<-aggregate(cbind(N_indiv, prop=(seroexact*N_indiv)) ~ 
                   Social.System+Host.Species+Pathogen, data=prev, sum)
prevsum$prop<-prevsum$prop/prevsum$N_indiv

# which gives the same model:
prevsum.glmer = glmer(prop ~ Pathogen + Social.System + (1|Host.Species),
                      family=binomial(link="logit"), weights=N_indiv, data = prevsum)
summary(prevsum.glmer)

# So why are they broken up into multiple rows?  If each row represents
# one geographic area/time/litter/study/etc. then animals in one row
# might be more similar to each other than they are to animals in
# another row that has the same values of Social, Species, & Pathogen.
# I think this is what the advisor wants as a "residual".

# To allow a random component for each row:
prev2<-cbind(resid=paste("Row_", row.names(prev), sep=""), prev)

prev2.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                   family=binomial(link="logit"), weights=N_indiv, data = prev2)
summary(prev2.glmer)

# This isn't a bad start, but I'm not comfortable with it because:
table(prev2[,2:3])

# Social.System D is only observed in Species F.
# This is called confounding, and it makes it hard to draw conclusions
# about Social System D.  How do you know what is caused by social
# system D and what is caused by species F?  If your friend really wants to
# make inferences about Social System D, she should collect data from
# another host species that uses Social System D.

# Leave out Soc_D:
prev3.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev3.glmer)

# Even though Host Species is conceptually a random factor, you really need to observe
# more than 2 species per social system for a mixed model to accurately estimate
# the species variance.  As far as species variance is concerned, each species is a
# single sample (not animals or even litters), and you can't hope to estimate variance
# accurately with only two samples.

# We can fit the model with species as a fixed effect, but we don't have
# enough degrees of freedom to estimate all levels of Species:
prev4.glmer = glmer(seroexact ~ Pathogen + Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# Your friend doesn't need to estimate the level of each species in order to test
# whether species has any noticeable effect at all.  Unfortunately, we can't just
# use the F statistic from anova() because calculating the denominator df for a
# GLMM is not straightforward.
anova(prev4.glmer) #Gives you an F statistic, but no denominator df or p-value

# Instead we fit a simpler model without Species:
prev5.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# And we'll compare the two models with a likelihood-ratio test using anova()
anova(prev5.glmer,prev4.glmer)

# With a p-value of 0.01331 we can say it's worth keeping Species in the model.

# Now let's check the pathogen * social system interaction:
prev6.glmer = glmer(seroexact ~ Pathogen * Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, nAGQ=2, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev6.glmer) #Neither interaction term is significant
anova(prev6.glmer)
# We don't need a denominator df to know that the F statistic of 0.0774 for
# the interaction is insignificant.

# Since the interaction between Pathogen and Social System was not significant,
# we don't need to include the interaction term.  Similarly, I don't see a 
# statistical reason to  split the model into two separate 'pathogen specific'
# models, but maybe there's a scientific reason to do so:

# Separate tests for each pathogen:
prev7A.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev2$Pathogen == "Path_A",])
summary(prev7A.glmer)
# Social System B looks different from Social System A in pathogen A prevalence.

# Calculate the odds of having Pathogen A for Social System A vs B
beta7A<-fixef(prev7A.glmer)
exp(-beta7A[2]) #negative sign means odds of A:B instead of B:A
# So animals with Social System A have about 25 times the odds of
# animals with social system B of having Pathogen A

# Test for Pathogen B:
prev7B.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev2$Pathogen == "Path_B",])
summary(prev7B.glmer)
# The only significant effects are species-specific, which are not of interest

# Let's return to prev4.glmer, which models both pathogens:
summary(prev4.glmer)

# The only significant fixed effect in prev4.glmer is Pathogen.
beta4<-fixef(prev4.glmer)

# For a randomly selected animal, the odds of having Pathogen B to having Pathogen A are:
exp(beta4[2])

# That's about as much as you can interpret with the data she has.

# To answer the Advisor's request for variance components:
# Residual variance is:
getME(prev4.glmer, "theta")^2

# You can't do a good job of estimating species variance with these data.
# If her advisor won't listen, then you can tell him that your estimate is:
getME(prev3.glmer, "theta")[2]^2
# But it's a really crappy estimate.

# There is no such thing as a variance component for Social System because
# it's a fixed effect.  But you can get its sum of squares:
anova(prev4.glmer)
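
The rounding correction at the top of the code above can be checked in isolation. This is a minimal sketch with hypothetical values (`n_indiv` and `rounded` are made up for illustration and are not taken from the linked data). It uses only base R's `glm()`, which warns about non-integer successes when a weighted binomial model is given rounded proportions.

```r
# Hypothetical group sizes and prevalences rounded to two decimals
n_indiv <- c(12, 30, 7)
rounded <- c(0.33, 0.47, 0.57)

rounded * n_indiv   # implied numbers of positives are not whole animals

# Recover proportions that correspond to whole animals
exact <- round(rounded * n_indiv) / n_indiv
exact * n_indiv     # whole numbers again

# Returns TRUE if fitting a weighted binomial GLM raises a warning
warns <- function(prop) {
  tryCatch({
    glm(prop ~ 1, family = binomial(link = "logit"), weights = n_indiv)
    FALSE
  }, warning = function(w) TRUE)
}

warns(rounded)   # rounded proportions
warns(exact)     # corrected proportions
```

The corrected proportions differ from the rounded ones only in the third decimal place, but they make proportion times weight an integer, which is what the binomial likelihood expects.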
