Solved – Difference of means in unequal sample sizes, variances for a boolean data type

hypothesis testingr

I would like to test the difference in means of two samples in the same population. These samples come from a time series, but because of the way the data is structured, a reliable time series cannot be built.

I thought pooling the data into groups before/after a certain date, and test the difference between these groups.

Each element in this group is a boolean: 1 if condition x, 0 otherwise. As they are pooled into the before/after, the histogram of each looks something like this:

enter image description here

I thought about Welch's test for unequal samples and variances, but I am unsure of whether this is correct given the constrains.

This is the R command I thought should be ran:

t.test(before_year, after_year, alternative = "two.sided", var.equal = FALSE)
#for completeness, this is the output of the previous test:

    Welch Two Sample t-test

data:  before_time and after_time
t = 221.01, df = 28117, p-value < 0.00000000000000022
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6289935 0.6402500
sample estimates:
   mean of x    mean of y 
0.6353091867 0.0006874317 

Which leads me to reject the null in favor of the alternative. The difference in means is different to 0: The sample in before_time is more likely to be affected by condition 1.

Note that this is the whole actual population divided into two groups. An alternative option would be to actually sample of equal size from both sets, and do a t.test for the sample, but this seems like a wrong approach given the fact that I have the actual population at time X.

Best Answer

This is simply a test of binomial proportions. You have some number of elements in each group, $n_1, n_2$, and a certain number of those elements are what you could call "successes" in a Bernoulli trial, i.e. $x_{1_i} = 1$. The sum of all these successes is some number $y_1 = \sum x_{1_i}$. So really all you are doing is testing two binomial proportions $\frac{y_1}{n_1}, \frac{y_2}{n_2}$. There are several ways to do this, but based on your large sample size, the normal approximation should be just fine (assuming the $x$ are independent Bernoulli trials). Look at the prop.test function in R.

Just be aware that with your large sample sizes you are pretty much guaranteed to get a p-value less than the classic 0.05 (unless the numbers are nearly identical). You may want to think about what is a practically significant difference rather than settling for simple statistical significance.