Solved – two samples, hypothesis test of proportions t or z test

proportion; t-test; z-test

I have two samples of documents where one sample contains documents of a certain category and another sample contains documents of another category. I am analysing the occurrence of a certain word in the documents.

I want to do a hypothesis test to check whether the proportion for this word is the same in both populations.
I know the sample sizes (200,40) and I know the sample proportions, but I do not know the population variances.

The basic method is to assume that the difference of proportions is normal and to calculate a z-score for the difference of proportions. Is it correct even if I do not know the variances?

The reason I am asking is that when I compare the means of two samples and do not know the population variances, I should use a t-test, which takes into account that the variances are estimated.

Is there a corresponding t-test for the significance of a difference in population proportions?

Best Answer

Proportion tests are just particular cases of the z-test and the t-test in which the variable is Bernoulli (so its sample mean is a scaled binomial). The variance of the variable therefore depends only on the proportion, and the variance of the sample mean depends only on the proportion and the sample size.
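To make the dependence explicit, here are the standard formulas, written with symbols that are not in the original post: $X_i$ is the 0/1 indicator for document $i$, $p$ the population proportion, $n$ the sample size, and $\hat{p}$ the sample proportion.

$$
X_i \sim \mathrm{Bernoulli}(p), \qquad \operatorname{Var}(X_i) = p(1-p)
$$

$$
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad n\hat{p} \sim \mathrm{Binomial}(n, p), \qquad \operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}
$$

Once $p$ is fixed, nothing else is left to estimate in the variance, which is why it is not a separate unknown as it is for a general mean.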

With that in mind, in principle the choice should be obvious: if we know the population variance, we are doing a z-test, but if we only know the sample variance, we are doing a t-test. By that rule, the only usual proportion test that is a true z-test is the one-sample proportion test, because the null hypothesis states that the proportion is a given (known) value, and that value fixes the variance.

Nevertheless, elementary statistics textbooks presenting proportion tests typically skip any reference to the t-test and treat them simply as z-tests. That might seem to contradict the theory, but it is backed by a very strong practical reason.

The practical reason is that samples for proportion tests are usually very large, for two reasons:

  • First, you can't assume normality of the sample proportion if the sample is not large.
  • Second, even if the samples are large enough to yield normally distributed means, the power of a proportion test is very low unless the sample is very large.
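The second point can be illustrated with a rough power calculation. This is only a sketch under the normal approximation. The proportions 0.30 and 0.40 are hypothetical (the question does not state the observed ones), and `approx_power` is a name made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation: the pooled standard error under H0
    sets the rejection threshold, and the unpooled standard error
    describes the spread of the difference under the alternative.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = abs(p1 - p2)
    # Probability of landing in the upper or the lower rejection region.
    upper = norm.cdf((diff - z_crit * se_null) / se_alt)
    lower = norm.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower


# Hypothetical true proportions; sample sizes from the question, then 10x larger.
for n1, n2 in [(200, 40), (2000, 400)]:
    print(n1, n2, round(approx_power(0.30, 0.40, n1, n2), 3))
```

Under these assumed proportions, the (200,40) design detects a 10-point difference only about a quarter of the time, while sizes ten times larger detect it most of the time.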

Consequently, sample sizes for proportion tests usually run to hundreds or thousands, and since Student's t distribution rapidly converges to the normal as the number of degrees of freedom grows, there is no practical difference between performing a t-test and a much simpler z-test.
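To see how fast that convergence is, the sketch below compares two-sided 5% critical values of the t distribution with the normal one. It uses only the standard library, so the t quantile is computed by plain numerical integration and bisection. The helper names are made up for this example, and treating the degrees of freedom as n1 + n2 - 2 = 238 (pooled) or 39 (smaller sample minus one) is an assumption for illustration, not something stated in the question.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry around zero."""
    if x < 0:
        return 1 - t_cdf(-x, df, steps)
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(prob, df):
    """Upper-tail quantile (prob > 0.5) found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z_crit = NormalDist().inv_cdf(0.975)
print("normal", round(z_crit, 4))
for df in (10, 39, 238, 2000):
    print("t, df =", df, round(t_quantile(0.975, df), 4))
```

With a few hundred degrees of freedom the two critical values differ only in the second decimal place, which is the "no practical difference" the paragraph refers to.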

In summary: for reasonable sample sizes, you can go with the z-score. In fact, for your sample sizes of (200,40), I would be more worried that the very small second sample could leave your test too underpowered to be actually helpful than about the tiny difference between the t-score and the z-score.
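For concreteness, a minimal sketch of the pooled two-proportion z-test might look like the following. The counts (60 of 200 and 16 of 40 documents containing the word) are invented for illustration, since the question does not give the observed proportions, and `two_proportion_z_test` is a hypothetical helper rather than a library function.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion.

    x1, x2 are the numbers of documents containing the word;
    n1, n2 are the sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    if pooled in (0.0, 1.0):
        raise ValueError("pooled proportion is 0 or 1; the z-score is undefined")
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts with the sample sizes from the question.
z, p_value = two_proportion_z_test(60, 200, 16, 40)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

The pooled proportion is used in the standard error because, under the null hypothesis, both samples share a single proportion.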

And just as an end note: all this also holds for confidence intervals on proportions.