I have two samples of documents where one sample contains documents of a certain category and another sample contains documents of another category. I am analysing the occurrence of a certain word in the documents.
I want to do a hypothesis test to check whether the proportion of this word is the same in both populations.
I know the sample sizes (200, 40) and I know the sample proportions, but I do not know the variances of the populations.
The basic method is to assume that the difference of proportions is normal and to calculate a z-score for the difference of proportions. Is it correct even if I do not know the variances?
The reason I am asking is that when I compare the means of two samples and do not know the population variances, I should use a t-test, which takes into account that the variances are estimated.
Is there a corresponding t-test for significance in the proportions in the population?
Best Answer
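Proportion tests are just particular cases of the z-test and the t-test in which the variable is Bernoulli (so the sample count is binomial and the sample mean is that count divided by the sample size). The variance of the variable therefore depends only on the proportion, p(1-p), and the variance of the sample mean depends only on the proportion and the sample size, p(1-p)/n.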
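To make the point about variances concrete, here is a minimal sketch of the two formulas. The function names and the proportion 0.3 are illustrative assumptions, not values from the question; only the sample sizes 200 and 40 come from it.

```python
def bernoulli_variance(p):
    # Variance of a single Bernoulli(p) observation: depends only on p.
    return p * (1 - p)


def sample_proportion_variance(p, n):
    # Variance of the sample proportion (mean of n Bernoulli draws):
    # depends only on p and the sample size n.
    return p * (1 - p) / n


if __name__ == "__main__":
    p = 0.3  # hypothetical proportion, chosen only for illustration
    for n in (200, 40):  # sample sizes from the question
        print(n, bernoulli_variance(p), sample_proportion_variance(p, n))
```

Once the proportion is fixed, nothing else about the population is needed to get the variance, which is why no separate variance estimate appears in a proportion test.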
With that in mind, the choice should in principle be obvious: if we know the population variance, we are doing a z-test, but if we only know the sample variance, we are doing a t-test. By that criterion, the only usual proportion test that is a true z-test is the one-sample proportion test, because the null hypothesis fixes the proportion at a given (known) value and therefore fixes the variance too.
Nevertheless, elementary statistics textbooks that present proportion tests typically skip any reference to the t-test and treat them simply as z-tests. That might seem to contradict the theory, but it is backed by a very strong practical reason.
The practical reason is that samples for proportion tests are usually very large, for two reasons:
As a result, sample sizes for proportion tests usually run to the hundreds or thousands, and since Student's t distribution converges rapidly to the normal as the number of degrees of freedom grows, there is no practical difference between performing a t-test and performing a much simpler z-test.
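The convergence can be checked numerically. The following is a rough sketch using only the Python standard library: it integrates the t density with Simpson's rule and inverts the CDF by bisection, then compares two-sided 5% critical values with the normal one. The helper names are made up for this illustration, and the choice of 238 degrees of freedom (200 + 40 - 2, as in a pooled two-sample t-test) is an assumption about how one would count them here.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    # Student's t density; log-gamma avoids overflow for large df.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    # Simpson's rule on [0, |x|] (steps must be even), then use symmetry.
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(prob, df):
    # Bisection on the CDF; assumes 0.5 < prob < 1 and a quantile below 50.
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    z_crit = NormalDist().inv_cdf(0.975)
    print(f"normal: {z_crit:.4f}")
    for df in (5, 10, 30, 238):
        print(f"t, df={df}: {t_quantile(0.975, df):.4f}")
```

With a few hundred degrees of freedom the t critical value differs from the normal one only in the second decimal place, which is the sense in which the two tests are practically interchangeable.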
In summary: for reasonable sample sizes, you can go with the z-score. In fact, for your sample sizes of (200, 40), I would be more worried that the very small second sample could leave your test too underpowered to be useful than about the tiny difference between the t-score and the z-score.
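For reference, a two-proportion z-test can be sketched as follows. This uses the common pooled-proportion form of the standard error under the null hypothesis of equal proportions. The counts 60 and 18 are invented for illustration, since the question does not state the observed proportions; only the sample sizes 200 and 40 are taken from it.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    # x1, x2: number of documents containing the word in each sample.
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 (equal proportions) both samples estimate the same p,
    # so the standard error uses the pooled estimate.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 60 of 200 documents vs 18 of 40 documents.
    z, p_value = two_proportion_z_test(60, 200, 18, 40)
    print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

Note how the 1/40 term dominates the standard error: the smaller sample limits the precision of the comparison far more than the choice between the normal and t reference distributions does.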
And just as an end note: all this also holds for confidence intervals on proportions.
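As an illustration of the confidence-interval case, here is a minimal sketch of a normal-approximation (Wald) interval for the difference of two proportions. Unlike the test, the interval does not assume equal proportions, so the standard error is unpooled. The counts are again hypothetical and the function name is made up for this example.

```python
import math
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, level=0.95):
    # Wald interval for p1 - p2 using the normal critical value.
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se


if __name__ == "__main__":
    # Hypothetical counts: 60 of 200 documents vs 18 of 40 documents.
    low, high = diff_proportion_ci(60, 200, 18, 40)
    print(f"95% CI for p1 - p2: ({low:.4f}, {high:.4f})")
```

Replacing the normal critical value with a t critical value at a couple of hundred degrees of freedom would widen this interval only marginally.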