Solved – two samples, hypothesis test of proportions: t-test or z-test

proportion; t-test; z-test

I have two samples of documents where one sample contains documents of a certain category and another sample contains documents of another category. I am analysing the occurrence of a certain word in the documents.

I want to do a hypothesis test to check whether the proportion of this word is the same in both populations.
I know the sample sizes (200, 40) and the sample proportions, but I do not know the population variances.

The basic method is to assume that the difference of proportions is normal and to calculate a z-score for the difference of proportions. Is it correct even if I do not know the variances?

The reason I am asking is that when I compare the means of two samples and do not know the population variances, I should use a t-test to account for that uncertainty.

Is there a corresponding t-test for comparing population proportions?

Best Answer

Proportion tests are just particular cases of the z-test and the t-test in which the variable is Bernoulli (so the number of successes is binomial). The variance of the variable therefore depends only on the proportion, and the variance of the sample mean depends only on the proportion and the sample size.
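To make that dependence explicit, here is a short worked derivation. The notation $X_i$ for the individual 0/1 observations is introduced here for illustration, and the observations are assumed independent:

$$ X_i \sim \mathrm{Bernoulli}(p): \quad \mathrm{E}[X_i]=p, \quad \mathrm{Var}(X_i)=p(1-p), \qquad \hat p=\frac{1}{n}\sum_{i=1}^{n} X_i, \quad \mathrm{Var}(\hat p)=\frac{p(1-p)}{n} $$

So once a value of $p$ is fixed, for example by a null hypothesis, the variance is fixed as well and needs no separate estimate.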

With that in mind, the choice should in principle be obvious: if we know the population variance we are doing a z-test, and if we only know the sample variance we are doing a t-test. By that logic, the only common proportion test that is a true z-test is the one-sample test, because its null hypothesis states that the proportion equals a given (known) value, which in turn fixes the variance.

Even so, elementary statistics textbooks that present proportion tests typically skip any reference to the t-test and treat them simply as z-tests. That might seem to contradict the theory, but it is backed by a very strong practical reason.

The practical reason is that samples for proportion tests are usually very large, for two reasons:

First, you can't assume normality if the sample is not large.
Second, even if the samples are large enough to yield normally distributed means, the power of a proportion test is very small unless the sample is very large.

So sample sizes for proportion tests usually run to hundreds or thousands, and since Student's t distribution rapidly converges to the normal as the number of degrees of freedom grows, there is no practical difference between performing a t-test and performing a much simpler z-test.
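The convergence can be checked numerically. The following is a minimal sketch in Python using only the standard library. Because the standard library has no t distribution, the helper functions `t_pdf`, `t_cdf` and `t_quantile` are written here for illustration (Simpson's rule plus bisection) and are not from any statistics package.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus the lower half (0.5)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(prob, df):
    """Quantile for prob >= 0.5, found by bisection on t_cdf over [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Two-sided 5% critical values: t for several degrees of freedom versus z.
z_crit = NormalDist().inv_cdf(0.975)
for df in (5, 40, 120, 1000):
    print(df, round(t_quantile(0.975, df), 3), "vs z:", round(z_crit, 3))
```

With about 40 degrees of freedom the t critical value is roughly 2.02 against 1.96 for the normal, and the gap keeps shrinking as the degrees of freedom grow.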

In summary: for reasonable sample sizes, you can go for the z-score. In fact, for your sample sizes of (200, 40) I would be more worried that the very small second sample could leave your test too underpowered to be actually helpful than about the tiny difference between the t-score and the z-score.
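As a rough illustration of both points, here is a sketch in Python (standard library only) of the pooled two-proportion z-test and a normal-approximation power calculation. The question does not give the observed proportions, so the counts 60/200 and 16/40 are hypothetical. The function names `two_prop_ztest` and `approx_power` are likewise made up for this example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_prop_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for H0: p1 == p2. Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * (1 - STD_NORMAL.cdf(abs(z)))


def approx_power(p1, p2, n1, n2, alpha=0.05):
    """Normal-approximation power of the two-sided pooled z-test."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    delta = abs(p1 - p2)
    upper = 1 - STD_NORMAL.cdf((z_crit * se_null - delta) / se_alt)
    lower = STD_NORMAL.cdf((-z_crit * se_null - delta) / se_alt)
    return upper + lower


# Hypothetical data: the word appears in 60 of 200 and 16 of 40 documents.
z, p_value = two_prop_ztest(60, 200, 16, 40)
print("z =", round(z, 3), "p =", round(p_value, 3))

# Power to detect true proportions 0.30 vs 0.40 at the question's sample sizes,
# and at sample sizes ten times larger for comparison.
print("power (200, 40):  ", round(approx_power(0.30, 0.40, 200, 40), 3))
print("power (2000, 400):", round(approx_power(0.30, 0.40, 2000, 400), 3))
```

Under these assumed proportions the approximate power at (200, 40) comes out at roughly one in four, which is the concern raised above. The exact figure depends entirely on the true proportions.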

And just as an end note: all this also holds for confidence intervals on proportions.

Related Solutions

Chi-Squared Test – Understanding Relationship with Test of Equal Proportions

Very short answer:

The chi-squared test (chisq.test() in R) compares the observed frequencies in each cell of a contingency table with the expected frequencies (computed under independence as the product of the marginal totals divided by the grand total). It is used to determine whether the deviations between the observed and the expected counts are too large to be attributed to chance. Departure from independence is easily checked by inspecting residuals (try ?mosaicplot or ?assocplot, but also look at the vcd package). Use fisher.test() for an exact test (relying on the hypergeometric distribution).

The prop.test() function in R tests whether proportions are equal across groups or whether they differ from given theoretical probabilities. It is referred to as a $z$-test because the test statistic looks like this:

$$ z=\frac{(f_1-f_2)}{\sqrt{\hat p \left(1-\hat p \right) \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} $$

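where $f_1$ and $f_2$ are the observed proportions, $\hat p=(x_1+x_2)/(n_1+n_2)$ is the pooled proportion ($x_i$ being the number of successes out of $n_i$), and the indices $(1,2)$ refer to the first and second line of your table. In a two-way contingency table where $H_0:\; p_1=p_2$, this should yield comparable results to the ordinary $\chi^2$ test: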

> tab <- matrix(c(100, 80, 20, 10), ncol = 2)
> chisq.test(tab)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab 
X-squared = 0.8823, df = 1, p-value = 0.3476

> prop.test(tab)

    2-sample test for equality of proportions with continuity correction

data:  tab 
X-squared = 0.8823, df = 1, p-value = 0.3476
alternative hypothesis: two.sided 
95 percent confidence interval:
 -0.15834617  0.04723506 
sample estimates:
   prop 1    prop 2 
0.8333333 0.8888889
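The agreement between the two R outputs can also be checked by hand. The sketch below, in Python with only the standard library, recomputes the Yates-corrected statistic for the same counts and shows that, without the continuity correction, the chi-squared statistic is the square of the pooled $z$ statistic. The function names are made up for this example, and the shortcut formula is specific to 2x2 tables.

```python
from math import sqrt, erfc

# Same counts as the R matrix above: rows are groups, columns are outcomes.
table = [[100, 20], [80, 10]]


def chi2_2x2(table, yates=True):
    """Pearson chi-squared statistic for a 2x2 table (shortcut formula)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction, floored at zero.
        diff = max(0.0, diff - n / 2)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def pooled_z(table):
    """z statistic for equal proportions, using the pooled proportion."""
    (a, b), (c, d) = table
    n1, n2 = a + b, c + d
    f1, f2 = a / n1, c / n2
    p_hat = (a + c) / (n1 + n2)
    return (f1 - f2) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))


def chi2_pvalue_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return erfc(sqrt(stat / 2))


corrected = chi2_2x2(table)
print("X-squared (Yates):", round(corrected, 4),
      "p-value:", round(chi2_pvalue_df1(corrected), 4))
print("z^2:", pooled_z(table) ** 2,
      "uncorrected X-squared:", chi2_2x2(table, yates=False))
```

The first line should reproduce the X-squared and p-value that both R calls report. The second shows the two uncorrected statistics agreeing up to floating-point error.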

For analysis of discrete data with R, I highly recommend R (and S-PLUS) Manual to Accompany Agresti’s Categorical Data Analysis (2002), by Laura Thompson.

Solved – Hypothesis testing: difference between proportions

If you truly have the whole population of interest, there's no need for a hypothesis test at all. The point of hypothesis tests is to make inferences about populations. If you have the population you don't need to infer its characteristics from a sample ... you simply look at it. The null is either true or false and you can say which is true (for certain).

(If you have a large fraction of the population of interest, for some tests you have to worry about finite-population effects.)

To compare two proportions, some commonly used tests include the two-proportion z-test, the chi-square test and Fisher's exact test.

Fisher's exact test conditions on both margins of the 2x2 table (and, for that matter, the chi-square test is also usable in that situation). Because the test conditions on both margins, the finite-population issue should be a non-issue: it is taken care of by the conditioning.
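To show what conditioning on both margins means in practice, here is a minimal Python sketch (standard library only) of a two-sided Fisher exact test. With the margins fixed, the top-left cell follows a hypergeometric distribution, and the p-value sums the probabilities of all tables no more likely than the observed one. The function name is made up for this example, and the small relative tolerance for ties is an implementation choice, not something stated in the answer.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n_tables = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # given the fixed row and column totals.
        return comb(row1, k) * comb(row2, col1 - k) / n_tables

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over tables at most as probable as the observed one; the small
    # relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(k_min, k_max + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Classic small example: margins of 4 and 4, with 3 in the top-left cell.
print(fisher_exact_two_sided([[3, 1], [1, 3]]))

# The 2x2 table used in the chi-squared example earlier on this page.
print(fisher_exact_two_sided([[100, 20], [80, 10]]))
```

Since every probability is computed from the fixed margins alone, nothing in the calculation refers to the size of a wider population.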
