I'm trying to compare the percentage of a variable $X$ in 2000 with its percentage in 2012, so I calculated the percentage in each year and took the difference. My question is: if the sample sizes are different, is it still correct to compare the percentages? For example:
Variable X | Count in 2000 | Count in 2012 |
---|---|---|
A | 89 | 114 |
B | 9 | 33 |
Total sample size | 98 | 147 |
So is it correct to say that there is an increase in $A$ even though the sample size in 2012 is larger?
Can you please help me to understand this?
Edits:
The two data sets are for two different cohorts, one before 2000 and the other in 2012. The numbers differ because each represents all the patients in that period. I'm comparing the change in percentage by counting how many values of $X$ are $A$ in 2000, dividing by the total count of $X$, and multiplying by 100 to get the percentage. Then I do the same for $A$ in 2012 and subtract one percentage from the other. Note: $X$ is a categorical variable (for example: $X$ = A,A,A,A,A,B,A,T,A,B).
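$$\%\text{ of } A \text{ in 2000} = \frac{\text{count of } A \text{ in 2000}}{\text{total count of } X \text{ in 2000 } (A + B)} \times 100$$

$$\%\text{ of } A \text{ in 2012} = \frac{\text{count of } A \text{ in 2012}}{\text{total count of } X \text{ in 2012 } (A + B)} \times 100$$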
My concern is whether the difference in percentage that I'm getting reflects a real increase in $A$ in 2012 or is just because the 2012 data set is larger. How can I correct for this difference? Can I do a proportion test to check whether this increase is real? The hypotheses for the proportion test would be:
Null= There is no increase in $A$
Alternative= There is an increase in $A$.
Is it correct to use the proportion test to check if the difference is real and not related to the differences in sample size?
Please, any help would be very much appreciated.
Best Answer
It also seems appropriate to use `prop.test` in R. The proportions of A's in the two years are about $0.908$ and $0.776,$ which are judged to be significantly different proportions because the P-value is small (roughly $0.01$). This is essentially the same as a chi-squared test of homogeneity of the $2\times 2$ table of counts with rows for A and B and columns for 2000 and 2012. [On account of the moderately large sample sizes, one might omit the continuity correction, with parameter `cor=F` (short for `correct = FALSE`) in `prop.test` and in `chisq.test` on the $2 \times 2$ table mentioned just above.]
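As a rough sketch of the calls described above, here is one way they might be written, using only the counts from the table in the question. The variable names (`a_counts`, `totals`, `tab`, and the `res_*` objects) are illustrative choices, not part of the original post.

```r
# Counts of A and total sample sizes, taken from the table in the question.
a_counts <- c(`2000` = 89, `2012` = 114)
totals   <- c(`2000` = 98, `2012` = 147)

# Sample proportions of A in each year (about 0.908 and 0.776).
props <- a_counts / totals
print(round(props, 3))

# Two-sample test of equal proportions; two-sided, and with the
# continuity correction applied by default.
res_corrected <- prop.test(a_counts, totals)

# The same test with the continuity correction omitted.
res_uncorrected <- prop.test(a_counts, totals, correct = FALSE)

# Chi-squared test of homogeneity on the 2 x 2 table of counts
# (rows A and B, columns 2000 and 2012), also without the correction.
tab <- rbind(A = a_counts, B = totals - a_counts)
res_chisq <- chisq.test(tab, correct = FALSE)

# The uncorrected prop.test and chisq.test P-values should agree.
print(c(corrected   = res_corrected$p.value,
        uncorrected = res_uncorrected$p.value,
        chisq       = res_chisq$p.value))
```

Because the test works on proportions, the difference in sample sizes is already accounted for in the standard error. Note that although the count of A rises from 89 to 114, its sample proportion falls; if a one-sided test is wanted, `prop.test` accepts an `alternative` argument (`"less"` or `"greater"`), and the direction should be chosen with that in mind.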