Chi-Squared Test – How to Calculate Chi-Square and P-Value: A Comprehensive Guide

chi-squared-test, p-value, statistical significance

I'd like to test people's preference for model A against three other models: B, C, and D.

I asked 5000 people on a crowdsourcing platform to rate all 4 models (so there are 20,000 ratings overall) on a scale of 1 to 5.

I've grouped the results into 3 categories: 1) model A won, 2) model A lost, and 3) model A and the compared model received equal ratings. The results are as follows:

compared model | A won | equal | A lost
B              |  2208 |  1222 |   1570
C              |  2970 |   538 |   1492
D              |  1890 |  1454 |   1656

I calculated a chi-square statistic with 2 degrees of freedom for each comparison, using expected counts of 2,000 each for winning and losing and 1,000 for a tie, and got chi-square values of 163.37, 812.93, and 271.33 for B, C, and D respectively.

Is my calculation correct? And if so, most online calculators simply return "p-value is less than 0.000001", but is there a way to get specific p-values from chi-squares, however small they may be? Also, since chi-squares are very high and p-values very small (if correct), does it mean that the results are statistically significant?

====== edit ======

The null hypothesis is that all models are equally preferred.

The alternative hypothesis is that model A is better.

====== edit ======

I grouped the results into 3 categories because each worker uses a different range of ratings. For example, one worker may give 5 to good models and 1 to bad models, while another may give 3 to good models and 2 to bad models. As a result, the standard deviation of the raw ratings turned out to be quite large (over 1.0), and I needed to simplify the results to assess significance.

Best Answer

No, your use of a chi-square test is inappropriate. One problem is that the same 5000 people appear more than once in your table, so the counts are not independent. Another problem is that the expected values (2,000, 1,000, and 2,000) were chosen arbitrarily rather than derived from the null hypothesis.

One simple and correct way to test your hypotheses is to compare A with each of the other models one at a time. Comparing A to B, your table shows that 2208 people prefer A and 1570 prefer B. An exact one-sided p-value can be obtained from the binomial probability in R:

pbinom(2208-0.5, prob=0.5, size=2208+1570, lower.tail=FALSE)

which is 1.4e-25. This tests the null hypothesis that people are equally likely to prefer A or B vs the alternative that people are more likely to prefer A. The binomial test conditions on the total number of people who had a preference.
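As a quick check on what the `-0.5` is doing: with `lower.tail=FALSE`, `pbinom` returns P(X > q), so setting q to 2208 - 0.5 gives P(X >= 2208), which includes the observed count itself. A minimal sketch showing that three ways of writing this tail probability agree (the variable names are only for illustration):

```r
# Number of people who expressed a preference between A and B (ties dropped)
n_pref <- 2208 + 1570

# Form used above: P(X > 2207.5), which equals P(X >= 2208)
p_shift <- pbinom(2208 - 0.5, size = n_pref, prob = 0.5, lower.tail = FALSE)

# Same tail with an integer cutoff: P(X > 2207)
p_int <- pbinom(2207, size = n_pref, prob = 0.5, lower.tail = FALSE)

# Same tail by summing the point probabilities directly
p_sum <- sum(dbinom(2208:n_pref, size = n_pref, prob = 0.5))

print(c(shifted = p_shift, integer = p_int, summed = p_sum))
```

Dropping the ties is what "conditions on the total number of people who had a preference" means in practice: only the 2208 + 1570 respondents who ranked one model above the other enter the test.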

Comparing A to C gives:

pbinom(2970-0.5, prob=0.5, size=2970+1492, lower.tail=FALSE)

which is 1.1e-110. Comparing A to D gives:

pbinom(1890-0.5, prob=0.5, size=1890+1656, lower.tail=FALSE)

which is 4.5e-05.
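The same three one-sided tests can also be run with R's built-in `binom.test`, which should reproduce the `pbinom` values quoted above. A sketch, with the counts copied from the table in the question and the vector names chosen only for illustration:

```r
# Counts of respondents who rated A higher (wins) or lower (losses)
# than each compared model; ties are left out
wins   <- c(B = 2208, C = 2970, D = 1890)
losses <- c(B = 1570, C = 1492, D = 1656)

# One-sided exact binomial test of "equally likely to prefer either model"
# against "more likely to prefer A", one comparison at a time
p_values <- sapply(names(wins), function(m) {
  binom.test(wins[[m]], wins[[m]] + losses[[m]],
             p = 0.5, alternative = "greater")$p.value
})

print(signif(p_values, 2))
```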

So, yes, there is strong evidence that the respondents tend to prefer model A over each of the other three alternatives.
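On the narrower question of getting a specific p-value from a chi-square statistic instead of "less than 0.000001": `pchisq` with `lower.tail=FALSE` returns the upper-tail probability directly, and `log.p=TRUE` gives its logarithm for cases where the probability itself would underflow. The sketch below only illustrates that arithmetic. It reuses the expected counts assumed in the question, which are not appropriate for these data for the reasons given above, and it recomputes the three statistics from the table (roughly 163.37, 812.93, and 271.33). The object names are illustrative.

```r
# Observed counts from the question's table, one row per compared model
observed <- rbind(
  B = c(won = 2208, equal = 1222, lost = 1570),
  C = c(won = 2970, equal = 538,  lost = 1492),
  D = c(won = 1890, equal = 1454, lost = 1656)
)

# Expected counts assumed by the asker (an assumption, not derived from data)
expected <- c(won = 2000, equal = 1000, lost = 2000)

# Pearson chi-square statistic for each row: sum of (O - E)^2 / E
chi_sq <- apply(observed, 1, function(obs) sum((obs - expected)^2 / expected))

# Upper-tail p-values with 2 degrees of freedom, and their natural logs
p_upper <- pchisq(chi_sq, df = 2, lower.tail = FALSE)
log_p   <- pchisq(chi_sq, df = 2, lower.tail = FALSE, log.p = TRUE)

print(chi_sq)
print(p_upper)
print(log_p)
```

With 2 degrees of freedom the upper tail has the closed form exp(-x/2), which is a convenient sanity check on the output.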
