Is a paired t test sufficient to show that two data sets are similar?

paired-data, similarities, t-test

Let me give the details of my problem setting. We have a learning algorithm that can take a collection of algebra problems and automatically generate similar ones. Naturally, an important question is whether the generated problems are indeed similar to the original ones. We invited 20 students to answer 10 questions from each problem set (original problems and generated problems), and then used a paired t test to compare each student's error rate across the two sets. The p-value came out to 0.1, with mean error rates of 10.3% (original) and 14.2% (generated). My question: does this suffice to show that the two sets of questions are similar? If not, why, and what should we do? Thanks very much.
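
For reference, the computation described here can be reproduced in a few lines of Python; the data below are synthetic (drawn to roughly match the reported means) and purely illustrative:

```python
# Minimal sketch of the paired t test described above, on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_students = 20

# Hypothetical per-student error rates on the two 10-question sets.
original = rng.binomial(10, 0.103, size=n_students) / 10
generated = rng.binomial(10, 0.142, size=n_students) / 10

# Paired t test on the per-student differences.
t_stat, p_value = stats.ttest_rel(generated, original)
print(f"mean original = {original.mean():.3f}, mean generated = {generated.mean():.3f}")
print(f"paired t = {t_stat:.3f}, p = {p_value:.3f}")
```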

Best Answer

No, it doesn't establish similarity, and indeed, it doesn't come close to answering the right question.

i) Failure to reject the null doesn't imply it's true. It may simply mean your sample was too small to detect the difference.

ii) You're not actually even interested in the truth of the null you're testing. I presume you don't believe the distribution of errors is actually identical; you must know that would be an astronomically unlikely situation, since the new problems will have at least subtle differences from the originals. You yourself use the word "similar", and the hypothesis you're testing isn't about that. With a large enough sample size, the null would be rejected for any difference in means, no matter how trivial.


You may want to consider whether a confidence interval for the difference is a better tool. You can specify the largest difference in error rate that you would still regard as consistent with "similar", and see whether the confidence interval lies entirely within that margin.
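
As a sketch of that approach (synthetic data as in the question's setup; the 5-percentage-point margin is a hypothetical choice, not a recommendation):

```python
# Sketch: 95% confidence interval for the mean paired difference,
# compared against a pre-specified "similar enough" margin.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.binomial(10, 0.103, size=20) / 10   # hypothetical error rates
generated = rng.binomial(10, 0.142, size=20) / 10

diffs = generated - original
n = len(diffs)
mean_diff = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)

lo, hi = mean_diff - t_crit * se, mean_diff + t_crit * se
margin = 0.05  # hypothetical: within 5 percentage points counts as "similar"
print(f"95% CI for mean difference: ({lo:.3f}, {hi:.3f})")
print("within margin" if -margin < lo and hi < margin
      else "cannot rule out a meaningful difference")
```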

Alternatively, you might consider an equivalence test.
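
A common version is the TOST (two one-sided tests) procedure. Here is a minimal paired sketch, again on synthetic data with a hypothetical margin (I believe statsmodels also provides a paired TOST, if you'd rather not roll your own):

```python
# Sketch of a paired TOST (two one-sided tests) equivalence test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.binomial(10, 0.103, size=20) / 10   # hypothetical error rates
generated = rng.binomial(10, 0.142, size=20) / 10

diffs = generated - original
n = len(diffs)
mean_diff = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)
df = n - 1
delta = 0.05  # hypothetical equivalence margin (5 percentage points)

# H0a: mu_diff <= -delta  vs  H1a: mu_diff > -delta  (upper-tail test)
p_lower = stats.t.sf((mean_diff + delta) / se, df)
# H0b: mu_diff >= +delta  vs  H1b: mu_diff < +delta  (lower-tail test)
p_upper = stats.t.cdf((mean_diff - delta) / se, df)

p_tost = max(p_lower, p_upper)  # reject both to conclude equivalence
print(f"TOST p = {p_tost:.3f}; equivalence at the 5% level: {p_tost < 0.05}")
```

Note that rejecting at the 5% level in a TOST corresponds to the 90% confidence interval falling entirely inside (-delta, +delta), so this is really the interval idea above expressed as a test.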

(I'm not 100% sure I'd use a t-test for this, since the number of errors is a small count out of 10 and hence quite discrete, and there's also a potential issue with heteroskedasticity -- the variance of a difference will tend to be larger for students whose mean number of errors is larger -- but it may do well enough.)


Of course, a test or interval for the mean difference only detects situations where that difference is away from zero.

It's possible that the variation is such that some people score higher and some score lower (for example, weaker students may tend to do worse on the generated problems than on the original ones, while stronger students tend to do better), and a paired test of means won't detect that. If you want to be able to identify that kind of change, you will want a different approach; one simple diagnostic is sketched below.
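
As one simple (and imperfect) diagnostic along those lines, you could look at whether each student's change in error rate is related to their performance on the original set; a sketch on synthetic data:

```python
# Sketch: does the original-to-generated change depend on baseline performance?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.binomial(10, 0.103, size=20) / 10   # hypothetical error rates
generated = rng.binomial(10, 0.142, size=20) / 10

diffs = generated - original
# A strong correlation between baseline error rate and change would suggest
# weaker and stronger students are shifting in different directions.
r, p = stats.pearsonr(original, diffs)
print(f"baseline vs. change: r = {r:.3f}, p = {p:.3f}")
```

Bear in mind that baseline-versus-change correlations are subject to regression-to-the-mean artifacts, so treat this as exploratory rather than confirmatory.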
