Solved – Proper statistical analysis for similarity of two paired datasets

calibration, hypothesis testing, similarities, t-test

I am an engineering student and, as part of an undergraduate research project, I created a device that measures a certain value. However, there is a commercial device that measures that same value.

Here's what I want: to determine whether the values my device measures are similar enough to those obtained by the commercial device.

So far, I've looked at the t-test. However, most of the tutorials I see involve some sort of causal link between the two data sets: data set A is taken before some intervention, and data set B is taken after it. In my problem, data from my device can be taken independently of data from the commercial device.

Second, the definition of the t-test from Wikipedia is that "It is used to determine whether two sets of data are significantly different from each other." However, what I want is to show that my two data sets are similar to each other.

So basically, here are my questions.

  1. Is the causality I mentioned necessary for a t-test?
  2. Is a t-test even the proper statistical analysis for this kind of problem?
  3. If it is, how can I make it measure similarity rather than difference? If it is not, could you point me in the right direction?

The sample size is around 20, if that is a relevant detail to this question.

Best Answer

Aside: It sounds like your underlying problem (though not your direct questions) is related to calibration, on which a fair bit has been written. If your device's readings are not as close as you'd like to the commercial one's, that may not matter so much, as long as it responds in a fairly consistent way. A calibration curve (in most cases, just a line) is often used to adjust a device's readings to match some standard, so that the scale on which the readings are reported can be corrected for any such consistent bias. The methodology of calibration may therefore be of use to you if your device has some bias compared to the commercial one.
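
As a minimal sketch of that idea in Python (the arrays `my_device` and `commercial` are hypothetical paired readings, not data from the question), an ordinary least-squares line can map your device's scale onto the commercial device's:

```python
import numpy as np

# Hypothetical paired readings: each entry is the same sample measured
# once by each device. Replace these with your own data.
my_device = np.array([10.2, 15.1, 19.8, 25.3, 30.1])
commercial = np.array([10.0, 15.0, 20.0, 25.0, 30.0])

# Fit a straight-line calibration curve: commercial ~ slope * mine + intercept.
slope, intercept = np.polyfit(my_device, commercial, deg=1)

def calibrate(reading):
    """Correct a new reading from my device for its consistent bias."""
    return slope * reading + intercept

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print("corrected readings:", calibrate(my_device))
```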


Your direct question sounds like you probably want equivalence testing; in particular, a two one-sided tests (TOST) procedure.

The usual way of setting this up is to choose a pair of equivalence bounds around your gold-standard measurement (values that are "close enough" to call equivalent) and then show that you would reject both the hypothesis that the population mean of your measurement lies above the upper bound and the hypothesis that it lies below the lower bound (and so conclude that it lies between the bounds).

[This can also be recast as checking whether a two-sided confidence interval for the parameter lies entirely within the pair of equivalence bounds.]
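
A minimal sketch of such a paired TOST, assuming hypothetical paired readings and an equivalence bound `delta`; note that `delta` is not a statistical quantity, it must come from subject-matter judgment about what counts as "close enough":

```python
import numpy as np
from scipy import stats

# Hypothetical paired readings: the same ~20 samples measured once by
# each device. Replace these with your actual data.
rng = np.random.default_rng(0)
commercial = rng.uniform(10, 30, size=20)
my_device = commercial + rng.normal(0.1, 0.5, size=20)  # small bias + noise

diff = my_device - commercial   # paired differences
delta = 1.0                     # equivalence bound: "close enough" = within 1 unit
alpha = 0.05

# Two one-sided t-tests on the mean paired difference:
#   H0a: mean(diff) <= -delta   vs   H1a: mean(diff) > -delta
p_lower = stats.ttest_1samp(diff, popmean=-delta, alternative="greater").pvalue
#   H0b: mean(diff) >= +delta   vs   H1b: mean(diff) < +delta
p_upper = stats.ttest_1samp(diff, popmean=+delta, alternative="less").pvalue

# Equivalence is declared only if BOTH one-sided tests reject, so the
# overall TOST p-value is the larger of the two.
p_tost = max(p_lower, p_upper)
print("equivalent" if p_tost < alpha else "not shown equivalent",
      f"(TOST p = {p_tost:.4f})")

# The confidence-interval recasting above: the (1 - 2*alpha) two-sided CI
# for the mean difference must lie entirely inside (-delta, +delta).
lo, hi = stats.t.interval(1 - 2 * alpha, df=len(diff) - 1,
                          loc=diff.mean(), scale=stats.sem(diff))
print(f"90% CI for mean difference: ({lo:.3f}, {hi:.3f})")
```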

See for example Walker & Nowacki (2011) [1]; there's a discussion of TOST in industrial applications in Richter & Richter (2002) [2].

However, a caveat: Presumably you're testing your device not at one value but across its range. Given that the bias may be larger at some values than others (indeed, it's possible to be biased low in one place and high in another), you probably want to look at equivalence at each value of the standard device rather than in a single overall TOST (in that case establishing equivalence bands, which may not be equally wide at every value -- e.g. if equivalence is defined in percentage terms). This brings us back nearer to the calibration problem I mentioned at the start.
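
As a hedged sketch of that per-value idea (the replicate readings and the 5% band below are hypothetical assumptions, not anything from the question), the TOST can simply be repeated at each reference level with bounds that scale with the level:

```python
import numpy as np
from scipy import stats

def tost_paired(diff, delta, alpha=0.05):
    """Paired TOST against symmetric bounds +/- delta; True if equivalent."""
    p_low = stats.ttest_1samp(diff, -delta, alternative="greater").pvalue
    p_high = stats.ttest_1samp(diff, +delta, alternative="less").pvalue
    return max(p_low, p_high) < alpha

# Hypothetical replicate paired readings at several reference levels.
rng = np.random.default_rng(1)
for level in (10.0, 20.0, 30.0):
    commercial = np.full(20, level)
    my_device = level + rng.normal(0.2, 0.02 * level, size=20)
    delta = 0.05 * level   # equivalence band: within 5% of the reference value
    ok = tost_paired(my_device - commercial, delta)
    print(f"level {level:5.1f}: equivalent within ±5%? {ok}")
```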

[1]: Walker, E., & Nowacki, A. S. (2011), "Understanding Equivalence and Noninferiority Testing," Journal of General Internal Medicine, 26(2), 192–196. http://doi.org/10.1007/s11606-010-1513-8

(Ignore the 'noninferiority' material there; you're just after the equivalence part.)

[2]: Richter, S. J., & Richter, C. (2002), "A Method for Determining Equivalence in Industrial Applications," Quality Engineering, 14(3), 375–380.