This is in response to an old question, and a good answer has already been provided elsewhere by jbowman and StasK to a very similar (but better-defined) problem. I refer anyone who stumbles on this to the following question (and answers):
Test for significant difference in ratios of normally distributed random variables
The permutation test should be easy to implement in most statistical tools and many programming languages. Additionally, it does not assume that you have count data, which means you can use a ratio of rates or other appropriate metrics.
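As a minimal sketch of such a permutation test (in R, with made-up data; the vectors x and y and the choice of statistic are my own illustration, not taken from the linked thread):

# Hypothetical data: per-subject ratio measurements for two groups
set.seed(1)
x <- rnorm(20, mean = 1.0, sd = 0.2)
y <- rnorm(20, mean = 1.1, sd = 0.2)

observed <- mean(x) - mean(y)   # observed difference in mean ratios
pooled   <- c(x, y)
n_x      <- length(x)

# Recompute the statistic under random relabellings of group membership
perm_stats <- replicate(10000, {
  idx <- sample(length(pooled), n_x)
  mean(pooled[idx]) - mean(pooled[-idx])
})

mean(abs(perm_stats) >= abs(observed))   # two-sided p-value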
For rankings given by different judges, one can use the Friedman test: http://en.wikipedia.org/wiki/Friedman_test
You can convert the ratings from "very bad" to "very good" into the numeric values -2, -1, 0, 1 and 2. Then put the data in long form and apply friedman.test with customer as the blocking factor.
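The long-form data frame mm printed below can be constructed directly from these numeric ratings (a sketch that simply reproduces the printed values):

mm <- data.frame(
  customer = rep(1:15, times = 2),
  variable = rep(c("product1", "product2"), each = 15),
  value    = c( 2,  1,  0,  2, -1,  0, -1,  2,  1,  1,  0,  2,  1,  2,  2,   # product1 ratings
               -2, -1, -1,  0,  2,  1,  0, -2,  1,  2,  0,  1,  1,  0,  0))  # product2 ratings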
> mm
   customer variable value
1         1 product1     2
2         2 product1     1
3         3 product1     0
4         4 product1     2
5         5 product1    -1
6         6 product1     0
7         7 product1    -1
8         8 product1     2
9         9 product1     1
10       10 product1     1
11       11 product1     0
12       12 product1     2
13       13 product1     1
14       14 product1     2
15       15 product1     2
16        1 product2    -2
17        2 product2    -1
18        3 product2    -1
19        4 product2     0
20        5 product2     2
21        6 product2     1
22        7 product2     0
23        8 product2    -2
24        9 product2     1
25       10 product2     2
26       11 product2     0
27       12 product2     1
28       13 product2     1
29       14 product2     0
30       15 product2     0
> friedman.test(value~variable|customer, data=mm)

        Friedman rank sum test

data:  value and variable and customer
Friedman chi-squared = 1.3333, df = 1, p-value = 0.2482
The difference in rankings between the two products is not significant.
Edit:
Following is the output of the corresponding linear regression, with customer included as a blocking factor; the product effect (variableproduct2) is likewise non-significant:
> summary(lm(value~variable+factor(customer), data=mm))
Call:
lm(formula = value ~ variable + factor(customer), data = mm)

Residuals:
   Min     1Q Median     3Q    Max 
  -1.9   -0.6    0.0    0.6    1.9 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         4.000e-01  9.990e-01   0.400    0.695
variableproduct2   -8.000e-01  4.995e-01  -1.602    0.132
factor(customer)2   6.248e-16  1.368e+00   0.000    1.000
factor(customer)3  -5.000e-01  1.368e+00  -0.365    0.720
factor(customer)4   1.000e+00  1.368e+00   0.731    0.477
factor(customer)5   5.000e-01  1.368e+00   0.365    0.720
factor(customer)6   5.000e-01  1.368e+00   0.365    0.720
factor(customer)7  -5.000e-01  1.368e+00  -0.365    0.720
factor(customer)8   9.645e-16  1.368e+00   0.000    1.000
factor(customer)9   1.000e+00  1.368e+00   0.731    0.477
factor(customer)10  1.500e+00  1.368e+00   1.096    0.291
factor(customer)11  7.581e-16  1.368e+00   0.000    1.000
factor(customer)12  1.500e+00  1.368e+00   1.096    0.291
factor(customer)13  1.000e+00  1.368e+00   0.731    0.477
factor(customer)14  1.000e+00  1.368e+00   0.731    0.477
factor(customer)15  1.000e+00  1.368e+00   0.731    0.477

Residual standard error: 1.368 on 14 degrees of freedom
Multiple R-squared:  0.3972,    Adjusted R-squared: -0.2486 
F-statistic: 0.6151 on 15 and 14 DF,  p-value: 0.8194
Best Answer
The first thing you will need to think about is what it means (quantitatively) to have "good precision" in such a device. I would suggest that, in a medical context, the goal is to avoid temperature deviations into a range that is dangerous for the patient, so "good precision" will probably translate into avoiding dangerously low or high temperatures. This means you will be looking for a metric that heavily penalises large deviations from your optimal temperature of $37^\circ$C. In view of this, a measure based on fluctuations in the median temperature will be a poor measure of precision, whereas measures that highlight large deviations will be better.
When you formulate this kind of metric, you are implicitly adopting a "penalty function" that penalises temperatures deviating from the desired temperature. One option is to measure "precision" by the variance around the desired temperature (treating this as the fixed mean for the variance calculation), so that lower variance means better precision. The variance penalises by squared error, which gives reasonable penalisation of large deviations. Another option is to penalise large deviations even more heavily (e.g., cubed absolute error). Yet another option is simply to measure the amount of time each device leaves the patient outside the medically safe temperature range. In any case, whatever you choose should reflect the perceived dangers of deviation from the desired temperature.
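As a minimal sketch of these three candidate metrics (in R; the vector temps, the 36-38°C safe range, and all names are my own illustrative assumptions, not from the question):

# temps: numeric vector of one device's temperature readings (assumed input)
target  <- 37
safe_lo <- 36   # hypothetical lower safety limit
safe_hi <- 38   # hypothetical upper safety limit

# Variance around the fixed target (not around the sample mean)
var_around_target <- mean((temps - target)^2)

# Heavier penalty on large deviations: mean cubed absolute error
cubed_penalty <- mean(abs(temps - target)^3)

# Proportion of readings outside the medically safe range
prop_unsafe <- mean(temps < safe_lo | temps > safe_hi)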
Once you have determined what constitutes a metric of "good precision", you will be formulating some kind of "heteroscedasticity test", in the wider sense of allowing whatever measure of precision you are using. I'm not sure I agree with whuber's comment about adjusting for autocorrelation. It really depends on your formulation of loss; after all, staying in a high temperature range for an extended period of time could be exactly the thing that is most dangerous, so if you adjust to account for autocorrelation, you might end up failing to penalise highly dangerous outcomes sufficiently.
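To turn the chosen metric into a comparison between two devices, one could reuse the permutation idea from earlier in this thread (a sketch; the reading vectors temps_a and temps_b are my own assumed inputs):

# temps_a, temps_b: temperature readings from devices A and B (assumed inputs)
stat <- function(v) mean((v - 37)^2)   # precision metric: variance around the target
observed <- stat(temps_a) - stat(temps_b)

pooled <- c(temps_a, temps_b)
n_a    <- length(temps_a)
perm_stats <- replicate(10000, {
  idx <- sample(length(pooled), n_a)
  stat(pooled[idx]) - stat(pooled[-idx])
})

mean(abs(perm_stats) >= abs(observed))   # two-sided p-value

Note that permuting individual readings treats them as exchangeable, which sidesteps rather than resolves the autocorrelation issue above; permuting whole blocks of consecutive readings would be a more defensible variant for strongly autocorrelated series.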