Solved – Quantify the difference between two samples

descriptive statistics

My problem is this: I need to do some analysis of student test results, and I am looking for tools to help. What I have are collections of scores for a number of different tests and a number of different students, so each student may have zero or more scores, and each test may have zero or more results. For one test (let's say Test A) I have 30 scores, distributed across the possible range (1 to 9). For a second collection point I have a similar result set; the scores are distributed differently, but still across the same range.

So, I know I can calculate the means and look at those, but that doesn't tell me a huge amount. Essentially I want (1) some way to compare one collection to another, and (2) a way to quantify the difference as a single number, so that I could say something like "the improvement between collections 1 and 2 was twice as large as that between collections 2 and 3". I've read a bit about effect size, but I'm not sure whether it will suit my purpose.
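
From what I've read, the usual effect-size measure for two samples is Cohen's d, the difference in means expressed in pooled-standard-deviation units. A minimal sketch of what I understand, with made-up score vectors purely for illustration:

# two hypothetical collections of scores on the 1-9 scale (illustration only)
scores1 <- c(3, 5, 4, 6, 7, 2, 5, 6, 4, 5)
scores2 <- c(5, 6, 6, 7, 8, 4, 6, 7, 5, 6)

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(y) - mean(x)) / pooled_sd
}

cohens_d(scores1, scores2)  # Cohen's rough guideline: ~0.2 small, ~0.5 medium, ~0.8 large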

Additionally, each test conforms to a different range and scale; I have found an algorithm that works for me to 'normalize' the scores into a common base. Provided the algorithm is perfectly weighted (for argument's sake), would it be safe to use data from multiple different tests all together as one 'collection' of scores?
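
The normalization itself is not really my question, but for concreteness, the kind of thing I mean is a simple rescaling of each test's raw scores onto a shared 1-9 range (this is just an illustration, not necessarily the algorithm I am actually using):

# illustration only: map raw scores from a test's own range onto a common 1-9 scale
rescale_scores <- function(raw, test_min, test_max, common_min = 1, common_max = 9) {
  common_min + (raw - test_min) / (test_max - test_min) * (common_max - common_min)
}

# e.g. a test scored 0-100, mapped onto the 1-9 scale
rescale_scores(c(40, 55, 90), test_min = 0, test_max = 100)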

To sum it all up, I suppose I am looking for any tools that would be useful in my endeavor to quantify and compare data sets of test scores.

Best Answer

I would fit a linear model, run an ANOVA, and then follow up with Tukey's HSD procedure for the pairwise comparisons. In R:

# sample data
> t <- structure(list(test = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 
     2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("T1", "T2", "T3"), class = "factor"), 
     student = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 5L, 
     4L, 5L, 3L, 6L, 6L, 6L), .Label = c("S1", "S2", "S3", "S4", 
     "S5", "S6"), class = "factor"), score = c(8L, 6L, 9L, 8L, 
     3L, 5L, 5L, 9L, 1L, 9L, 3L, 1L, 9L, 5L, 3L)), .Names = c("test", 
     "student", "score"), class = "data.frame", row.names = c(NA, -15L))

# fit the model and run ANOVA
> m <- aov(score ~ test + student, data = t)
> summary(m)
            Df Sum Sq Mean Sq F value Pr(>F)  
test         2  41.20  20.600   6.341 0.0268 *
student      5  57.66  11.532   3.550 0.0644 .
Residuals    7  22.74   3.249                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is a significant difference among tests (p = 0.02683) and an almost-significant difference among students (p = 0.06444). Now, which tests differ, and by how much? Run Tukey's HSD procedure:

> hsd <- TukeyHSD(m, which = "test", ordered = TRUE)  # can also compare students with which = "student"
> hsd
   Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = score ~ test + student, data = t)

$test
      diff        lwr       upr     p adj
T2-T3  1.4 -1.9571998 4.7571998 0.4752507
T1-T3  4.0  0.6428002 7.3571998 0.0235300
T1-T2  2.6 -0.7571998 5.9571998 0.1245249

The difference between T1 and T3 is statistically significant (p = 0.02353): the estimated difference is 4.0, with a 95% confidence interval of [0.6428002, 7.3571998]. Between the other pairs of tests there is no significant difference.
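
Since you mentioned effect size: one simple measure you can read off the same fit is eta-squared, the proportion of total variance attributable to each factor, computed from the sums of squares in the ANOVA table above (just one possible effect-size choice):

> ss <- summary(m)[[1]][["Sum Sq"]]   # 41.20 (test), 57.66 (student), 22.74 (residuals)
> ss[1] / sum(ss)                     # eta-squared for test: ~0.34
> ss[2] / sum(ss)                     # eta-squared for student: ~0.47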

You can even plot pairwise test comparisons:

> plot(hsd)