You can use a t-test to assess whether there are differences in the means. The unequal sample sizes don't cause a problem for the t-test, and don't require the results to be interpreted with any extra care. Ultimately, you can even compare a single observation to an infinite population with a known distribution, mean, and SD; for example, someone with an IQ of 130 is smarter than 97.7% of people (taking IQ to be normally distributed with mean 100 and SD 15). One thing to note, though, is that for a given $N$ (i.e., *total* sample size), power is maximized when the group $n$'s are equal; with highly unequal group sizes, each additional observation buys you less additional resolution.
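That single-observation comparison is just a normal-CDF lookup. Here is a minimal sketch (in Python rather than R, purely for illustration; `normal_cdf` is a hypothetical helper built from the error function):

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """Normal CDF computed from the error function."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# On the usual IQ scale (mean 100, SD 15), a score of 130 is z = 2,
# which exceeds about 97.7% of the population.
print(round(normal_cdf(130, mean=100, sd=15), 3))  # 0.977
```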

To clarify my point about power, here is a very simple simulation written for R:

```
set.seed(9) # this makes the simulation exactly reproducible
power5050 = vector(length=10000) # these will store the p-values from each
power7525 = vector(length=10000) # simulated test to keep track of how many
power9010 = vector(length=10000) # are 'significant'
for(i in 1:10000){  # I run the following procedure 10k times
  n1a = rnorm(50, mean=0,  sd=1)  # I'm drawing 2 samples of size 50 from 2 normal
  n2a = rnorm(50, mean=.5, sd=1)  # distributions w/ dif means, but equal SDs
  n1b = rnorm(75, mean=0,  sd=1)  # this version has group sizes of 75 & 25
  n2b = rnorm(25, mean=.5, sd=1)
  n1c = rnorm(90, mean=0,  sd=1)  # this one has 90 & 10
  n2c = rnorm(10, mean=.5, sd=1)
  power5050[i] = t.test(n1a, n2a, var.equal=T)$p.value  # here t-tests are run &
  power7525[i] = t.test(n1b, n2b, var.equal=T)$p.value  # the p-values are stored
  power9010[i] = t.test(n1c, n2c, var.equal=T)$p.value  # for each version
}
mean(power5050<.05) # this code counts how many of the p-values for
[1] 0.7019 # each of the versions are less than .05 &
mean(power7525<.05) # divides the number by 10k to compute the %
[1] 0.5648 # of times the results were 'significant'. That
mean(power9010<.05) # gives an estimate of the power
[1] 0.3261
```

Notice that in all cases $N=100$, but that in the first case $n_1=50$ & $n_2=50$, in the second case $n_1=75$ & $n_2=25$, and in the last case $n_1=90$ and $n_2=10$. Note further that the standardized mean difference / data generating process was the same in all cases. However, whereas the test was 'significant' 70% of the time for the 50-50 sample, power was 56% with 75-25 and only 33% when the group sizes were 90-10.
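The simulation result also follows from the standard error of the difference in means: with a common SD $\sigma$, $SE = \sigma\sqrt{1/n_1 + 1/n_2}$, which for fixed $N = n_1 + n_2$ is smallest when $n_1 = n_2$. A quick check (in Python rather than R, just for illustration; `se_of_diff` is simply that formula):

```python
from math import sqrt

def se_of_diff(n1, n2, sd=1.0):
    """Standard error of the difference between two group means."""
    return sd * sqrt(1 / n1 + 1 / n2)

for n1, n2 in [(50, 50), (75, 25), (90, 10)]:
    print(n1, n2, round(se_of_diff(n1, n2), 3))
# SEs: 0.2, 0.231, 0.333 — the SE grows as the split becomes more
# unequal, so power falls, matching the simulation's ordering.
```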

I think of this by analogy. If you want to know the area of a rectangle, and the perimeter is fixed, then the area will be maximized if the length and width are equal (i.e., if the rectangle is a *square*). On the other hand, as the length and width diverge (as the rectangle becomes elongated), the area shrinks.

Plain and simple: Include the time information when plotting your data and calculating your slope.

Right now, you use the following data for this:

```
1 150446.5
2 150488.5
3 150530.5
4 150613.5
5 150613.5
```

What you should use instead, however, is:

```
1.5 150446.5
3.5 150488.5
5.5 150530.5
7.5 150613.5
9.5 150613.5
```

Note how the time value (the first column) in each row is the average of the times of the observations that went into calculating the respective mean.
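To see how the recoding changes the fitted line, here is a small sketch (in Python, purely for illustration; `ols_slope` is the ordinary least-squares slope, and the `y` values are the five means from above). Because the bin midpoints are spaced twice as far apart as the indices 1–5, the estimated slope is exactly halved:

```python
def ols_slope(x, y):
    """Ordinary least-squares slope: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

y = [150446.5, 150488.5, 150530.5, 150613.5, 150613.5]
print(ols_slope([1, 2, 3, 4, 5], y))           # index coding: 45.9
print(ols_slope([1.5, 3.5, 5.5, 7.5, 9.5], y)) # bin midpoints: 22.95
```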

## Best Answer

This is largely an issue for you to decide based on your theoretical assumptions about the data and what lies behind them. When you calculate an arithmetic average, you are assuming that the intervals are reasonably similar. (That is, you are implicitly stating that $3-2 = 2-1$ and $3-1 = 2\times (3-2)$.) If you believe that is a reasonable assumption, and others in your field (e.g., reviewers) are likely to agree with you, then it's fine. Using means with ordinal data tends to be more defensible under some conditions than others; it isn't clear to me that those conditions hold in your case, but that is for you to decide.

You should also think hard about what you mean by '"mainly" rated as 2'. Again, that is for you to decide. However, I would not think of the set of ratings $\{1,1,2,3,3\}$ as "mainly" being $2$, despite the fact that the mean is $2$. I would interpret that as a somewhat polarizing word, with some thinking it's 'easy' and some thinking it's 'hard'. But again, this is a theoretical issue for you to decide.

For what it's worth (almost certainly very little), if it were me, I would think your ratings were not amenable to being described by means. I would interpret '"mainly" rated as 2' as

*the majority of raters gave this word a 2*. That is, I would select words that received $>50\%\ \rm ``2\!"$s.

By contrast, I suspect that you don't only want to select individual words '"mainly" rated as 2', but also want the entire set of selected words to be rated $\approx 2$. To check that aspect, I would feel more comfortable using the mean of all the ratings for all the selected words (or the mean of the words' means). At this point, you are averaging over many more ratings, and I think the mean would be more defensible.