Wilcoxon signed rank tests seem appropriate since they reflect the fact that measures are taken repeatedly on the same subjects (which increases the test's power to detect real effects) but are modest enough to recognize that grades on writing assignments are only ordinal, not interval, variables (i.e. the difference between a B+ and an A could be much smaller than the difference between an A and an A+, etc.).
Running Wilcoxon signed rank tests assumes that, even if the steps between grades are not uniform, you as a grader can tell, for any two students' changes, which change is the bigger of the two. If you cannot do that, you cannot rank the changes, and without a ranking there can be no rank test. You would then be limited to a sign test: basically just counting how many students improved, how many stayed put and how many deteriorated. If there are many more improvements than deteriorations, your test will be significant. Such a test is obviously less powerful since it has no notion of how large any of the improvements were. I do not think you need to fall back on it: as long as you can establish a ranking of the improvements, you don't.
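To make the difference between counting signs and ranking changes concrete, here is a minimal sketch of both tests in plain Python. It assumes the grades have been coded as integers (my assumption, not something from your data) and that differences of those codes order the changes correctly. The function names are made up for illustration, and the exact Wilcoxon version enumerates every sign pattern, so it only suits small samples; in practice you would use a statistics package.

```python
from itertools import product
from math import comb


def sign_test(before, after):
    """Two-sided exact sign test on paired scores; unchanged pairs are dropped."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    improved = sum(d > 0 for d in diffs)
    k = min(improved, n - improved)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Two-sided exact Wilcoxon signed rank test (small samples only).

    Under the null hypothesis every nonzero change is equally likely to be
    positive or negative, so all 2**n sign patterns are enumerated.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = midranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    totals = [sum(r for r, s in zip(ranks, signs) if s)
              for signs in product((0, 1), repeat=len(ranks))]
    lower = sum(t <= w_plus for t in totals) / len(totals)
    upper = sum(t >= w_plus for t in totals) / len(totals)
    return min(1.0, 2 * min(lower, upper))


# Hypothetical scores for 8 students, purely for illustration
before = [3, 4, 2, 5, 3, 4, 2, 3]
after = [5, 5, 4, 5, 4, 7, 3, 2]
print(sign_test(before, after), wilcoxon_signed_rank(before, after))
```

On this made-up data the signed rank test gives a smaller p-value than the sign test, because the single deterioration is also one of the smallest changes, and only the rank test can see that.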
If, on the other hand, your literacy score is something that can be counted objectively, for example the number of mistakes per 100 words (I'm no expert in the field, but I believe you see what I mean by objective), then you can even use paired t-tests. They will have higher statistical power to detect real effects.
When used in the right conditions as described above, the power of the tests to detect existing changes compares as follows:
$$\text{t-test} \geq \text{Wilcoxon signed rank test} \geq \text{sign test}$$
In any of the three options, use two-sided tests, since the possibility that a training session might have deteriorated performance is real. That is what everybody does. Doctors also hope their medication works better than a placebo, but they use two-sided tests because it might be even worse than doing nothing. (One-sided tests would just make your $\alpha$ level less stringent and are frowned upon.)
Just to be sure: these 3 school classes only exist so that you can get a big enough sample size? You have not chosen the three classes to purposefully represent, for example, one posh private school, one average school and one underprivileged school? If you did choose them purposefully, you will need a more complex statistical methodology to include that information in your analysis as well.
Now to the most important part: The preceding caveats and options notwithstanding, you still need to control your p-value cutoffs for multiple testing. It is very important not to confuse two concepts here:
- your tests are paired on the student level since you observe the same student multiple times (as opposed to observing a different class of 90 students each time after one of your training intervals)
- your tests are pairwise since you compare multiple intermediary situations (as opposed to only comparing the before-after states)
Being paired is taken care of by signed rank tests (or paired t-tests or sign tests); being pairwise requires the following additional precautions:
When the null hypothesis is true and there is no real effect, you still have a chance of a false discovery equal to the $\alpha$ cutoff that you compare your p-value to. That is true for every test, so if you do enough tests, you are bound to find some significant results that are false. You need to correct for this inflated chance of false discovery. The easiest way is the Bonferroni correction: just divide your $\alpha$ level by the number of tests. For example, because you perform 9 tests, you would compare all your p-values against a cutoff of $\alpha/9 = 5\%/9 \approx 0.56\%$ instead of the usual $5\%$ cutoff for a single test. You can see that this is quite a stringent restriction. Holm's method is a little less stringent while still not inflating the chance of false discoveries, so it is preferable for that reason.
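As a rough sketch of how the two corrections differ, the following plain Python compares each p-value against the Bonferroni cutoff and against Holm's step-down cutoffs. The function names and the nine p-values are made up for illustration; they are not results from your study.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a null hypothesis when its p-value is at most alpha / m."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure; returns decisions in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # Smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] > alpha / (m - step):
            break  # first failure: this and all larger p-values are not rejected
        reject[idx] = True
    return reject


# Hypothetical p-values for 9 pairwise comparisons
p_values = [0.001, 0.006, 0.007, 0.03, 0.04, 0.2, 0.35, 0.5, 0.8]
print(sum(bonferroni_reject(p_values)), "rejected with Bonferroni")
print(sum(holm_reject(p_values)), "rejected with Holm")
```

With these invented numbers Bonferroni rejects only the smallest p-value, while Holm also rejects the next two, because each later comparison faces a slightly looser cutoff ($\alpha/8$, then $\alpha/7$).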
Practically speaking, are you sure your results are actionable? If you find out for example that the first two training intervals didn't help, but the third and fourth did help, then the fifth and sixth did active harm, the seventh was neutral again and the last two helped, can you translate such mixed results into actionable recommendations? Recommending to skip the intervals 5 and 6 and to put more emphasis on intervals 3 and 4 is only actionable if each interval did something different with the students. If you can explain based on some theory (not only the statistical data) why it could be that some training sessions helped but others didn't, that's insightful. If all the intervals were supposed to do more of the same but didn't, this will be hard to make sense of.
Also, even if you do the 9 intermediary comparisons, you can still do a before-after comparison as well. You just need to adjust your cutoffs for one more test (10 instead of 9, so a Bonferroni cutoff of $\alpha/10 = 0.5\%$).
Best Answer
If your data are normally distributed (you can check this in a number of ways, including a QQ plot), then it is fine to run a t-test. But in order to make the fewest assumptions about the data, it is best to use the non-parametric Wilcoxon signed rank test.
Because you have a small sample (24), I would advise going the Wilcoxon signed rank route. I would also read through this question thoroughly, because it appears to answer many of the relevant questions.
Be sure to understand exactly how the type I error and the power behave in your test.