Q1. 4 or 5 point scale (strongly disagree to strongly agree with or without a neutral midpoint)
A1. Whether to use an even or odd number of scale points is not a question with a definitive answer; there are arguments on both sides. Since you want a yes-no answer, a 4-point scale may be better suited to your purpose than a scale with a neutral midpoint.
Parenthetically, I would suggest that a convincing evaluation would additionally include a detailed evaluation of the actual content of each of the final exam questions, such as: "Do you think a minimally competent nurse would answer this question incorrectly?" (This is the type of approach developed by Angoff at ETS many years ago; see this secondary source, for example.) Global opinions are open to halo effects and other types of bias.
Q2a. Should the reliability and convergent and discriminant validity be evaluated before the scale is used?
A2a. Inter-rater agreement can be evaluated after the data are collected. If agreement turns out to be low, the reasons can be probed; and if the survey has to be repeated because of low agreement, that is not a high-cost undertaking. Convergent and discriminant validity are evaluated based on correlations. In your case, however, those correlations may be driven more by how hard individual nursing students studied, how bright they are, and so on, than by the design of the survey.
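As a concrete illustration of the inter-rater check mentioned above, Cohen's kappa corrects raw agreement for agreement expected by chance. This is a minimal stdlib sketch with invented ratings on a 4-point scale; the data and function name are not from the original answer.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal frequencies."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings from two raters on a 4-point agree/disagree scale
a = [1, 2, 2, 3, 4, 4, 1, 2]
b = [1, 2, 3, 3, 4, 4, 2, 2]
print(round(cohens_kappa(a, b), 3))  # 0.66 for these invented ratings
```

Values of kappa well below roughly 0.6 would be one signal that the reasons for disagreement should be probed, as suggested above.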
Q2b. Should the criterion validity of the scale be evaluated before the scale is used?
A2b. Criterion validity requires a criterion, and you do not have one (or you would not be undertaking this survey-type evaluation). The best that is usually done in this type of undertaking is content validity. (See, for example, the AERA/APA Standards, soon to be revised.)
Q3. Designing the questionnaire around the 21 statements of your national regulatory board.
A3. This is a great idea since it builds on an accepted statement of standards. Do you have alternatives to suggest?
While it is true that, for the same set of students, the mean pre-post difference will be the same under methods one and two, for statistical inference, all else being equal, the paired method is preferable. A common way to express this is to say that subjects "serve as their own controls": pairing removes the between-subject portion of the variance in scores, which lets the factor of interest show up more clearly. Intuitively, validity also seems higher in the paired method, because the two sets of scores are more plainly comparable than they would be if the pre- and post-groups were allowed (potentially) to comprise different subsets of students.
With regard to significance testing, the paired method can be expected to have higher power to detect a significant pre-post difference. How much higher depends on the strength of the positive correlation between the two sets of scores. Power will also depend, as in method one, on sample size; variability within each set of scores; reliability of each indicator; alpha; and whether the test is one- or two-tailed.
By the way, if power is a main concern, then using classic scale-development methods to combine the 5 indicators into a smaller number of more reliable ones should also increase power. It is possible, though, that doing so would compromise validity and the interpretability of the findings, so it is by no means an automatic decision.
What I would do:
Check whether this correlates with the teachers' answers.
EDIT:
With your dataset it will be impossible to ensure that your conclusions are valid for the population of teachers, because you cannot assume that three teachers are representative of the whole population. However, if you have more than 20 questions per sheet, it might still be possible to evaluate whether a correlation exists between the answers of these particular teachers and those of the students.
EDIT: In my experience, one should not take all of the students' and teachers' answers on evaluation sheets too seriously. Questions are interpreted, and the answers depend strongly on the current mood of the person. Therefore, it might be wise to evaluate, in a second "interpretation" pass over the data, only those questions that elicited extreme answers (bad and good).