I think there are several challenges to consider.
For visualization, the most faithful choice would be a mosaic plot or a stacked bar plot. The two are practically the same in this case, but a stacked bar plot is probably easier to find in Excel or SPSS than a mosaic plot.
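As a rough sketch of what a stacked bar plot would display, the snippet below cross-tabulates the two questions and converts each category's counts into shares. The responses, the category names A-D, and the helper `stacked_proportions` are all made up for illustration.

```python
from collections import Counter

# Hypothetical paired responses: (category on question 2, Likert answer on question 1).
responses = [
    ("A", "agree"), ("A", "strongly agree"), ("A", "agree"),
    ("B", "disagree"), ("B", "neutral"), ("B", "agree"),
    ("C", "strongly disagree"), ("C", "disagree"),
    ("D", "neutral"), ("D", "neutral"), ("D", "strongly agree"),
]
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]


def stacked_proportions(pairs, levels):
    """Return {category: [share of each Likert level]}; each list sums to 1."""
    counts = Counter(pairs)
    categories = sorted({cat for cat, _ in pairs})
    table = {}
    for cat in categories:
        row = [counts[(cat, lvl)] for lvl in levels]
        total = sum(row)
        table[cat] = [n / total for n in row]
    return table


table = stacked_proportions(responses, LEVELS)
for cat, shares in table.items():
    # Each printed row corresponds to one bar; each number is one stacked segment.
    print(cat, [round(s, 2) for s in shares])
```

Each row is one bar and each share is one segment of that bar, so the same table can be handed to whatever charting tool is available.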
It might also help to recode the Likert scale as numbers (1-5) and draw a boxplot for each of the 4 categories of your second question. Since boxplots are based on percentiles, their meaning stays reasonably consistent with ordinal data of this kind, although that depends on how the quantiles are calculated when they fall at midpoints.
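To illustrate the point about quantile calculation, here is a minimal sketch that recodes Likert labels to 1-5 and computes the five numbers a boxplot draws, using two of the quantile methods offered by Python's `statistics.quantiles`. The labels in `group` are invented, and `five_number_summary` is a hypothetical helper.

```python
import statistics

LIKERT_TO_SCORE = {
    "strongly disagree": 1, "disagree": 2, "neutral": 3,
    "agree": 4, "strongly agree": 5,
}


def five_number_summary(labels, method="inclusive"):
    """Return (min, Q1, median, Q3, max) of the recoded scores."""
    scores = sorted(LIKERT_TO_SCORE[label] for label in labels)
    # n=4 asks for the three quartile cut points.
    q1, median, q3 = statistics.quantiles(scores, n=4, method=method)
    return scores[0], q1, median, q3, scores[-1]


# Hypothetical answers from one of the four categories.
group = ["disagree", "neutral", "neutral", "agree", "agree", "strongly agree"]

inclusive = five_number_summary(group, "inclusive")
exclusive = five_number_summary(group, "exclusive")
print("inclusive:", inclusive)
print("exclusive:", exclusive)
```

With so few distinct values, the two methods can place the box edges differently, which is worth keeping in mind when comparing boxplots produced by different software.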
In terms of analysis, there are different questions you can ask. The simplest is "is there a correlation between the two?", which can easily be answered by computing the Pearson correlation on the ranks of the numerical values of your scales. This is exactly the Spearman correlation measure (the correlation of the ranks). How you rank matters when there are ties: tied values should each receive the average of the ranks they span (for example, the vector 1,2,2,4 should become 1,2.5,2.5,4).
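The relationship between the two correlations can be sketched in a few lines of plain Python: rank each variable with average ranks for ties, then apply the ordinary Pearson formula to the ranks. The function names here are illustrative, not taken from any particular package.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation computed as Pearson on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(average_ranks([1, 2, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

Because only the ranks enter the calculation, any monotone recoding of the Likert labels gives the same result.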
The Wilcoxon test is relevant if you want to ask whether the ranks of one measure differ from those of the other. From your question, though, that does not sound like an interesting question. You can also use the chi-square test for a similar question, but its power will probably be lower.
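For reference, the Pearson chi-square statistic for a cross-tabulation of the two questions can be sketched as below. It treats the categories as unordered, which is one plausible reason for the lower power with ordinal scales. The function name is hypothetical, and the sketch stops at the statistic and degrees of freedom rather than computing a p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Assumes every row and every column has at least one observation,
    so that no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows are categories of one question, columns are Likert levels.
counts = [
    [5, 8, 12, 6],
    [9, 7, 4, 3],
]
stat, dof = chi_square_statistic(counts)
print(stat, dof)
```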
Q1. 4 or 5 point scale (strongly disagree to strongly agree with or without a neutral midpoint)
A1. I think the use of even or odd number of scale points is not a matter that has a definitive answer. There are arguments on both sides of this question. Since you want a yes-no answer, the 4 point scale may be better suited to your purpose than a scale with a neutral mid-point.
Parenthetically, I would suggest that a convincing evaluation would additionally include a detailed evaluation of the actual content of each of the final exam questions, such as: "Do you think a minimally competent nurse would answer this question incorrectly?" (This is the type of approach developed by Angoff at ETS many years ago. See this secondary source, for example.) Global opinions are open to halo and other types of bias.
Q2a. Should the reliability and convergent and discriminant validity be evaluated before the scale is used?
A2a. Inter-rater agreement might be evaluated after the data is collected. If inter-rater agreement is low, the reasons can be probed. If it turns out the survey has to be repeated due to low agreement, it is not a high cost undertaking. Convergent and discriminant validity are evaluated based on correlations. However, in your case, these correlations may be driven more by how hard individual nursing students studied, how bright they are, etc. than by the design of the survey.
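One common way to quantify agreement between two raters is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below is only one possible choice (it ignores the ordering of the scale points and handles exactly two raters), and the ratings shown are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    Assumes the raters do not both use a single identical category for
    every item (chance agreement would be 1 and kappa is then undefined).
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    # Chance agreement: product of the two raters' marginal proportions, summed.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six statements on a 4-point scale.
rater_1 = [1, 2, 2, 3, 4, 4]
rater_2 = [1, 2, 3, 3, 4, 3]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance.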
Q2b. Should the criterion validity of the scale be evaluated before the scale is used?
A2b. Criterion validity requires a criterion and you do not have one (or you would not be undertaking this survey type evaluation). The best that is usually done in this type of undertaking is content validity. (See for example, the AERA/APA Standards, soon to be revised.)
Q3. Designing the questionnaire around the 21 statements of your national regulatory board.
A3. This is a great idea since it builds on an accepted statement of standards. Do you have alternatives to suggest?
Best Answer
First, we'll need to know whether you are interested in the response to each Likert question or to a sum of Likert questions; if the latter, it matters how many questions and what the distribution of the scale looks like.
Either way, you will have to account for the nonindependence of the data, because the same people are answering the questions multiple times. Repeated measures ANOVA is one solution to this, but it makes unrealistic assumptions including sphericity, and would only be usable for the scale score, and only if the scores ranged fairly widely so that you could pretend they were continuous.
A better option is a mixed model. If you treat the scores as continuous data, this would be a linear mixed model; if you treat them as ordinal (as you would have to if you were interested in each question), you would need a nonlinear mixed model, for example an ordinal logistic (cumulative link) mixed model.
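Written out, the two kinds of model might look roughly as follows. This is a sketch under simple assumptions: a single random intercept per respondent and a fixed effect per question. The symbols are illustrative and not taken from the question.

```latex
% i indexes respondents, j indexes questions (or occasions), k indexes scale points.
\begin{align*}
\text{Linear mixed model:}\quad
  & y_{ij} = \beta_0 + \beta_j + u_i + \varepsilon_{ij},
  \qquad u_i \sim N(0, \sigma_u^2),\ \varepsilon_{ij} \sim N(0, \sigma^2) \\
\text{Cumulative link mixed model:}\quad
  & \operatorname{logit} P(y_{ij} \le k) = \theta_k - (\beta_j + u_i),
  \qquad k = 1, \dots, K - 1
\end{align*}
```

In both cases the respondent-level term $u_i$ is what accounts for the same person answering several questions.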
Unfortunately, these models are not simple to implement. If you currently know only about t-tests, then you may need to hire a consultant to help.