Q1. 4 or 5 point scale (strongly disagree to strongly agree with or without a neutral midpoint)
A1. Whether to use an even or odd number of scale points is not a matter with a definitive answer; there are arguments on both sides. Since you want a yes-no answer, the 4-point scale may be better suited to your purpose than a scale with a neutral midpoint.
Parenthetically, I would suggest that a convincing evaluation would additionally include a detailed review of the actual content of each final exam question, asking, for example: "Do you think a minimally competent nurse would answer this question incorrectly?" (This is the type of approach developed by Angoff at ETS many years ago; see this secondary source, for example.) Global opinions are open to halo effects and other types of bias.
Q2a. Should the reliability and convergent and discriminant validity be evaluated before the scale is used?
A2a. Inter-rater agreement might be evaluated after the data is collected. If inter-rater agreement is low, the reasons can be probed. If it turns out the survey has to be repeated due to low agreement, that is not a high-cost undertaking. Convergent and discriminant validity are evaluated based on correlations. However, in your case, these correlations may be driven more by how hard individual nursing students studied, how bright they are, and so on, than by the design of the survey.
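As one illustration of checking agreement after the fact, a chance-corrected statistic such as Cohen's kappa can be computed for two raters. This is a minimal sketch; the two rating vectors below are hypothetical, not from any real survey:

```python
# Minimal sketch of checking inter-rater agreement after data collection.
# The rating vectors are hypothetical 4-point responses (1 = strongly
# disagree ... 4 = strongly agree).
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters were independent, from their marginals
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / n**2
    return (observed - expected) / (1 - expected)

a = [1, 2, 2, 3, 4, 4, 3, 2, 1, 4]
b = [1, 2, 3, 3, 4, 3, 3, 2, 2, 4]
print(round(cohen_kappa(a, b), 3))  # 0.6 for these hypothetical ratings
```

Values near 1 indicate strong agreement beyond chance; values near 0 indicate agreement no better than chance. For more than two raters, statistics such as Fleiss' kappa are the usual extension.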
Q2b. Should the criterion validity of the scale be evaluated before the scale is used?
A2b. Criterion validity requires a criterion, and you do not have one (or you would not be undertaking this survey-type evaluation). The best that is usually done in this type of undertaking is content validity. (See, for example, the AERA/APA Standards, soon to be revised.)
Q3. Designing the questionnaire around the 21 statements of your national regulatory board.
A3. This is a great idea since it builds on an accepted statement of standards. Do you have alternatives to suggest?
What about one of Kendall's $\tau$s? They are rank correlation coefficients suitable for ordinal data.
Here's an example with Stata and $\tau_{b}$. A value of $-1$ implies perfect negative association, $+1$ perfect positive association, and $0$ the absence of association. Here we see a modest, though significant, negative association between speed limits and accident rates.
. webuse hiway, clear
(Minnesota Highway Data, 1973)

. tab spdlimit rate, taub

           |    Accident rate per million
     Speed |          vehicle miles
     Limit |   Below 4        4-7    Above 7 |     Total
-----------+---------------------------------+----------
        40 |         1          0          0 |         1
        45 |         1          1          1 |         3
        50 |         1          4          2 |         7
        55 |        10          4          1 |        15
        60 |         9          2          0 |        11
        65 |         1          0          0 |         1
        70 |         1          0          0 |         1
-----------+---------------------------------+----------
     Total |        24         11          4 |        39

          Kendall's tau-b =  -0.4026  ASE = 0.116
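If you want to check this figure outside Stata, the same $\tau_b$ can be reproduced in Python with SciPy. A sketch, re-entering the contingency table above by hand and expanding it to one row per highway:

```python
# Reproducing Stata's tau-b with SciPy on the hiway contingency table.
from scipy.stats import kendalltau

table = {  # speed limit -> counts in rate categories (Below 4, 4-7, Above 7)
    40: (1, 0, 0), 45: (1, 1, 1), 50: (1, 4, 2),
    55: (10, 4, 1), 60: (9, 2, 0), 65: (1, 0, 0), 70: (1, 0, 0),
}
spdlimit, rate = [], []
for limit, counts in table.items():
    for cat, k in enumerate(counts, start=1):
        spdlimit += [limit] * k   # one entry per highway
        rate += [cat] * k

tau, p = kendalltau(spdlimit, rate, variant="b")
print(round(tau, 4))  # -0.4026, matching the Stata output
```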
You can also try an asymmetric modification of $\tau_{b}$ that corrects only for ties on the independent variable. This is called Somers' D:
. somersd rate spdlimit

Somers' D with variable: rate
Transformation: Untransformed
Valid observations: 39

Symmetric 95% CI

------------------------------------------------------------------------------
             |              Jackknife
        rate |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    spdlimit |  -.4727723   .1395719    -3.39   0.001    -.7463282   -.1992163
------------------------------------------------------------------------------
All these measures of association are related in that they classify all pairs of observations (highways in our example) as concordant or discordant. A pair is concordant if the observation with the larger value of variable $X$ (speed limit) also has the larger value of variable $Y$ (accident rate). There are more of these measures than you can shake a stick at: one more is Goodman and Kruskal's $\gamma$, which drops tied pairs altogether, while $\tau_{a}$ includes them but applies no ties correction. They will generally yield similar conclusions, even if they are not directly comparable.
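The pair-counting logic can be made explicit. A sketch over the same 39 highways, deriving $\tau_a$, $\tau_b$, $\gamma$, and Somers' D from the counts of concordant, discordant, and tied pairs:

```python
# Classifying all n(n-1)/2 pairs of highways as concordant, discordant,
# or tied, then deriving the various measures from the same counts.
from itertools import combinations
from math import sqrt

data = []  # (spdlimit, rate category) pairs, expanded from the table above
for limit, counts in {40: (1, 0, 0), 45: (1, 1, 1), 50: (1, 4, 2),
                      55: (10, 4, 1), 60: (9, 2, 0), 65: (1, 0, 0),
                      70: (1, 0, 0)}.items():
    for cat, k in enumerate(counts, start=1):
        data += [(limit, cat)] * k

C = D = tie_x = tie_y = 0
for (x1, y1), (x2, y2) in combinations(data, 2):
    if x1 == x2:
        tie_x += 1                         # pair tied on speed limit
    if y1 == y2:
        tie_y += 1                         # pair tied on rate category
    if x1 != x2 and y1 != y2:
        if (x1 - x2) * (y1 - y2) > 0:
            C += 1                         # concordant
        else:
            D += 1                         # discordant

n0 = len(data) * (len(data) - 1) // 2      # all 741 pairs
tau_a = (C - D) / n0                                   # no ties correction
tau_b = (C - D) / sqrt((n0 - tie_x) * (n0 - tie_y))    # corrects for both
gamma = (C - D) / (C + D)                              # drops tied pairs
somers = (C - D) / (n0 - tie_y)                        # corrects for rate ties only
print(round(tau_b, 4), round(somers, 4))  # -0.4026 -0.4728, as in the Stata output
```

Every statistic here has the same numerator, $C - D$; they differ only in how the denominator treats tied pairs.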
The results above are qualitatively in line with Spearman's rank correlation coefficient mentioned by Greg (which tends to be larger in absolute value than $\tau$):
. ci2 rate spdlimit, spearman
Confidence interval for Spearman's rank correlation
of rate and spdlimit, based on Fisher's transformation.
Correlation = -0.451 on 39 observations (95% CI: -0.671 to -0.158)
This measure does not classify pairs; instead it compares the orderings you would get by ranking the observations on each variable separately (Stata breaks ties by assigning the average rank, and the statistic is just the Pearson correlation of the ranks). This makes it somewhat faster to compute, since you don't have to consider all $\frac{n \cdot (n-1)}{2}$ pairs. On the other hand, the sampling distribution of $\tau$ approaches normality much faster, so if you plan to do inference, that measure may be better. $\tau_b$ is the most common variant.
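As a cross-check, SciPy's `spearmanr` (which likewise assigns average ranks to ties) reproduces the Stata figure on the same re-entered data:

```python
# Spearman's rho on the same 39 highways: average ranks for ties,
# then Pearson correlation on the ranks.
from scipy.stats import spearmanr

table = {40: (1, 0, 0), 45: (1, 1, 1), 50: (1, 4, 2),
         55: (10, 4, 1), 60: (9, 2, 0), 65: (1, 0, 0), 70: (1, 0, 0)}
spdlimit, rate = [], []
for limit, counts in table.items():
    for cat, k in enumerate(counts, start=1):
        spdlimit += [limit] * k
        rate += [cat] * k

rho, p = spearmanr(spdlimit, rate)
print(round(rho, 3))  # -0.451, matching the ci2 output
```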
Best Answer
It depends on how much data you have, how much tolerance you have for complexity, and how much interest in accuracy. Some will say that treating everything as continuous is A-OK (generally also assuming normal distributions if correlations are to be interpreted substantively or tested against a null hypothesis), but others insist this is improper, and the latter group is the more technically correct. Since you say this is telephone survey data, it seems plausible that you have hundreds or even thousands of observations. If so, you may have sufficient statistical power to do things the "right" way.
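To see why the purists have a point, here is a purely illustrative simulation (the correlation, cutpoints, and sample size are invented): coarsening a normally distributed trait into a 5-point Likert-style variable attenuates its Pearson correlation with another variable, mildly when the cutpoints are symmetric and substantially when responses pile up in a few categories.

```python
# Illustrative simulation: discretizing a continuous trait into Likert
# categories attenuates Pearson correlations with other variables.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
latent = rng.standard_normal(n)                       # the "true" trait
other = 0.6 * latent + 0.8 * rng.standard_normal(n)   # true correlation 0.6

# 5-point scale with symmetric (arbitrary) cutpoints: categories 1..5
likert = np.digitize(latent, [-1.5, -0.5, 0.5, 1.5]) + 1
# Skewed cutpoints, so most respondents land in the bottom category
skewed = np.digitize(latent, [0.5, 1.0, 1.5, 2.0]) + 1

r_latent = np.corrcoef(latent, other)[0, 1]   # ~0.60
r_likert = np.corrcoef(likert, other)[0, 1]   # mildly attenuated
r_skew = np.corrcoef(skewed, other)[0, 1]     # strongly attenuated
print(round(r_latent, 3), round(r_likert, 3), round(r_skew, 3))
```

With symmetric cutpoints the damage is small, which is why treating Likert sums as continuous often "works"; with skewed response distributions, the pretence becomes costly.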
The right way to begin is by developing a measurement model for your latent variable. Five statements rated on a common five-point Likert scale is just barely enough to satisfy the more lenient rules of thumb for deciding whether one can get away with applying classical test theory (CTT) assumptions, for example that the ratings behave like continuous, normally distributed measures and that all items reflect the latent variable equally well. Where those assumptions do not hold, CTT can fail.
If you have at least a couple hundred observations, some interest in the statistical process, and want to improve the validity and richness of your results, try fitting a rating scale model to your five items. This is an item response theory model that assumes ratings of all items use the same scale (hence it uses the same threshold estimates for all items) and are influenced by the same latent variable (which is probably all you can estimate with five items). It can be used to generate a continuously distributed estimate of the latent variable that accommodates the ordinal nature of Likert ratings, uses only the common variance in your items (thereby excluding any item-specific measurement error), and weighs the items according to how much each has in common with the others.
You can produce factor scores for individuals using a rating scale model and then correlate those with your other variable, or you can fit a structural equation model that estimates the correlation as well as all of the items' thresholds, loadings, and unique variances, and the entire model's goodness of fit. For more info and some other alternatives, see "Factor analysis of questionnaires composed of Likert items" and "Regression testing after dimension reduction". The best choice will depend on the nature of the other variable you want to correlate your latent construct with, which, as far as I can tell, you haven't specified...and again, on how large your sample is. Complex models take more data because they estimate more parameters, but they can provide more valid, precise estimates and tell you much more about your data. The worst-case scenario would probably be having data that violate CTT assumptions and too little of it to fit the appropriate model, so at least check that this doesn't apply.