Your method #1 loses information by dichotomizing in two different ways. I'd instead look at each item's correlation with the sum of all the other items (in software such as SPSS this is called "Corrected Item-Total Correlation"). For #2, where you have done something close to this, you could make a case for either Spearman's or Pearson's, and they'll hardly differ since with a 1-5 per-item range there shouldn't be many extreme outliers. You'll have to establish your own threshold, I'm afraid: how exacting do you want to be? How desirable is it to preserve a large number of items for your scales? And how concerned are you about your case-to-item ratio?
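If it helps, here is a minimal sketch of the corrected item-total correlation in Python (the item names and data are invented for illustration; swap in your own responses):

```python
import numpy as np
import pandas as pd

# Hypothetical 1-5 Likert responses: 100 respondents x 8 items
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 6, size=(100, 8)),
                  columns=[f"item{i + 1}" for i in range(8)])

# Corrected item-total correlation: each item vs. the sum of the OTHER items
total = df.sum(axis=1)
citc = {col: df[col].corr(total - df[col]) for col in df.columns}
print(pd.Series(citc).round(2))
# With purely random data these hover near zero; a coherent scale should
# show clearly positive values. Pass method="spearman" to .corr() if you
# prefer rank correlations.
```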
As for your questions about factor analysis: yes, building the scales on empirical criteria can be defensible, just as building them from a priori ideas about which item belongs to which dimension can be. Good research will hopefully reconcile any conflicts between the two. Items with multiple high loadings are a problem if you want uncorrelated factors, something that is often unrealistic in opinion research. At a more general level, I think you already have some sense that factor analysis and scale development are best seen as a largely creative process, with many subjective decisions to be made and often much work to do in justifying them!
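To make the cross-loading issue concrete, here is a small simulated sketch (all numbers invented) using scikit-learn's FactorAnalysis with an orthogonal varimax rotation; the third item is deliberately built to draw on both latent traits, and it shows up with sizable loadings in both columns:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate two latent traits driving six Likert-like items (hypothetical data)
rng = np.random.default_rng(1)
n = 300
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
X = np.column_stack([
    f1 + rng.normal(scale=0.6, size=n),                    # items 1-2: trait 1
    f1 + rng.normal(scale=0.6, size=n),
    0.7 * f1 + 0.7 * f2 + rng.normal(scale=0.6, size=n),   # cross-loading item
    f2 + rng.normal(scale=0.6, size=n),                    # items 4-6: trait 2
    f2 + rng.normal(scale=0.6, size=n),
    f2 + rng.normal(scale=0.6, size=n),
])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(np.round(fa.components_.T, 2))  # rows = items, columns = factors
```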
Q1. 4- or 5-point scale (strongly disagree to strongly agree, with or without a neutral midpoint)?
A1. I think the choice between an even and an odd number of scale points is not a matter with a definitive answer; there are arguments on both sides. Since you want a yes-no answer, the 4-point scale may be better suited to your purpose than a scale with a neutral midpoint.
Parenthetically, I would suggest that a convincing evaluation would additionally include a detailed evaluation of the actual content of each of the final exam questions, such as: "Do you think a minimally competent nurse would answer this question incorrectly?" (This is the type of approach developed by Angoff at ETS many years ago; see this secondary source, for example.) Global opinions are open to halo and other types of bias.
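For concreteness, the (modified) Angoff procedure reduces to simple arithmetic: each judge estimates the probability that a minimally competent candidate answers each item correctly, and the cut score is the sum of the item-wise mean ratings. A toy sketch, with all ratings invented:

```python
import numpy as np

# Hypothetical Angoff ratings: rows = judges, columns = exam items;
# each entry is the judged probability that a minimally competent
# nurse answers that item correctly
ratings = np.array([
    [0.7, 0.9, 0.5, 0.8],  # judge 1
    [0.6, 0.8, 0.4, 0.9],  # judge 2
    [0.8, 0.9, 0.6, 0.7],  # judge 3
])

item_means = ratings.mean(axis=0)  # expected score per item
cut_score = item_means.sum()       # passing score on the raw-score scale
print(item_means.round(2), round(cut_score, 2))
```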
Q2a. Should the reliability and convergent and discriminant validity be evaluated before the scale is used?
A2a. Inter-rater agreement might be evaluated after the data is collected. If inter-rater agreement is low, the reasons can be probed, and if it turns out the survey has to be repeated because of low agreement, that is not a high-cost undertaking. Convergent and discriminant validity are evaluated based on correlations. In your case, however, these correlations may be driven more by how hard individual nursing students studied, how bright they are, etc., than by the design of the survey.
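If you do probe inter-rater agreement, a weighted kappa is one common choice for ordinal ratings. A minimal sketch for two raters with invented scores, using scikit-learn (extend to more raters with, e.g., pairwise kappas or an intraclass correlation):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical: two raters scoring the same items on a 1-4 scale
rater_a = np.array([4, 3, 2, 4, 1, 3, 4, 2])
rater_b = np.array([4, 2, 2, 3, 1, 3, 4, 3])

# Quadratic weights penalize large disagreements more heavily,
# which suits ordinal Likert-type ratings
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(round(kappa, 2))
```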
Q2b. Should the criterion validity of the scale be evaluated before the scale is used?
A2b. Criterion validity requires a criterion, and you do not have one (or you would not be undertaking this survey-type evaluation). The best that is usually done in this type of undertaking is content validity. (See, for example, the AERA/APA Standards, soon to be revised.)
Q3. Designing the questionnaire around the 21 statements of your national regulatory board.
A3. This is a great idea since it builds on an accepted statement of standards. Do you have alternatives to suggest?
Best Answer
There are other considerations for your Likert scales besides the number of categories.
Do you offer a neutral category? Compare "Strongly disagree - Disagree - Agree - Strongly agree" vs. "Strongly disagree - Disagree - No opinion - Agree - Strongly agree". The first scale forces a response as a by-product, which may or may not be appropriate.
Do you label the categories with numbers? If so, is the neutral category a zero? Compare "1 Don't like it at all / 2 / 3 / 4 / 5 Like it a lot" vs. "-2 Don't like it at all / -1 / 0 / 1 / 2 Like it a lot". The latter conveys the swing from negative to positive attitude, while the former does not.
If you provide text with the categories, are they equidistant? The Netflix 5-star scale sucks, in my opinion: it has only 1 star for the "Don't like" range, and between 3 and 5 stars for various degrees of "Like". Our department's teaching evaluations were like that, too: 1 for "Poor", 2 for "Adequate", 3 for "Good", 4 for "Excellent", 5 for "Outstanding", and you basically had to score 4 or above. That is where most of the inconsistencies in validation will likely come from, since the distance between 4 and 5 is not nearly the same as between 1 and 2.
Update: As far as reliability and validity are concerned, I am not sure what the standard practices are for Likert scales. You can probably present the analysis of both the moment (Pearson) covariance matrix and the polychoric correlation matrix, to demonstrate the factor structure, the reliability of individual items, and the composite reliability of the factor. The moment-based analysis will understate the reliability of the underlying continuous scales, as much work has shown; the polychoric correlations will get these continuous scales right, but that is not what you are measuring. So the true reliability of your measurement process is somewhere in between.

You can also demonstrate discriminant validity with these internal measurements (different factors correspond to different concepts, and hence their correlation is less than 1). To demonstrate external validity in its strongest form, you would need some additional behavioral variables coming from a substantive model: e.g., if a certain physical activity is "difficult" to do in old age, you would expect it to be done "rarely".
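For the moment-based side, the composite reliability (Cronbach's alpha) can be computed directly from the raw responses; here is a minimal sketch with invented data (the polychoric analysis needs dedicated software, such as R's psych package, and is not shown):

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for a respondents-by-items matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)      # per-item variances
    total_var = X.sum(axis=1).var(ddof=1)  # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical correlated 1-5 Likert responses: 200 respondents x 5 items
rng = np.random.default_rng(2)
base = rng.integers(1, 6, size=(200, 1))                # shared component
X = np.clip(base + rng.integers(-1, 2, size=(200, 5)), 1, 5)
print(round(cronbach_alpha(X), 2))
```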