Q1. 4 or 5 point scale (strongly disagree to strongly agree with or without a neutral midpoint)
A1. Whether to use an even or odd number of scale points is not a question with a definitive answer; there are arguments on both sides. Since you want a yes-no answer, a 4-point scale may be better suited to your purpose than a scale with a neutral midpoint.
Parenthetically, I would suggest that a convincing evaluation would additionally include a detailed evaluation of the actual content of each final exam question, such as: "Do you think a minimally competent nurse would answer this question incorrectly?" (This is the type of approach developed by Angoff at ETS many years ago; see this secondary source, for example.) Global opinions are open to halo and other types of bias.
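As an illustrative sketch of an Angoff-style standard-setting step (the ratings below are invented, not from any real panel): each expert estimates the probability that a minimally competent nurse answers each item correctly, the estimates are averaged per item, and the item averages are summed into a cut score.

```python
# Hypothetical Angoff-style cut-score calculation with made-up data.
expert_ratings = [            # one row per expert, one column per item
    [0.90, 0.60, 0.80],
    [0.80, 0.50, 0.70],
    [0.85, 0.55, 0.90],
]

n_items = len(expert_ratings[0])
# Average each item's probability estimates across experts.
item_means = [sum(row[i] for row in expert_ratings) / len(expert_ratings)
              for i in range(n_items)]
# The cut score is the expected total score of a borderline candidate.
cut_score = sum(item_means)
print(f"cut score: {cut_score:.2f} of {n_items}")
```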
Q2a. Should the reliability and convergent and discriminant validity be evaluated before the scale is used?
A2a. Inter-rater agreement might be evaluated after the data are collected. If inter-rater agreement is low, the reasons can be probed, and if the survey has to be repeated as a result, it is not a high-cost undertaking. Convergent and discriminant validity are evaluated based on correlations. However, in your case, these correlations may be driven more by how hard individual nursing students studied, how bright they are, etc., than by the design of the survey.
Q2b. Should the criterion validity of the scale be evaluated before the scale is used?
A2b. Criterion validity requires a criterion and you do not have one (or you would not be undertaking this survey type evaluation). The best that is usually done in this type of undertaking is content validity. (See for example, the AERA/APA Standards, soon to be revised.)
Q3. Designing the questionnaire around the 21 statements of your national regulatory board.
A3. This is a great idea since it builds on an accepted statement of standards. Do you have alternatives to suggest?
Many psychological tests convert numeric raw scores into categories. For example, Wikipedia mentions cut-offs for the Beck Depression Inventory:
- 0–9: indicates minimal depression
- 10–18: indicates mild depression
- 19–29: indicates moderate depression
- 30–63: indicates severe depression
Or, for example, BMI classifications define various cut-offs (e.g., Cole et al., 2007).
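As a quick illustration of score-to-category conversion, here is a minimal sketch implementing the BDI cut-offs quoted above (the function name is mine, not part of the BDI):

```python
# Hypothetical sketch: map a raw Beck Depression Inventory score
# to the cut-off categories quoted above.
def bdi_category(score: int) -> str:
    """Return the depression category for a raw BDI score (0-63)."""
    if not 0 <= score <= 63:
        raise ValueError("BDI scores range from 0 to 63")
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"

print(bdi_category(7))   # minimal
print(bdi_category(25))  # moderate
```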
In general, you lose information by collapsing categories or using cut-offs. Psychological reality tends to be more continuous. That said, categories do have heuristic value as decision aides.
A few options for converting scores to a collapsed set of categories:
- Use the logical definition of the scale points: For example, you might treat "strongly agree" as "highly productive", "agree" as "productive", and the other categories as "not productive". This simple approach uses the scale anchor points to define the meaning of the categories.
- Use expert judgements: You can ask a set of experts to evaluate where they think the cut-offs between categories should be, and then synthesise their judgements. This approach is often used to define acceptable standards for various tests.
- Use normative information: You could use information about the normative spread of the variable and an assumption about the prevalence of the phenomena to define cut-offs.
- Use prediction of external criterion: If the construct has an objective referent, or there are related external variables, you could fit predictive models of this external criterion to define the categories.
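The first option, using the logical definition of the scale points, can be sketched as a simple lookup (the labels and grouping here are illustrative, not prescriptive):

```python
# Hypothetical collapsing of a 5-point agreement scale into three
# productivity categories; the mapping itself is an assumption.
LIKERT_TO_CATEGORY = {
    "strongly agree": "highly productive",
    "agree": "productive",
    "neutral": "not productive",
    "disagree": "not productive",
    "strongly disagree": "not productive",
}

responses = ["agree", "strongly agree", "neutral", "disagree"]
collapsed = [LIKERT_TO_CATEGORY[r] for r in responses]
print(collapsed)
```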
References
- Cole, T. J., Flegal, K. M., Nicholls, D., & Jackson, A. A. (2007). Body mass index cut offs to define thinness in children and adolescents: international survey. BMJ, 335(7612), 194.
Best Answer
Given that you have used a Likert scale, comparing means is a plausible way of seeing whether the two apps differ. But because you only have 5 people in each group, you shouldn't use the t test: the sample is not large enough to assume that the mean will be normally distributed.
If the participants rated both apps, then your best bet is the Wilcoxon signed-rank test, which is like a paired t test but with looser assumptions.
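A minimal sketch of the Wilcoxon signed-rank test using scipy, assuming each participant rated both apps on the same item (the ratings below are made up; with only 5 pairs, scipy may warn that the sample is small):

```python
# Illustrative paired comparison of two apps' Likert ratings.
from scipy.stats import wilcoxon

app_a = [4, 3, 5, 2, 4]  # ratings of app A, one per participant
app_b = [2, 2, 3, 1, 3]  # ratings of app B by the same participants

# Tests whether the paired differences are centred on zero.
stat, p = wilcoxon(app_a, app_b)
print(f"W = {stat}, p = {p:.3f}")
```

Note that with 5 pairs the smallest achievable two-sided exact p-value is 2/32 ≈ 0.06, which is exactly the power problem the answer warns about.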
If you had wanted a ranking, you could have told the participants to rank the two apps, forcing them to choose which one they prefer. You could then simply compare the result to the 50/50 even split that random choice would produce.
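That comparison against a 50/50 split is an exact binomial test; a sketch with made-up counts (the 9-of-10 split below is invented for illustration):

```python
# Illustrative forced-choice comparison against a 50/50 null.
from scipy.stats import binomtest

n_prefer_a = 9   # hypothetical: 9 of 10 participants prefer app A
n_total = 10

# Two-sided exact test of the observed split against p = 0.5.
result = binomtest(n_prefer_a, n_total, p=0.5)
print(f"p = {result.pvalue:.4f}")
```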
Some comments if I may: With 15 items, chances are you're going to get something that falsely appears significant. I genuinely think you should select only the one item that best compares the two apps and run the test on that. There are non-parametric multivariate tests (i.e., they let you compare many means at once even with a small sample), but they get pretty complicated. Honestly, you might not even need a hypothesis test; why not just use some summary statistics?
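The summary-statistics alternative can be as simple as reporting the median rating per item for each app (item names and ratings below are illustrative):

```python
# Hypothetical per-item summary instead of hypothesis tests.
from statistics import median

items = {
    "ease of use": {"app_a": [4, 5, 3, 4, 4], "app_b": [2, 3, 2, 3, 2]},
    "usefulness":  {"app_a": [3, 4, 4, 3, 5], "app_b": [3, 3, 4, 2, 3]},
}

for item, ratings in items.items():
    med_a = median(ratings["app_a"])
    med_b = median(ratings["app_b"])
    print(f"{item}: app A median = {med_a}, app B median = {med_b}")
```

Medians and ranges suit ordinal Likert data better than means, and with n = 5 per group they communicate the picture without overstating certainty.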