What is the rationale of applying an exploratory/unsupervised method (PCA or FA with VARIMAX rotation) after having tested a confirmatory model, especially if this is done on the same sample?
In your CFA model, you impose constraints on your pattern matrix, e.g. some items are supposed to load on one factor but not on the others. A large modification index indicates that freeing a parameter or removing an equality constraint could result in better model fit. Item loadings are already available from your fitted model.
In contrast, PCA and FA impose no such constraints, even after an orthogonal rotation (whose purpose is simply to make the factors more interpretable, in that each item then tends to load heavily on one factor rather than on several). It is also worth noting that these models are conceptually and mathematically different: the FA model is a measurement model, in which we assume there is some unique error attached to each item; this is not the case in the PCA framework. It is thus not surprising that you failed to replicate your factor structure. This may indicate item cross-loadings, low item reliability, low stability of the factor structure, or the existence of a higher-order factor structure, all of which are exacerbated by your small sample size.
In both cases, but especially for CFA, $N=96$ is a very limited sample size. Although some authors have suggested an individuals-to-items ratio of 5:1 to 10:1, it is really the number of dimensions that matters. In your case, the parameter estimates will be noisy, and in the case of PCA you may expect fluctuations in the estimated loadings (try the bootstrap to get an idea of the 95% CIs).
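As a minimal sketch of the bootstrap idea, using simulated 1-5 Likert responses in place of your actual data (the sample size and item count mirror the question, but the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 96 respondents, 10 Likert-type items (illustrative only).
n, p = 96, 10
X = rng.integers(1, 6, size=(n, p)).astype(float)

def first_loading(data):
    """First principal-component direction of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    vals, vecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    v = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    return v if v.sum() >= 0 else -v       # fix the sign for comparability

# Resample respondents with replacement and recompute the loadings each time.
boot = np.array([
    first_loading(X[rng.integers(0, n, size=n)])
    for _ in range(1000)
])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)  # 95% percentile CI per item
```

With real scale data rather than random noise, the width of these intervals gives a direct sense of how unstable loadings are at this sample size.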
Your method #1 loses information by dichotomizing in two different ways. I'd instead look at each item's correlation with the sum of all the other items (in software such as SPSS this is called "Corrected Item-Total Correlation"). For #2, where you have done something close to this, you could make a case for either Spearman's or Pearson's, and they'll hardly differ since with a 1-5 per-item range there shouldn't be many extreme outliers. You'll have to establish your own threshold, I'm afraid: how exacting do you want to be? How desirable is it to preserve a large number of items for your scales? And how concerned are you about your case-to-item ratio?
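The corrected item-total correlation mentioned above is straightforward to compute by hand: correlate each item with the sum of all the *other* items, so the item does not inflate its own total. A minimal sketch with simulated data (the responses are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical responses: 96 respondents x 10 items on a 1-5 scale.
X = rng.integers(1, 6, size=(96, 10)).astype(float)

total = X.sum(axis=1)
# Corrected item-total correlation: item j vs. the total excluding item j.
citc = np.array([
    np.corrcoef(X[:, j], total - X[:, j])[0, 1]
    for j in range(X.shape[1])
])
```

Items with low values are weak candidates for the scale; where to draw the cutoff is the subjective threshold decision discussed above.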
As for your questions about factor analysis: yes, building scales based on empirical criteria can be defensible, just as it can be to do so using a priori ideas about which item belongs to which dimension. Good research will hopefully reconcile any conflicts between the two. Items with multiple high loadings are a problem if you want uncorrelated factors, something that is often unrealistic in opinion research. At a more general level, I think you have some sense that factor analysis and scale development are best seen as a largely creative process in which there are many subjective decisions to be made and often much work to do in justifying them!
Best Answer
This is quite common and I have seen it many times before. The reverse-coded items cluster together because they share a common methodological feature. If you measured the same construct using a self-report questionnaire and a physiological measure, for example, you would find that the self-report and physiological indices load on different factors because they are different methods, despite measuring the same construct. I would simply describe the reverse-worded factor as what it is: a factor comprising the reverse-worded items, which share a similar response pattern.
Edit: I will also mention that recoding the reverse-worded items will not make this go away. If some items are reverse-worded while the rest are worded straightforwardly, the reverse-worded items will likely correlate strongly with one another simply because of their shared wording.
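The point can be illustrated with a small simulation: give every item the same latent trait, but add a shared "method" component to the reverse-worded items only. Even after recoding (negating, here), the reverse-worded items correlate more with each other than with the straight items. Everything below is simulated and illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

trait = rng.normal(size=n)     # the construct all six items measure
method = rng.normal(size=n)    # variance shared only by reverse-worded items

# Three straightforward items and three reverse-worded items.
straight = np.column_stack(
    [trait + rng.normal(size=n) for _ in range(3)]
)
reverse = np.column_stack(
    [-trait + method + rng.normal(size=n) for _ in range(3)]
)

# "Recode" the reverse-worded items by negating them, then correlate everything.
Xr = np.column_stack([straight, -reverse])
R = np.corrcoef(Xr, rowvar=False)

within_reversed = R[3:, 3:][np.triu_indices(3, 1)].mean()  # among reversed items
cross = R[:3, 3:].mean()                                   # reversed vs. straight
```

Here `within_reversed` exceeds `cross`: the recoding restores the sign of the trait loading but cannot remove the shared method variance, which is what produces the separate factor.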