I'm assuming that the purpose of your analysis is to obtain evidence for the validity of your scale/instrument. Since your instrument was designed around 4 hypothesized constructs, you should approach this with confirmatory factor analysis (CFA). Exploratory factor analysis (EFA) is appropriate when there is no a priori theory describing the relationship between observed variables (i.e., items) and constructs, and it can yield uninterpretable factors, as you are seeing here.
Then examine the results of your CFA model. The various fit statistics (e.g., $\chi^2$, RMSEA, modification indices, Wald test statistics) can guide you through the refinement of your model.
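As an illustration, here is a minimal CFA sketch in Python using the semopy package; the file name, item names, and 4-factor assignment are hypothetical placeholders for your own:

```python
import pandas as pd
from semopy import Model, calc_stats

# Hypothetical data: one row per respondent, one column per item.
df = pd.read_csv("responses.csv")

# lavaan-style model description; replace with your 4 hypothesized constructs.
desc = """
F1 =~ item1 + item2 + item3
F2 =~ item4 + item5 + item6
F3 =~ item7 + item8 + item9
F4 =~ item10 + item11 + item12
"""

model = Model(desc)
model.fit(df)

print(calc_stats(model).T)  # chi-square, RMSEA, CFI, TLI, AIC/BIC, ...
print(model.inspect())      # loadings and other parameter estimates
```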
If you prefer a more exploratory approach, also consider a "backward search":
Chou, C.-P., & Bentler, P. M. (2002). Model modification in structural equation modeling by imposing constraints. Computational Statistics & Data Analysis, 41(2), 271–287.
I don't have any citations, but here's what I'd suggest:
Zeroth: If at all possible, split the data into a training and test set.
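For instance (a sketch; the file name and 50/50 split are arbitrary choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("responses.csv")  # one row per respondent, one column per item
df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)
```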
First, do EFA. Look at various solutions to see which ones make sense, given your knowledge of the questions. You have to do this before computing Cronbach's alpha, or you won't know which items go into which factor. (Running alpha on ALL the items together is probably not a good idea.)
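A sketch of this step with the factor_analyzer package, assuming the df_train from the split above; compare a few candidate factor counts and inspect the rotated loadings:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

for k in range(2, 6):  # candidate numbers of factors to compare
    fa = FactorAnalyzer(n_factors=k, rotation="oblimin")
    fa.fit(df_train)
    loadings = pd.DataFrame(fa.loadings_, index=df_train.columns)
    print(f"--- {k}-factor solution ---")
    print(loadings.round(2))
```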
Next, run alpha for each factor and delete items whose item-rest correlations are much poorer than the others' in that factor. I wouldn't set an arbitrary cutoff; I'd look for ones that are much lower than the rest, and check whether deleting them makes substantive sense.
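One way to do this, using pingouin for alpha; factor_items is a hypothetical set of items assigned to one factor in the previous step:

```python
import pingouin as pg

factor_items = ["item1", "item2", "item3", "item4"]  # hypothetical EFA assignment
alpha_full, _ = pg.cronbach_alpha(data=df_train[factor_items])

for item in factor_items:
    rest = [c for c in factor_items if c != item]
    item_rest_r = df_train[item].corr(df_train[rest].sum(axis=1))  # item-rest correlation
    alpha_without, _ = pg.cronbach_alpha(data=df_train[rest])
    print(f"{item}: item-rest r = {item_rest_r:.2f}, "
          f"alpha without it = {alpha_without:.2f} (full scale: {alpha_full:.2f})")
```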
Finally, choose items with a variety of "difficulty" levels, in the IRT sense.
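A full IRT fit (e.g., a graded response model) would give model-based item locations; as a quick proxy, classical item difficulty (the item mean) already shows whether the retained items span a range. A sketch, with hypothetical item names:

```python
# Classical difficulty as a rough stand-in for IRT difficulty:
# a low mean means the item is "hard" to endorse, a high mean "easy".
retained_items = ["item1", "item2", "item4"]  # hypothetical survivors of the alpha step
difficulty = df_train[retained_items].mean().sort_values()
print(difficulty)
# Aim to keep items spread across this range rather than clustered at one end.
```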
Then, if possible, redo this on the test set, but without doing any exploring. That is, see how well the result found on the training set holds up on the test set.
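For example, with the item assignment frozen from the training half (a sketch reusing the hypothetical names and pingouin import from above):

```python
# No exploring here: same items, same factor assignment, new data.
alpha_test, _ = pg.cronbach_alpha(data=df_test[factor_items])
print(f"alpha on the test set: {alpha_test:.2f} (training: {alpha_full:.2f})")
```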
From what I've seen so far, FA is used for attitude items much as it is for other kinds of rating scales. The problem arising from the metric used ("are Likert scales really to be treated as numeric scales?" is a long-standing debate; provided you check for a bell-shaped response distribution you may handle them as continuous measurements, otherwise look at non-linear FA models or optimal scaling) may also be handled by polytomous IRT models, like the Graded Response Model (GRM), Rating Scale Model (RSM), or Partial Credit Model (PCM). The latter two can be used as a rough check of whether the threshold distances used in Likert-type items are a characteristic of the response format (RSM) or of the particular item (PCM).
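To make the distinction concrete: in the GRM, the probability that respondent $j$ with latent trait $\theta_j$ answers item $i$ in category $k$ or higher is
$$P(X_{ij} \ge k \mid \theta_j) = \frac{1}{1 + \exp\{-a_i(\theta_j - b_{ik})\}},$$
with category probabilities obtained as differences of adjacent cumulative probabilities. The PCM instead models adjacent categories,
$$\log\frac{P(X_{ij} = k)}{P(X_{ij} = k-1)} = \theta_j - \delta_{ik},$$
and the RSM is the special case $\delta_{ik} = \delta_i + \tau_k$, i.e. the same threshold spacing $\tau_k$ for every item. Since the RSM is nested in the PCM, a likelihood-ratio test between the two fits provides the rough check mentioned above.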
Regarding your second point, it is known, for example, that response distributions in attitude or health surveys differ from one country to another (e.g., Chinese respondents tend to show more 'extreme' response patterns than respondents from Western countries; see e.g. Song, X.-Y. (2007). Analysis of multisample structural equation models with applications to Quality of Life data. In S.-Y. Lee (Ed.), Handbook of Latent Variable and Related Models (pp. 279–302). North-Holland). Several methods exist to handle such situations.
Now, the point is that most of these approaches focus on the item level (ceiling/floor effects, decreased reliability, poor item fit statistics, etc.), but when one is interested in how people deviate from what would be expected from an ideal set of respondents, I think we must focus on person fit indices instead.
Such $\chi^2$ statistics are readily available for IRT models, like the INFIT and OUTFIT mean squares, but generally they apply to the whole questionnaire. Moreover, since the estimation of item parameters relies in part on person parameters (e.g., in the marginal likelihood framework, we assume a Gaussian distribution for the latent trait), the presence of outlying individuals may lead to biased estimates and poor model fit.
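For reference, in the dichotomous Rasch case with expected score $E_{ij} = P_{ij}$ and variance $W_{ij} = P_{ij}(1 - P_{ij})$, the standardized residual is $z_{ij} = (x_{ij} - E_{ij})/\sqrt{W_{ij}}$, and person $j$'s fit statistics over $I$ items are
$$\mathrm{OUTFIT}_j = \frac{1}{I}\sum_{i=1}^{I} z_{ij}^2, \qquad \mathrm{INFIT}_j = \frac{\sum_{i=1}^{I} W_{ij}\, z_{ij}^2}{\sum_{i=1}^{I} W_{ij}}.$$
Both have expectation 1 under the model; OUTFIT is dominated by unexpected responses on items far from the person's location, while INFIT weights the residuals by their information.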
As proposed by Eid and Zickar (2007), combining a latent class model (to isolate groups of respondents, e.g. those always answering in the extreme categories vs. the others) with an IRT model (to estimate item parameters and person locations on the latent trait within each group) appears to be a nice solution. Other modeling strategies are described in their paper (e.g. the HYBRID model; see also Holden and Book, 2009).
Likewise, unfolding models may be used to cope with response style, which is defined as a consistent, content-independent pattern of response category use (e.g. a tendency to endorse the extreme categories regardless of item content). In the social sciences and psychological literature, this is known as Extreme Response Style (ERS). References (1–3) may be useful to get an idea of how it manifests itself and how it can be measured.
Here is a short list of papers that may help you make progress on this subject: