It's important to distinguish, as Nick Cox pointed out, between the IV (independent variable) and DV (dependent variable) cases. For a Likert DV, why not use an ordinal regression model, as discussed excellently by Agresti, for example: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470082895.html
I am less sure about the IV case. The standard approach would perhaps be dummy coding; I suppose this is what Frank Harrell means. Maybe Agresti discusses this as well.
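To make the dummy-coding idea concrete, here is a minimal R sketch on simulated data (the variable names and the 0.3 slope are my own illustrative choices, not from the answer). Wrapping a Likert predictor in `factor()` makes `lm()` create dummy (treatment) contrasts automatically, whereas `as.numeric()` forces equal spacing between the levels:

```r
set.seed(1)
# Hypothetical data: a 5-point Likert item used as a predictor (IV)
x <- factor(sample(1:5, 200, replace = TRUE))   # treated as categorical
y <- 0.3 * as.numeric(x) + rnorm(200)           # simulated outcome

fit_dummy  <- lm(y ~ x)              # intercept + 4 dummy coefficients
fit_linear <- lm(y ~ as.numeric(x))  # intercept + 1 slope (equal spacing)

# Since the linear fit is nested in the dummy fit, an F-test shows
# what the equal-spacing assumption costs (or buys) on these data:
anova(fit_linear, fit_dummy)
```

Comparing the two fits this way is one simple check of whether treating the item as interval-scaled is defensible for a given data set.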
One way I approach this is not to take people's word for it (which often rests on belief or precedent), but to try it out and see whether, in your case, it matters in a way you care about.
Here's a simple example: a 5-point Likert item with a uniform distribution, 100 people per group, and a two-sample t-test. I'll repeat this 10000 times when the null hypothesis is true (i.e., there is no difference).
> mean(sapply(1:10000, function(x) {
t.test(sample(1:5, 100, TRUE), sample(1:5, 100, TRUE))$p.value
} ) < 0.05)
[1] 0.0499
It appears that I get a significant result 4.99% of the time. Given that I expect a significant result 5% of the time under the null, it does not appear that violating the assumptions of normality and interval measurement has had any effect on my results, at least in terms of type I errors. (There might be power issues, of course.)
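The same simulation approach can check power as well as type I error: shift one group's response distribution and count rejections. A sketch in the same style as the code above (the shifted probabilities are an arbitrary choice of mine, not from the answer):

```r
set.seed(42)
# Power check: group 1 leans slightly toward higher categories,
# group 2 stays uniform; count how often the t-test rejects.
power <- mean(sapply(1:10000, function(i) {
  g1 <- sample(1:5, 100, TRUE, prob = c(0.15, 0.20, 0.20, 0.20, 0.25))
  g2 <- sample(1:5, 100, TRUE)
  t.test(g1, g2)$p.value
}) < 0.05)
power
```

With a shift this small the power will be modest; the point is that the same three lines answer "does it matter?" for power just as they did for type I error.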
If someone has a specific criticism, you can investigate and see if it's an issue.
Here's another example: Now I have 5 people in one group, and 100 in the other.
> mean(sapply(1:10000, function(x) { t.test(sample(1:5, 5, TRUE), sample(1:5, 100, TRUE))$p.value } ) < 0.05)
[1] 0.0733
Now I have a 7.33% type I error rate. This is probably enough to worry about.
What about 5 per group?
> mean(sapply(1:10000, function(x) { t.test(sample(1:5, 5, TRUE), sample(1:5, 5, TRUE))$p.value } ) < 0.05)
Now a 4.5% significance rate, which indicates a slightly conservative test and hence some loss of power, but I prefer that (a lot) over an inflated type I error rate.
Best Answer
Maybe too late, but I'll add my answer anyway...
It depends on what you intend to do with your data. If you are interested in showing that scores differ across groups of participants (gender, country, etc.), you may treat your scores as numeric values, provided they fulfill the usual assumptions about variance (or shape) and sample size. If you are instead interested in highlighting how response patterns vary across subgroups, then you should treat item scores as discrete choices among a set of answer options and look to log-linear modeling, ordinal logistic regression, item response models, or any other statistical model that can cope with polytomous items.
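For the ordinal route, here is a minimal sketch of one common choice, a proportional-odds ordinal logistic regression via `MASS::polr` (MASS ships with R; the data are simulated and the group/score names are my own):

```r
library(MASS)  # for polr()
set.seed(1)
grp   <- factor(rep(c("A", "B"), each = 100))     # hypothetical subgroup
score <- ordered(sample(1:5, 200, replace = TRUE)) # response must be an ordered factor

# Proportional-odds model: one slope for grp, plus 4 threshold parameters
fit <- polr(score ~ grp, Hess = TRUE)  # Hess = TRUE allows summary() to give SEs
summary(fit)
```

Note that `polr` requires the response to be an ordered factor; this is precisely the "discrete choice among ordered options" view of the item rather than the numeric-score view.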
As a rule of thumb, one generally considers that having 11 distinct points on a scale is sufficient to approximate an interval scale (for interpretation purposes, see @xmjx's comment). Likert items may be regarded as a true ordinal scale, but they are often used as numeric and we can compute their mean or SD. This is often done in attitude surveys, although it is wise to report both the mean/SD and the % of responses in, e.g., the two highest categories.
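Reporting both kinds of summary side by side is a one-liner in R; a sketch on simulated responses (the cutoff at category 4 corresponds to the "two highest categories" of a 5-point item):

```r
set.seed(1)
x <- sample(1:5, 200, replace = TRUE)  # hypothetical 5-point Likert responses

# Mean/SD (numeric view) plus % in the two highest categories (ordinal view)
round(c(mean = mean(x), sd = sd(x), pct_top2 = 100 * mean(x >= 4)), 2)
```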
When using summated scale scores (i.e., we add up the scores on each item to compute a "total score"), the usual statistics may be applied, but you have to keep in mind that you are now working with a latent variable, so the underlying construct should make sense! In psychometrics, we generally check that (1) the unidimensionality of the scale holds, and (2) scale reliability is sufficient. When comparing two such scale scores (from two different instruments), we might even consider using correlation measures corrected for attenuation instead of the classical Pearson correlation coefficient.
Classical textbooks include:
1. Nunnally, J.C. and Bernstein, I.H. (1994). Psychometric Theory (3rd ed.). McGraw-Hill Series in Psychology.
2. Streiner, D.L. and Norman, G.R. (2008). Health Measurement Scales. A practical guide to their development and use (4th ed.). Oxford.
3. Rao, C.R. and Sinharay, S., Eds. (2007). Handbook of Statistics, Vol. 26: Psychometrics. Elsevier Science B.V.
4. Dunn, G. (2000). Statistics in Psychiatry. Hodder Arnold.
You may also have a look at Applications of latent trait and latent class models in the social sciences, by Rost & Langeheine, and W. Revelle's website on personality research.
When validating a psychometric scale, it is important to look at so-called ceiling/floor effects (large asymmetry resulting from many participants scoring in the lowest/highest response category), which may seriously impact any statistics computed when treating scores as numeric variables (e.g., country aggregation, t-tests). This raises specific issues in cross-cultural studies, since it is known that overall response distributions in attitude or health surveys differ from one country to another (e.g., Chinese respondents vs. those from Western countries tend to exhibit distinct response patterns, the former generally having more extreme scores at the item level; see, e.g., Song, X.-Y. (2007). Analysis of multisample structural equation models with applications to Quality of Life data, in Handbook of Latent Variable and Related Models, Lee, S.-Y. (Ed.), pp. 279-302, North-Holland).
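Screening for ceiling/floor effects amounts to tabulating the share of respondents at the extremes; a sketch (the 15% cutoff is one rule of thumb sometimes suggested in the scale-validation literature, and the skewed probabilities are my own to force a visible ceiling):

```r
# Flag floor/ceiling effects as the % of respondents at the extreme categories
floor_ceiling <- function(x, lo = 1, hi = 5, cutoff = 15) {
  pct <- c(floor = 100 * mean(x == lo), ceiling = 100 * mean(x == hi))
  list(pct = pct, flagged = pct > cutoff)  # cutoff in %, a rule of thumb
}

set.seed(1)
# Simulated responses skewed toward the top category (a ceiling effect)
res <- floor_ceiling(sample(1:5, 200, replace = TRUE,
                            prob = c(0.05, 0.10, 0.15, 0.20, 0.50)))
res
```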
More generally, if you are interested in measurement issues you should look at the psychometric literature, which makes extensive use of Likert items. Various statistical models have been developed for them and are now gathered under the Item Response Theory (IRT) framework.