Solved – Parametric tests and Likert Scales (Ordinal data) – Two different views

Tags: assumptions, likert, ordinal-data

The following articles reach quite different conclusions, and I am starting to believe there is no clear answer to this problem. Their conclusions are quoted below; the first author is responding to the second.
My question is: which approach is appropriate in the following situation? We want to analyse Likert scales in social research, MANOVA fits our research design (two or more DVs based on Likert scales), and we have N = 180, but we face these two contradictory opinions.

First Article:

Parametric statistics can be used with Likert data, with small sample
sizes, with unequal variances, and with non-normal distributions, with
no fear of "coming to the wrong conclusion". These findings are
consistent with empirical literature dating back nearly 80 years. The
controversy can cease (but likely won't).

  • Norman, Geoff. "Likert scales, levels of measurement and the “laws” of statistics." Advances in Health Sciences Education 15.5 (2010): 625-632.
    DOI: 10.1007/s10459-010-9222-y

Second article:

(…)the researcher should decide what level of measurement is in use
(to paraphrase, if it is an interval level, for a score of 3, one
should be able to answer the question "3 what?"); non-parametric tests
should be employed if the data is clearly ordinal, and if the
researcher is confident that the data can justifiably be classed as
interval, attention should nevertheless be paid to the sample size and
to whether the distribution is normal.

  • Jamieson, S. (2004). Likert scales: how to (ab)use them. Medical education, 38(12), 1217-1218.
    DOI: 10.1111/j.1365-2929.2004.02012.x

Best Answer

One way I approach this is not to take people's word for it, based on what appear to be either their beliefs or precedent, but to try it out and see whether, in your case, it matters in a way you care about.

Here's a simple example: a 5-point Likert scale with a uniform distribution, 100 people per group, and we'll do a two-sample t-test. I'll repeat this 10000 times when the null hypothesis is true (i.e. there is no difference).

> mean(sapply(1:10000, function(x) { 
    t.test(sample(1:5, 100, TRUE), sample(1:5, 100, TRUE))$p.value 
  } ) < 0.05)

[1] 0.0499

It appears that I get a significant value 4.99% of the time. Given that I expect a significant value 5% of the time, it does not appear that violating the assumptions of normality and interval measurement has had any effect on my results - at least in terms of type I errors. (There might be power issues, of course.)
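To put a rough number on the power question, here is a minimal sketch of my own (not part of the original answer): I assume one group's response probabilities are mildly shifted toward higher categories, and compare the rejection rates of the t-test and the Wilcoxon rank-sum test under that alternative. The shifted probabilities are illustrative assumptions, not anything from the articles.

# Rough power check under an assumed alternative: group 2 leans toward
# higher categories (illustrative probabilities, my own choice).
set.seed(1)
p2 <- c(0.10, 0.15, 0.20, 0.25, 0.30)
power_t <- mean(sapply(1:10000, function(x) {
  t.test(sample(1:5, 100, TRUE),
         sample(1:5, 100, TRUE, prob = p2))$p.value
}) < 0.05)
power_w <- mean(sapply(1:10000, function(x) {
  suppressWarnings(  # ties in Likert data rule out exact p-values
    wilcox.test(sample(1:5, 100, TRUE),
                sample(1:5, 100, TRUE, prob = p2))$p.value)
}) < 0.05)
c(t = power_t, wilcoxon = power_w)

If the two rates come out close, the measurement-level worry has little practical bite for this design; if the Wilcoxon test clearly wins, that is evidence the ordinal view matters here.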

If someone has a specific criticism, you can investigate and see if it's an issue.

Here's another example: Now I have 5 people in one group, and 100 in the other.

> mean(sapply(1:10000, function(x) { 
    t.test(sample(1:5, 5, TRUE), sample(1:5, 100, TRUE))$p.value 
  } ) < 0.05)

[1] 0.0733

Now I have a 7.3% type I error rate. This is probably enough to worry about.
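As a side check (my addition, not from the original answer), one might ask whether the nonparametric test the second article recommends holds its level better in this same unbalanced setup:

mean(sapply(1:10000, function(x) {
  suppressWarnings(  # ties in Likert data rule out exact p-values
    wilcox.test(sample(1:5, 5, TRUE), sample(1:5, 100, TRUE))$p.value)
}) < 0.05)

Whatever value this prints, comparing it against the 7.3% above tells you whether switching tests, rather than rebalancing groups, fixes the problem.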

What about 5 per group?

> mean(sapply(1:10000, function(x) { 
    t.test(sample(1:5, 5, TRUE), sample(1:5, 5, TRUE))$p.value 
  } ) < 0.05)

Now a 4.5% significance rate under the null - the test is slightly conservative, which implies a slight loss of power, but I prefer that (a lot) over an inflated type I error rate.
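Finally, the same simulation logic extends directly to the design in the question. A minimal sketch under assumptions of my own: 90 respondents per group (N = 180), two independent, uniformly distributed 5-point Likert DVs, and Pillai's trace from manova() under the null:

# Null simulation for the question's design: N = 180 (assumed 90 per
# group), two Likert DVs, type I error rate of MANOVA (Pillai's trace).
set.seed(1)
mean(sapply(1:10000, function(x) {
  g <- factor(rep(1:2, each = 90))
  y <- cbind(sample(1:5, 180, TRUE), sample(1:5, 180, TRUE))
  summary(manova(y ~ g))$stats[1, "Pr(>F)"]
}) < 0.05)

Real Likert DVs are usually correlated, so a more faithful check would simulate correlated ordinal responses (e.g. by thresholding correlated normals), but even this crude version shows whether MANOVA's type I error rate survives your particular N and scale.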