Q1. 4 or 5 point scale (strongly disagree to strongly agree with or without a neutral midpoint)
A1. Whether to use an even or odd number of scale points is not a matter with a definitive answer; there are arguments on both sides. Since you want a yes-no answer, the 4-point scale may be better suited to your purpose than a scale with a neutral midpoint.
Parenthetically, I would suggest that a convincing evaluation would additionally include a detailed review of the actual content of each final exam question, asking, for example: "Do you think a minimally competent nurse would answer this question incorrectly?" (This is the type of approach developed by Angoff at ETS many years ago; see this secondary source, for example.) Global opinions are open to halo effects and other types of bias.
Q2a. Should the reliability and convergent and discriminant validity be evaluated before the scale is used?
A2a. Inter-rater agreement might be evaluated after the data is collected. If inter-rater agreement is low, the reasons can be probed. If it turns out the survey has to be repeated due to low agreement, that is not a high-cost undertaking. Convergent and discriminant validity are evaluated based on correlations. However, in your case, these correlations may be driven more by how hard individual nursing students studied, how bright they are, and so on, than by the design of the survey.
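As one illustration of checking agreement after the fact, a chance-corrected statistic such as Cohen's kappa can be computed for two raters. This is a minimal sketch; the two rating vectors below are hypothetical, not from any real survey:

```python
# Minimal sketch of checking inter-rater agreement after data collection.
# The rating vectors are hypothetical 4-point responses (1 = strongly
# disagree ... 4 = strongly agree).
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters were independent, from their marginals
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / n**2
    return (observed - expected) / (1 - expected)

a = [1, 2, 2, 3, 4, 4, 3, 2, 1, 4]
b = [1, 2, 3, 3, 4, 3, 3, 2, 2, 4]
print(round(cohen_kappa(a, b), 3))  # 0.6 for these hypothetical ratings
```

Values near 1 indicate strong agreement beyond chance; values near 0 indicate agreement no better than chance. For more than two raters, statistics such as Fleiss' kappa are the usual extension.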
Q2b. Should the criterion validity of the scale be evaluated before the scale is used?
A2b. Criterion validity requires a criterion, and you do not have one (or you would not be undertaking this survey-type evaluation). The best that is usually done in this type of undertaking is content validity. (See, for example, the AERA/APA Standards, soon to be revised.)
Q3. Designing the questionnaire around the 21 statements of your national regulatory board.
A3. This is a great idea since it builds on an accepted statement of standards. Do you have alternatives to suggest?
What about one of Kendall's $\tau$s? They are rank correlation coefficients suitable for ordinal data.
Here's an example with Stata and $\tau_{b}$. A value of $-1$ implies perfect negative association, $+1$ perfect positive association, and $0$ the absence of association. Here we see a modest, though significant, negative association between speed limits and accident rates.
. webuse hiway, clear
(Minnesota Highway Data, 1973)

. tab spdlimit rate, taub

           |    Accident rate per million
     Speed |          vehicle miles
     Limit |   Below 4        4-7    Above 7 |     Total
-----------+---------------------------------+----------
        40 |         1          0          0 |         1
        45 |         1          1          1 |         3
        50 |         1          4          2 |         7
        55 |        10          4          1 |        15
        60 |         9          2          0 |        11
        65 |         1          0          0 |         1
        70 |         1          0          0 |         1
-----------+---------------------------------+----------
     Total |        24         11          4 |        39

          Kendall's tau-b =  -0.4026  ASE = 0.116
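If you want to check this figure outside Stata, the same $\tau_b$ can be reproduced in Python with SciPy. A sketch, re-entering the contingency table above by hand and expanding it to one row per highway:

```python
# Reproducing Stata's tau-b with SciPy on the hiway contingency table.
from scipy.stats import kendalltau

table = {  # speed limit -> counts in rate categories (Below 4, 4-7, Above 7)
    40: (1, 0, 0), 45: (1, 1, 1), 50: (1, 4, 2),
    55: (10, 4, 1), 60: (9, 2, 0), 65: (1, 0, 0), 70: (1, 0, 0),
}
spdlimit, rate = [], []
for limit, counts in table.items():
    for cat, k in enumerate(counts, start=1):
        spdlimit += [limit] * k   # one entry per highway
        rate += [cat] * k

tau, p = kendalltau(spdlimit, rate, variant="b")
print(round(tau, 4))  # -0.4026, matching the Stata output
```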
You can also try an asymmetric modification of $\tau_{b}$ that corrects only for ties on the independent variable. This is called Somers' D:
. somersd rate spdlimit

Somers' D with variable: rate
Transformation: Untransformed
Valid observations: 39

Symmetric 95% CI

------------------------------------------------------------------------------
             |              Jackknife
        rate |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    spdlimit |  -.4727723   .1395719    -3.39   0.001    -.7463282   -.1992163
------------------------------------------------------------------------------
All these measures of association are related in that they classify all pairs of observations (highways in our example) as concordant or discordant. A pair is concordant if the observation with the larger value of variable $X$ (speed limit) also has the larger value of variable $Y$ (accident rate). There are more of these measures than you can shake a stick at: one more is Goodman and Kruskal's $\gamma$, which drops tied pairs altogether, while $\tau_{a}$ includes them but applies no ties correction. They will generally yield similar conclusions, even if they are not directly comparable.
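The pair-counting logic can be made explicit. A sketch over the same 39 highways, deriving $\tau_a$, $\tau_b$, $\gamma$, and Somers' D from the counts of concordant, discordant, and tied pairs:

```python
# Classifying all n(n-1)/2 pairs of highways as concordant, discordant,
# or tied, then deriving the various measures from the same counts.
from itertools import combinations
from math import sqrt

data = []  # (spdlimit, rate category) pairs, expanded from the table above
for limit, counts in {40: (1, 0, 0), 45: (1, 1, 1), 50: (1, 4, 2),
                      55: (10, 4, 1), 60: (9, 2, 0), 65: (1, 0, 0),
                      70: (1, 0, 0)}.items():
    for cat, k in enumerate(counts, start=1):
        data += [(limit, cat)] * k

C = D = tie_x = tie_y = 0
for (x1, y1), (x2, y2) in combinations(data, 2):
    if x1 == x2:
        tie_x += 1                         # pair tied on speed limit
    if y1 == y2:
        tie_y += 1                         # pair tied on rate category
    if x1 != x2 and y1 != y2:
        if (x1 - x2) * (y1 - y2) > 0:
            C += 1                         # concordant
        else:
            D += 1                         # discordant

n0 = len(data) * (len(data) - 1) // 2      # all 741 pairs
tau_a = (C - D) / n0                                   # no ties correction
tau_b = (C - D) / sqrt((n0 - tie_x) * (n0 - tie_y))    # corrects for both
gamma = (C - D) / (C + D)                              # drops tied pairs
somers = (C - D) / (n0 - tie_y)                        # corrects for rate ties only
print(round(tau_b, 4), round(somers, 4))  # -0.4026 -0.4728, as in the Stata output
```

Every statistic here has the same numerator, $C - D$; they differ only in how the denominator treats tied pairs.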
The results above are qualitatively in line with Spearman's rank correlation coefficient mentioned by Greg (which tends to be larger in absolute value than $\tau$):
. ci2 rate spdlimit, spearman
Confidence interval for Spearman's rank correlation
of rate and spdlimit, based on Fisher's transformation.
Correlation = -0.451 on 39 observations (95% CI: -0.671 to -0.158)
This measure does not classify pairs; instead it compares the orderings you would get by ranking the observations on each variable separately (Stata breaks ties by assigning the average rank, and the statistic is just the Pearson correlation of the ranks). This makes it somewhat faster to compute, since you don't have to consider all $\frac{n \cdot (n-1)}{2}$ pairs. On the other hand, the sampling distribution of $\tau$ approaches normality much faster, so if you plan to do inference, that measure may be better. $\tau_b$ is the most common variant.
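As a cross-check, SciPy's `spearmanr` (which likewise assigns average ranks to ties) reproduces the Stata figure on the same re-entered data:

```python
# Spearman's rho on the same 39 highways: average ranks for ties,
# then Pearson correlation on the ranks.
from scipy.stats import spearmanr

table = {40: (1, 0, 0), 45: (1, 1, 1), 50: (1, 4, 2),
         55: (10, 4, 1), 60: (9, 2, 0), 65: (1, 0, 0), 70: (1, 0, 0)}
spdlimit, rate = [], []
for limit, counts in table.items():
    for cat, k in enumerate(counts, start=1):
        spdlimit += [limit] * k
        rate += [cat] * k

rho, p = spearmanr(spdlimit, rate)
print(round(rho, 3))  # -0.451, matching the ci2 output
```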
Best Answer
It depends on how much data you have, how much tolerance you have for complexity, and how much interest in accuracy. Some will say that treating everything as continuous is A-OK (generally also assuming normal distributions if correlations are to be interpreted substantively or tested against a null hypothesis), but others insist this is improper, and the latter group is the more technically correct. Since you say this is telephone survey data, it seems plausible that you have hundreds or even thousands of observations. If so, you may have sufficient statistical power to do things the "right" way.
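To see why the purists have a point, here is a purely illustrative simulation (the correlation, cutpoints, and sample size are invented): coarsening a normally distributed trait into a 5-point Likert-style variable attenuates its Pearson correlation with another variable, mildly when the cutpoints are symmetric and substantially when responses pile up in a few categories.

```python
# Illustrative simulation: discretizing a continuous trait into Likert
# categories attenuates Pearson correlations with other variables.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
latent = rng.standard_normal(n)                       # the "true" trait
other = 0.6 * latent + 0.8 * rng.standard_normal(n)   # true correlation 0.6

# 5-point scale with symmetric (arbitrary) cutpoints: categories 1..5
likert = np.digitize(latent, [-1.5, -0.5, 0.5, 1.5]) + 1
# Skewed cutpoints, so most respondents land in the bottom category
skewed = np.digitize(latent, [0.5, 1.0, 1.5, 2.0]) + 1

r_latent = np.corrcoef(latent, other)[0, 1]   # ~0.60
r_likert = np.corrcoef(likert, other)[0, 1]   # mildly attenuated
r_skew = np.corrcoef(skewed, other)[0, 1]     # strongly attenuated
print(round(r_latent, 3), round(r_likert, 3), round(r_skew, 3))
```

With symmetric cutpoints the damage is small, which is why treating Likert sums as continuous often "works"; with skewed response distributions, the pretence becomes costly.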
The right way to begin is by developing a measurement model for your latent variable. Five statements rated on a common five-point Likert scale is just barely enough to satisfy the more lenient rules of thumb for deciding whether one can get away with applying classical test theory (CTT) assumptions, for example that the ratings behave like continuous, normally distributed measures and that all items reflect the latent variable equally well. Where those assumptions do not hold, CTT can fail.
If you have at least a couple hundred observations, some interest in the statistical process, and want to improve the validity and richness of your results, try fitting a rating scale model to your five items. This is an item response theory model that assumes ratings of all items use the same scale (hence it uses the same threshold estimates for all items) and are influenced by the same latent variable (which is probably all you can estimate with five items). It can be used to generate a continuously distributed estimate of the latent variable that accommodates the ordinal nature of Likert ratings, uses only the common variance in your items (thereby excluding any item-specific measurement error), and weighs the items according to how much each has in common with the others.
You can produce factor scores for individuals using a rating scale model and then correlate those with your other variable, or you can fit a structural equation model that estimates the correlation as well as all of the items' thresholds, loadings, and unique variances, and the entire model's goodness of fit. For more info and some other alternatives, see "Factor analysis of questionnaires composed of Likert items" and "Regression testing after dimension reduction". The best choice will depend on the nature of the other variable you want to correlate your latent construct with, which, as far as I can tell, you haven't specified...and again, on how large your sample is. Complex models take more data because they estimate more parameters, but they can provide more valid, precise estimates and tell you much more about your data. The worst-case scenario would probably be having data that violate CTT assumptions and too little of it to fit the appropriate model, so at least check that this doesn't apply.